How to Search a Word in a Word 2007 .Docx File

How can I search a word in a Word 2007 .docx file?

More exactly, a .docx document is a Zip archive in OpenXML format: you have first to uncompress it.

I downloaded a sample (Google: some search term filetype:docx) and after unzipping I found some folders. The word folder contains the document itself, in file document.xml.

How to search multiple DOCX files for a string within a Word field?

This script should accomplish what you are trying to do. Let me know if that isn't the case. I don't usually write entire scripts because it can hurt the learning process, so I have commented each command so that you might learn from it.

#!/bin/sh

# Create ~/tmp/WORDXML folder if it doesn't exist already
mkdir -p ~/tmp/WORDXML

# Change directory to ~/tmp/WORDXML
cd ~/tmp/WORDXML

# Iterate through each file passed to this script
for FILE in $@; do
{
# unzip it into ~/tmp/WORDXML
# 2>&1 > /dev/null discards all output to the terminal
unzip $FILE 2>&1 > /dev/null

# find all of the xml files
find -type f -name '*.xml' | \

# open them in xmllint to make them pretty. Discard errors.
xargs xmllint --recover --format 2> /dev/null | \

# search for and report if found
grep 'getProposalTranslations' && echo " [^ found in file '$FILE']"

# remove the temporary contents
rm -rf ~/tmp/WORDXML/*

}; done

# remove the temporary folder
rm -rf ~/tmp/WORDXML

Save the script wherever you like. Name it whatever you like. I'll name it docxfind. Make it executable by running chmod +x docxfind. Then you can run the script like this (assuming your terminal is running in the same directory): ./docxfind filenames...

Search word in word documents and print out the file name that contains that word?

There may be a logic issue in your code.

Try this update:

import os
import docx2txt
os.chdir('C:/Users/epicr/Desktop/Python Stuff/LAB FILES')
text= ''
files = []
for file in os.listdir('C:/Users/epicr/Desktop/Python Stuff/LAB FILES'):
if file.endswith('.docx'):
files.append(file)
for i in range(len(files)):
text = docx2txt.process(files[i]) # text for single file
if 'VENTILATION RATIO' in text:
print (i, files[i]) # file index and name

How to use python-docx to replace text in a Word document and save

UPDATE: There are a couple of paragraph-level functions that do a good job of this and can be found on the GitHub site for python-docx.

  1. This one will replace a regex-match with a replacement str. The replacement string will appear formatted the same as the first character of the matched string.
  2. This one will isolate a run such that some formatting can be applied to that word or phrase, like highlighting each occurence of "foobar" in the text or perhaps making it bold or appear in a larger font.

The current version of python-docx does not have a search() function or a replace() function. These are requested fairly frequently, but an implementation for the general case is quite tricky and it hasn't risen to the top of the backlog yet.

Several folks have had success though, getting done what they need, using the facilities already present. Here's an example. It has nothing to do with sections by the way :)

for paragraph in document.paragraphs:
if 'sea' in paragraph.text:
print paragraph.text
paragraph.text = 'new text containing ocean'

To search in Tables as well, you would need to use something like:

for table in document.tables:
for row in table.rows:
for cell in row.cells:
for paragraph in cell.paragraphs:
if 'sea' in paragraph.text:
paragraph.text = paragraph.text.replace("sea", "ocean")

If you pursue this path, you'll probably discover pretty quickly what the complexities are. If you replace the entire text of a paragraph, that will remove any character-level formatting, like a word or phrase in bold or italic.

By the way, the code from @wnnmaw's answer is for the legacy version of python-docx and won't work at all with versions after 0.3.0.



Related Topics



Leave a reply



Submit