Reading/Writing MS Word Files in Python

Reading/Writing MS Word files in Python

I'd look into IronPython, which has direct access to the Windows/Office APIs because it runs on the .NET runtime.

Extracting text from MS Word files in Python

You could make a subprocess call to antiword. Antiword is a Linux command-line utility for dumping the text out of a Word .doc file. It works pretty well for simple documents (obviously it loses formatting). It's available through apt, and probably as an RPM, or you could compile it yourself.
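A minimal sketch of that subprocess call, assuming Python 3 and that antiword is installed and on your PATH. The function name `doc_to_text` and the `command` parameter are my own, and decoding the output as UTF-8 is an assumption, since antiword's output encoding depends on how it is configured:

```python
import subprocess


def doc_to_text(path, command=("antiword",)):
    """Run antiword on a .doc file and return its plain-text output."""
    result = subprocess.run(
        list(command) + [path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if antiword exits with an error
    )
    # Assumption: the output is UTF-8; undecodable bytes are replaced, not fatal
    return result.stdout.decode("utf-8", errors="replace")
```

You would call it as `doc_to_text("report.doc")`, where `report.doc` is a placeholder file name. If antiword is not installed, `subprocess.run` raises `FileNotFoundError`.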

How to read the contents of a table in an MS Word file using Python?

Here is what works for me in Python 2.7:

import win32com.client as win32
word = win32.Dispatch("Word.Application")
word.Visible = False  # keep the Word window hidden
# Word usually needs the full (absolute) path to the file here
word.Documents.Open("MyDocument")
doc = word.ActiveDocument

To see how many tables your document has:

doc.Tables.Count

Then you can select the table you want by its index. Note that, unlike Python, COM indexing starts at 1:

table = doc.Tables(1)

To select a cell:

table.Cell(Row=1, Column=1)

To get its content:

table.Cell(Row=1, Column=1).Range.Text
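One thing to watch for: the text Word returns for a table cell ends with an end-of-cell marker (a carriage return followed by the BEL character), so it rarely equals the string you see on screen. A small sketch of stripping it; the helper name `clean_cell_text` is my own, and the literal string stands in for a real `Range.Text` value:

```python
# End-of-cell marker that Word appends to Range.Text for every table cell
# (carriage return followed by the BEL character).
CELL_END_MARKER = "\r\x07"


def clean_cell_text(raw_text):
    """Return the cell text without Word's trailing end-of-cell marker."""
    if raw_text.endswith(CELL_END_MARKER):
        return raw_text[:-len(CELL_END_MARKER)]
    return raw_text


# A literal string standing in for table.Cell(Row=1, Column=1).Range.Text
print(repr(clean_cell_text("Total\r\x07")))  # prints 'Total'
```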

Hope that this helps.

EDIT:

An example of a function that returns Column index based on its heading:

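def Column_index(header_text):
    # Word appends an end-of-cell marker ("\r\x07") to every cell's text
    for i in range(1, table.Columns.Count + 1):
        cell_text = table.Cell(Row=1, Column=i).Range.Text
        if cell_text.rstrip("\r\x07") == header_text:
            return i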

Then you can access the cell you want like this, for example:

table.Cell(Row=1, Column=Column_index("The Column Header")).Range.Text
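Building on the cell access above, here is a rough sketch that reads a whole table into a list of dicts keyed by the header row. The function name `table_to_dicts` is hypothetical. It assumes the first row holds the headers and that the table has no merged cells, since `Cell()` raises an error for coordinates that a merge has swallowed:

```python
def table_to_dicts(table):
    """Read a Word table (COM object) into a list of dicts keyed by header."""

    def text_at(row, column):
        # COM indices start at 1; drop the trailing end-of-cell marker
        raw = table.Cell(Row=row, Column=column).Range.Text
        return raw[:-2] if raw.endswith("\r\x07") else raw

    n_rows = table.Rows.Count
    n_cols = table.Columns.Count
    # Row 1 is assumed to be the header row
    headers = [text_at(1, c) for c in range(1, n_cols + 1)]
    return [
        {headers[c - 1]: text_at(r, c) for c in range(1, n_cols + 1)}
        for r in range(2, n_rows + 1)
    ]
```

With the `doc` object from earlier you would call `table_to_dicts(doc.Tables(1))`. Each cell read is a separate COM call, so this can be slow on large tables.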

How to read docx originated from Word templates with python-docx?

python-docx only finds paragraphs and tables at the top level of the document. In particular, paragraphs or tables "wrapped" in a "container" element are not detected.

Most commonly, the "container" is a pending (not yet accepted) revision, and its contents are skipped in the same way.

To extract the "wrapped" text, you'll need to know what the "wrapper" elements are. One way to do that is by dumping the XML of the document body:

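from docx import Document

document = Document("my-document.docx")
# _body is a private attribute, so it may change between python-docx versions
print(document._body._body.xml)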

A paragraph element has a w:p tag, so you can inspect the output and look for those. I expect some of them will be inside another element.

Then you can extract those elements with XPath expressions, something like this, which would work if the "wrapper" element was <w:x>:

from docx.text.paragraph import Paragraph

body = document._body._body
ps_under_xs = body.xpath("w:x//w:p")
for p in ps_under_xs:
    paragraph = Paragraph(p, None)
    print(paragraph.text)

You could also just get all the <w:p> elements in the document, regardless of their "parentage", with something like this:

ps = body.xpath(".//w:p")

The drawback of this is that some containers (like unaccepted revision marks) can contain text that has been "deleted" from the document, so you might get more than what you wanted.
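If you'd rather not depend on python-docx internals, here is a standard-library sketch of the same "grab every w:p" idea. This is a different technique from the snippets above, not python-docx's own API: it reads word/document.xml straight out of the .docx zip. The function name `paragraph_texts` is my own. It collects only w:t elements, so text in tracked deletions (stored as w:delText) is left out; it also ignores tabs and line breaks, and text-box paragraphs nested inside another paragraph would be counted in both:

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, i.e. what the w: prefix stands for
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraph_texts(docx_file):
    """Return the text of every w:p in the document, however deeply nested."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    texts = []
    for p in root.iter(W + "p"):
        # w:t holds visible run text; tracked deletions use w:delText instead,
        # so deleted text is skipped here
        texts.append("".join(t.text or "" for t in p.iter(W + "t")))
    return texts
```

`docx_file` can be a path or a file-like object, for example `paragraph_texts("my-document.docx")`.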

In any case, this general approach should work for the job you've described. You can find more about XPath expressions by searching online if you need something more sophisticated.

Read and Write .docx file with python

from docx import *
document = opendocx("document.doc")
body = document.xpath('/w:document/w:body', namespaces=nsprefixes)[0]
body.append(paragraph('Appending this.'))

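The XPath lookup (the line that assigns body) may need to change depending on where in the file you want to append the text. To finish, you will need to save the result with the savedocx() function; there is an example of its usage in the root of the project. Note that opendocx(), paragraph() and savedocx() belong to the legacy python-docx API (before version 0.3.0), so this snippet will not run against later releases.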
