How to Read the Contents of a Table in an MS Word File Using Python

How to read the contents of a table in an MS Word file using Python?

Here is what works for me in Python 2.7:

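import win32com.client as win32
word = win32.Dispatch("Word.Application")
word.Visible = False  # keep the Word window hidden
# Pass the full path: a relative name is resolved by Word, not by Python's working directory
doc = word.Documents.Open("MyDocument")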

To see how many tables your document has:

doc.Tables.Count

Then you can select the table you want by its index. Note that, unlike Python, COM indexing starts at 1:

table = doc.Tables(1)

To select a cell:

table.Cell(Row=1, Column=1)

To get its content:

table.Cell(Row=1, Column=1).Range.Text
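
Word ends each cell's Range.Text with an end-of-cell marker (a carriage return followed by a BEL character), so the raw string usually needs cleaning before you compare or print it. A minimal sketch of such a helper; the name clean_cell_text is made up for this example and is not part of pywin32 or Word:

```python
def clean_cell_text(raw_text):
    # Remove the BEL character (\x07) that marks the end of a cell,
    # then trim the carriage return and any other surrounding whitespace.
    return raw_text.replace("\x07", "").strip()
```

You would call it as clean_cell_text(table.Cell(Row=1, Column=1).Range.Text).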

Hope that this helps.

EDIT:

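An example of a function that returns a column's index based on its heading:

def Column_index(header_text):
    # COM collections are 1-based, so loop from 1 to Columns.Count inclusive
    for i in range(1, table.Columns.Count + 1):
        # Range.Text ends with Word's end-of-cell marker ("\r\x07"); strip it before comparing
        if table.Cell(Row=1, Column=i).Range.Text.rstrip("\r\x07") == header_text:
            return i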

Then you can access the cell you want, for example:

table.Cell(Row=1, Column=Column_index("The Column Header")).Range.Text
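
If you want the whole table rather than a single cell, the same Cell calls can be wrapped in two loops. This is only a sketch under assumptions: table_to_rows is a made-up name, `table` is expected to be a Word COM Table such as doc.Tables(1), and the table is assumed to be a regular grid (with merged cells, Cell() may raise for positions that do not exist).

```python
def table_to_rows(table):
    """Return the table as a list of rows, each a list of cell strings."""
    rows = []
    # COM collections are 1-based, hence the ranges starting at 1
    for r in range(1, table.Rows.Count + 1):
        row = []
        for c in range(1, table.Columns.Count + 1):
            raw = table.Cell(Row=r, Column=c).Range.Text
            # Drop Word's trailing end-of-cell marker (carriage return + BEL)
            row.append(raw.rstrip("\r\x07"))
        rows.append(row)
    return rows
```

The first inner list then holds the header row, so table_to_rows(table)[0] gives you the column headings.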

Python: How to read a table from Word when it is in a text box?

I wrote a solution using another Python package, docx2python.

from docx2python import docx2python
import pandas as pd

# word_document_path and table_number are placeholders for your own values
doc = docx2python(word_document_path)
doc_body = doc.body
table = doc_body[table_number]
table = pd.DataFrame(table)
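
One thing to watch for: docx2python nests its output as tables, rows, cells, then paragraphs, so each cell arrives as a list of paragraph strings rather than a single string. A small sketch that joins the paragraphs per cell before the data goes into a DataFrame; flatten_cells is a hypothetical helper name, not a docx2python function:

```python
def flatten_cells(table):
    """Join each cell's list of paragraphs into one string.

    `table` is one entry of the docx2python body: a list of rows,
    each row a list of cells, each cell a list of paragraph strings.
    """
    return [["\n".join(cell).strip() for cell in row] for row in table]
```

With that in place, pd.DataFrame(flatten_cells(doc_body[table_number])) holds plain strings instead of lists.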

How do you read a table from a certain part of a Word document using python-docx?

The code here may be of interest: https://github.com/python-openxml/python-docx/issues/276#issuecomment-199502885.

What you're looking for, I believe, is a way to iterate the block level items in a document, in the order they appear. A Word document has two types of block-level items, paragraphs and tables. The function at the link above allows you to iterate those in document order.
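
Once you can walk the blocks in document order, picking the table in "a certain part" comes down to remembering which paragraph you have already passed. A rough sketch, assuming `blocks` is whatever the function at the link above yields (python-docx Paragraph and Table objects); first_table_after is a made-up name, and it tells the two types apart only by their attributes:

```python
def first_table_after(blocks, marker_text):
    """Return the first table that follows a paragraph whose text equals marker_text."""
    seen_marker = False
    for block in blocks:
        # Table objects expose .rows; Paragraph objects expose .text
        if hasattr(block, "rows"):
            if seen_marker:
                return block
        elif getattr(block, "text", "").strip() == marker_text:
            seen_marker = True
    # No matching paragraph, or no table after it
    return None
```

Matching on the exact paragraph text is the simplest choice; you could compare the paragraph style instead if the section is marked by a heading.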

Reading Table Content in Headers and Footers in an MS Word File Using Python

Accessing headers and footers is a bit tricky. Here is how to do it (index 1 selects the primary header or footer, wdHeaderFooterPrimary):

HeaderTable = doc.Sections(1).Headers(1).Range.Tables(1)
FooterTable = doc.Sections(1).Footers(1).Range.Tables(1)

You can get the table count this way:

HeaderTablesCount = doc.Sections(1).Headers(1).Range.Tables.Count
FooterTablesCount = doc.Sections(1).Footers(1).Range.Tables.Count

And get the text from cells this way:

HeaderTable.Cell(1,1).Range.Text
FooterTable.Cell(1,1).Range.Text

How to extract a Word table from multiple files using python-docx

You are reinitializing the data list to [] (empty) for every document. So you carefully collect the row-data from a document and then in the next step throw it away.

If you move data = [] outside the loop, then after iterating through the documents it will contain all the extracted rows.

data = []

for name in filenames:
    ...
    data.append(row_data)

print(data)
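
To make the difference concrete, here is a small sketch of the same accumulation pattern with stand-in data, where each table is already a list of rows of strings with the header row first. Both function names are hypothetical and only illustrate where the list is created:

```python
def rows_to_records(rows):
    """Turn [[header...], [values...], ...] into a list of dicts keyed by header."""
    if not rows:
        return []
    keys = tuple(rows[0])
    return [dict(zip(keys, row)) for row in rows[1:]]


def collect_records(tables):
    """Accumulate the records of every table into one list."""
    data = []  # created once, before the loop, so nothing is thrown away
    for rows in tables:
        data.extend(rows_to_records(rows))
    return data
```

If data = [] sat inside the for loop instead, only the records of the last table would survive.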

Using python-docx to extract a table from a Word docx file

Your code works fine for me. How about inserting it into a dataframe?

import pandas as pd
from docx.api import Document

document = Document('test_word.docx')
table = document.tables[0]

data = []

keys = None
for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)

    # The first row holds the column headers
    if i == 0:
        keys = tuple(text)
        continue
    # Pair the cells of every other row with the headers
    row_data = dict(zip(keys, text))
    data.append(row_data)
print(data)

df = pd.DataFrame(data)

How can I display a particular row and column in that table?
We can extract rows and columns by index with iloc:

# iloc[row, column]
# python-docx returns cell text as str, so the values are strings
df.iloc[0, :].tolist()   # ['5', '6', '7', '8'] - row index 0
df.iloc[:, 0].tolist()   # ['5', '9', '13', '17'] - column index 0
df.iloc[0, 0]            # '5' - cell (0, 0)
df.iloc[1:, 2].tolist()  # ['11', '15', '19'] - column index 2, but skip first row

and so on...

However, if your columns have names (here the names are the strings "1" to "4", because python-docx returns all cell text as str), you can select a column by name:

# df["name"].tolist()
df["1"].tolist() # ['5', '9', '13', '17'] - column named "1"

print(df)

prints the following, which is how the table looks in my sample doc:

    1   2   3   4
0   5   6   7   8
1   9  10  11  12
2  13  14  15  16
3  17  18  19  20

How to extract text data in a table created in a docx document

Try using the python-docx module instead:

pip install python-docx

import docx

doc = docx.Document("document.docx")

for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)
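
The nested loop prints every cell on its own line, which loses the row structure. If you would rather keep it, a sketch along these lines groups the text by row; table_to_lists is a hypothetical helper name, and it relies only on the .rows, .cells and .text attributes that python-docx tables provide:

```python
def table_to_lists(table):
    """Return the text of every cell as a list of rows, each a list of strings."""
    return [[cell.text.strip() for cell in row.cells] for row in table.rows]
```

You could then loop over doc.tables and print table_to_lists(table) for each one.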

