How to Extract Text from an Existing Docx File Using Python-Docx

How to extract text from under headings in a docx file using python

There is a way to do this.

After extracting the contents, also treat paragraphs that have the "Normal" style but are entirely bold as headings. Apply this rule carefully so that bold text inside an ordinary paragraph, used only to highlight an important term, is not mistaken for a heading.

You can do this by scanning each paragraph and iterating over its runs to check whether all of them are bold. If every run in a "Normal" paragraph is bold, you can conclude that the paragraph is a heading.

To apply this logic, use the code below while iterating over the paragraphs of your document:

# Iterate over paragraphs
for paragraph in document.paragraphs:

    # Apply the logic below only to paragraphs whose native style is not a "Heading" style
    if "Heading" not in paragraph.style.name:

        # Start by initializing an empty string to collect the bold text of the runs
        runboldtext = ''

        # Iterate over all runs of the current paragraph and collect the bold text in "runboldtext"
        for run in paragraph.runs:
            if run.bold:
                runboldtext = runboldtext + run.text

        # Check whether "runboldtext" matches the entire paragraph text. If it does, all the text in the paragraph is bold and it can be treated as a heading
        if runboldtext == str(paragraph.text) and runboldtext != '':
            print("Heading True for the paragraph: ", runboldtext)
            style_of_current_paragraph = 'Heading'
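
To go from detecting headings to collecting the text under them, the same check can drive a simple grouping loop. The following is only a minimal sketch: `is_heading` and `group_by_heading` are hypothetical helper names, and it assumes each paragraph object exposes `style.name`, `runs` and `text` the way python-docx paragraphs do.

```python
def is_heading(paragraph):
    """Return True for native headings and for all-bold paragraphs."""
    if "Heading" in paragraph.style.name:
        return True
    # Concatenate the text of every run that is explicitly bold.
    bold_text = "".join(run.text for run in paragraph.runs if run.bold)
    return bold_text != "" and bold_text == paragraph.text


def group_by_heading(paragraphs):
    """Map each heading's text to the list of paragraph texts below it."""
    sections = {}
    current = None  # text before the first heading is skipped
    for paragraph in paragraphs:
        if is_heading(paragraph):
            current = paragraph.text
            sections.setdefault(current, [])
        elif current is not None and paragraph.text:
            sections[current].append(paragraph.text)
    return sections
```

With python-docx this could be called as `group_by_heading(docx.Document('your_file.docx').paragraphs)`, where the file name is a placeholder. Headings with identical text end up in the same entry, so a list of pairs may suit documents with repeated headings better.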

python: find numbers in docx file and replace

You can use the python-docx library (imported as docx) to read the content of .docx files.

pip install python-docx

Adapting some code from here and combining it with the code you posted, I got:

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

text = getText('Doc1.docx')

a = [int(s) for s in text.split() if s.isdigit()]

This worked for me with a simple test file, although you may need to adjust some parts depending on how you want the search for numbers to work.
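
Splitting on whitespace misses numbers that touch punctuation, such as `12,` or `(7)`. A regular expression is one way around that. The sketch below works on the extracted text only, and the helper names `find_numbers` and `replace_numbers` are hypothetical. Writing the changed text back into the .docx file is not covered here.

```python
import re

NUMBER = re.compile(r"\d+")  # one or more consecutive digits


def find_numbers(text):
    """Return every run of digits in text as an int."""
    return [int(match) for match in NUMBER.findall(text)]


def replace_numbers(text, replacement):
    """Replace every run of digits in text with replacement."""
    return NUMBER.sub(replacement, text)


sample = "Order 12, shelf (7) and 300 items."
print(find_numbers(sample))          # [12, 7, 300]
print(replace_numbers(sample, "#"))  # Order #, shelf (#) and # items.
```

This pattern treats a decimal such as `3.5` as two separate numbers, so it would need adjusting if decimals or signs matter.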

how to extract text from docx files contained in different folders

You can use glob.glob to get a list of all files in the subdirectories:

import glob
import docx

# python-docx can only open .docx files, so legacy .doc files are not collected
files = glob.glob('/path/to/mainfolder/**/*.docx', recursive=True)

with open('your_file.txt', 'w') as f:
    for file in files:
        document = docx.Document(file)
        for paragraph in document.paragraphs:
            if paragraph.text:
                f.write("%s\n" % paragraph.text)
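
The file search could also be written with pathlib. This is a sketch under one assumption: while a document is open, Word keeps a temporary lock file next to it whose name starts with `~$`. Such a file matches `*.docx` but is not a real document, so it is skipped here. The function name `find_docx_files` is hypothetical.

```python
from pathlib import Path


def find_docx_files(root):
    """Return all .docx files below root, skipping Word lock files."""
    return sorted(
        path
        for path in Path(root).rglob("*.docx")
        if path.is_file() and not path.name.startswith("~$")
    )
```

Each returned path can be passed to `docx.Document` as `docx.Document(str(path))`.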

Python3 Docx get text between 2 paragraphs

If you loop over all paragraphs and print each paragraph's text, you get the document text as is, but a single p.text in your loop does not contain the full document's text.

It works with a string:

t = """Foo :

The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.

Bar :"""

import re

fread = re.search(r'Foo\s*:(.*)\s*Bar', t)

print(fread) # None - because dots do not match \n

fread = re.search(r'Foo\s*:(.*)\s*Bar', t, re.DOTALL)

print(fread)
print(fread[1])

Output:

<_sre.SRE_Match object; span=(0, 115), match='Foo :\n\nThe foo is not easy, but we have to do i>


The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.

If you use

for p in reader.paragraphs:
    print("********")
    print(p.text)
    print("********")

you see why your regex won't match. Your regex would work on the whole document's text.

See How to extract text from an existing docx file using python-docx for how to get the whole document's text.

You could also look for a paragraph that matches r'Foo\s*:', then put the text of all following paragraphs into a list until you hit a paragraph that matches r'\s*Bar'.
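
That paragraph-by-paragraph approach can be sketched as follows. The function name `text_between` is hypothetical, and the input is assumed to be a plain list of paragraph texts, for example `[p.text for p in reader.paragraphs]`.

```python
import re


def text_between(paragraph_texts, start_pattern, end_pattern):
    """Return the paragraph texts found between two marker paragraphs."""
    collected = []
    inside = False
    for text in paragraph_texts:
        if not inside:
            # Look for the opening marker; the marker itself is not kept.
            if re.match(start_pattern, text):
                inside = True
        elif re.match(end_pattern, text):
            break  # closing marker reached
        else:
            collected.append(text)
    return collected


paragraphs = ["Intro", "Foo :", "The foo is not easy.", "", "Bar :", "Outro"]
print(text_between(paragraphs, r"Foo\s*:", r"\s*Bar"))
# ['The foo is not easy.', '']
```

If the closing marker never appears, everything after the opening marker is returned. Text that follows the opening marker inside the same paragraph is dropped.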


