Extracting Text from MS Word Files in Python

Best way to extract text from a Word doc without using COM/automation?

I use catdoc or antiword for this, whichever gives the result that is easiest to parse. I have embedded this in Python functions, so it is easy to use from the parsing system (which is written in Python).

import subprocess

def doc_to_text_catdoc(filename):
    # os.popen3 was removed in Python 3; subprocess.run replaces it.
    # Passing the arguments as a list means the filename needs no shell quoting.
    proc = subprocess.run(["catdoc", "-w", filename],
                          capture_output=True, text=True)
    if proc.stderr:
        raise OSError("Executing the command caused an error: %s" % proc.stderr)
    return proc.stdout

# similar doc_to_text_antiword()

The -w switch to catdoc turns off line wrapping, BTW.

extracting text from MS word files in python

You could make a subprocess call to antiword. Antiword is a Linux command-line utility for dumping text out of a Word doc. It works pretty well for simple documents (obviously it loses formatting). It's available through apt, and probably as an RPM, or you could compile it yourself.
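A minimal sketch of such a subprocess call, assuming antiword is installed and on the PATH. The function name and the `command` parameter are made up for illustration; the parameter is only there so another converter can be swapped in.

```python
import subprocess

def doc_to_text_antiword(filename, command=("antiword",)):
    """Return whatever `command filename` writes to stdout.

    `command` defaults to plain antiword; it is a hypothetical hook
    for swapping in a different converter.
    """
    result = subprocess.run(
        [*command, filename],
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Calling doc_to_text_antiword("report.doc") should then return the document text as one string, or raise if antiword exits with an error.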

How to extract text from a word document in python? (and put the data in df)

You can use re.search().

If your document is a str, try the following code.

import re

value_match = re.search('Euro __ (.*)excl', document)
value = value_match.group(1).strip()

date_match = re.search('Date:(.*)', document)
date = date_match.group(1).strip()

print(f"Value: {value}, Date: {date}")

Output:

Value: 191,250.00, Date: 24th June 2021
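Note that re.search() returns None when a pattern is not found, so .group(1) would raise an AttributeError on a document that lacks one of the fields. A slightly more defensive sketch, using a made-up sample string shaped to fit the two patterns (the function name is hypothetical):

```python
import re

def extract_fields(document):
    """Return the value and date found in `document`, or None where a pattern fails."""
    value_match = re.search(r"Euro __ (.*)excl", document)
    date_match = re.search(r"Date:(.*)", document)
    return {
        "value": value_match.group(1).strip() if value_match else None,
        "date": date_match.group(1).strip() if date_match else None,
    }

# Made-up sample text; the real document layout is an assumption
sample = "Total: Euro __ 191,250.00 excl. VAT\nDate: 24th June 2021"
print(extract_fields(sample))
# {'value': '191,250.00', 'date': '24th June 2021'}
```

To get the data into a df, a list of such dicts (one per document) can be passed straight to pandas.DataFrame.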

python - extract text from microsoft word

You may try first reading all document/paragraph text into a single string, and then using re.findall to find all matching text in between the target tags:

import re

# `document` is assumed to be a python-docx Document that is already open
text = "\n".join(para.text for para in document.paragraphs)

matches = re.findall(r'-- ASN1START\s*(.*?)\s*-- ASN1STOP', text, flags=re.DOTALL)

Note that we use re.DOTALL mode with the regex so that .*? can match content between the tags that spans multiple lines.
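To see what the flag changes, here is a small self-contained comparison on a made-up string (the ASN.1 snippets are invented for the example):

```python
import re

# Invented sample: one block spanning two lines, one block on a single line
text = (
    "intro\n"
    "-- ASN1START\nFoo ::= INTEGER\nBar ::= BOOLEAN\n-- ASN1STOP\n"
    "middle\n"
    "-- ASN1START\nBaz ::= NULL\n-- ASN1STOP\n"
)
pattern = r'-- ASN1START\s*(.*?)\s*-- ASN1STOP'

with_dotall = re.findall(pattern, text, flags=re.DOTALL)
without_dotall = re.findall(pattern, text)

print(with_dotall)     # ['Foo ::= INTEGER\nBar ::= BOOLEAN', 'Baz ::= NULL']
print(without_dotall)  # ['Baz ::= NULL']
```

Without the flag, the two-line block is silently skipped because . stops at the first newline.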

Extraction of text page by page from MS word docx file using python

I found that the Tika library has an xmlContent option when reading the file. I used it to get the content as XML and then split that on the page markers with plain string operations. Below is the Python code that worked for me.

from tika import parser

def docx_pages(file):  # function name is illustrative
    # xmlContent=True makes Tika return XHTML instead of plain text
    raw_xml = parser.from_file(file, xmlContent=True)
    body = raw_xml['content'].split('<body>')[1].split('</body>')[0]
    body_without_tag = body
    for tag in ("<p>", "</p>", "<div>", "</div>", "<p />"):
        body_without_tag = body_without_tag.replace(tag, "")
    text_pages = body_without_tag.split('<div class="page">')[1:]
    num_pages = len(text_pages)
    if num_pages == int(raw_xml['metadata']['xmpTPg:NPages']):  # check if it worked correctly
        return text_pages
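The splitting step can be tried without Tika on a hand-written string. This is only a sketch: the sample XHTML below is invented to resemble the structure the code above relies on (a body containing one div with class "page" per page), and the tag stripping uses a simple regex instead of the chained replace calls.

```python
import re

def split_pages(xhtml):
    """Split Tika-style XHTML into a list of plain-text pages (hypothetical helper)."""
    body = xhtml.split("<body>")[1].split("</body>")[0]
    # Everything before the first page marker is discarded
    chunks = body.split('<div class="page">')[1:]
    # Remove any remaining tags, then trim whitespace on each page
    return [re.sub(r"<[^>]+>", "", chunk).strip() for chunk in chunks]

# Invented sample, not real Tika output
sample = (
    "<html><body>"
    '<div class="page"><p>First page</p></div>'
    '<div class="page"><p>Second page</p><p /></div>'
    "</body></html>"
)
print(split_pages(sample))  # ['First page', 'Second page']
```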

Extracting text from MS Word Document uploaded through FileUpload from ipyWidgets in Jupyter Notebook

Modern MS Word files (.docx) are actually zip files.

The text (but not the page headers) is actually inside an XML document called word/document.xml in the zip file.
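As a rough illustration of that layout, the text can be pulled out with nothing but the standard library. This is a simplified sketch, not a replacement for a real parser: it joins the w:t text elements of each w:p paragraph and ignores tabs, line breaks, headers and footers. The function name is made up.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(source):
    """Return the text of each paragraph in word/document.xml.

    `source` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more <w:t> elements
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return paragraphs
```

Because it walks every w:p element, this also picks up paragraphs that sit inside table cells.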

The python-docx module can be used to extract text from these documents. It is mainly used for creating documents, but it can read existing ones. For example:

>>> import docx
>>> doc = docx.Document('grokonez.docx')
>>> fullText = []
>>> for paragraph in doc.paragraphs:
...     fullText.append(paragraph.text)
...

Note that this will only extract the text from paragraphs, not, for example, the text from tables.
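Table text can be collected separately by walking doc.tables. A small sketch (the helper name is made up; it only relies on the tables, rows, cells and text attributes that python-docx exposes):

```python
def table_text(doc):
    """Collect the text of every table cell, one list of strings per row."""
    rows = []
    for table in doc.tables:
        for row in table.rows:
            rows.append([cell.text for cell in row.cells])
    return rows
```

With a document opened via docx.Document, table_text(doc) would return something like [['Name', 'Qty'], ['Bolt', '4']] for a two-row table. Tables nested inside cells are not visited by this sketch.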

Edit:

I want to be able to upload the MS file through the FileUpload widget.

There are a couple of ways you can do that.

First, isolate the actual file data. upload.data is a list holding the raw bytes of each uploaded file, see here. So do something like:

rawdata = upload.data[0]

(Note that this format has changed between versions of ipywidgets. The above example is from the documentation of the latest version. Read the relevant version of the documentation, or investigate the data in IPython, and adjust accordingly.)

One option is to write rawdata to e.g. foo.docx and open that. That would certainly work, but it does seem somewhat inelegant.
Alternatively, docx.Document can work with file-like objects, so you could create an io.BytesIO object and use that.

Like this:

import io
import docx

foo = io.BytesIO(rawdata)
doc = docx.Document(foo)
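For completeness, the other route (writing the data to a file first) might look like the sketch below. The helper name is made up, and a temporary file is used instead of a fixed name like foo.docx so that repeated uploads do not overwrite each other.

```python
import tempfile

def save_upload(rawdata, suffix=".docx"):
    """Write uploaded bytes to a temporary file and return its path."""
    # delete=False keeps the file on disk after the with-block closes it
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as handle:
        handle.write(bytes(rawdata))  # bytes() also accepts a memoryview
    return handle.name
```

The returned path can then be handed to docx.Document; remember to delete the file with os.remove when you are done with it.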

Extracting MS Word document formatting elements along with raw text information

The character-level formatting ("font") properties are available at the run level. A paragraph is made up of runs. So you can get what you want by going down to that level, like:

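# `document` is a docx.Document that has already been opened
for paragraph in document.paragraphs:
    for run in paragraph.runs:
        font = run.font
        is_bold = font.bold
        # ... and so on for font.italic, font.underline, font.size, etc.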

The biggest problem you're likely to encounter with that is that the run only knows about formatting that's been directly applied to it. If it looks the way it does because a style has been applied to it, you would have to query the style (which also has a font object) to see what properties it has.
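A rough sketch of that style lookup for bold, assuming the usual python-docx behaviour that font.bold is None when the property is inherited instead of set directly. The helper is hypothetical and deliberately simplified: it checks the run, then the run's character style, then the paragraph style, following base_style links, and it ignores document-level defaults.

```python
def effective_bold(run, paragraph):
    """Best-effort guess at whether a run renders bold (hypothetical helper)."""
    # Formatting applied directly to the run wins
    if run.font.bold is not None:
        return run.font.bold
    # Otherwise look at the character style, then the paragraph style,
    # walking up each style's inheritance chain
    for style in (run.style, paragraph.style):
        while style is not None:
            if style.font.bold is not None:
                return style.font.bold
            style = style.base_style
    # Nothing set anywhere we looked: assume not bold
    return False
```

It would be called inside the loop above as effective_bold(run, paragraph).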

Note that the python-docx that Mike was talking about is the legacy version which was completely rewritten after v0.2.0 (now 0.8.6). Docs are here: http://python-docx.readthedocs.org/en/latest/