Best way to extract text from a Word doc without using COM/automation?
I use catdoc or antiword for this, whichever gives the result that is easiest to parse. I have wrapped this in Python functions, so it is easy to use from the parsing system (which is written in Python).
import subprocess

def doc_to_text_catdoc(filename):
    # os.popen3 was removed in Python 3; subprocess.run replaces it.
    # Passing a list avoids shell quoting problems with the filename.
    result = subprocess.run(["catdoc", "-w", filename],
                            capture_output=True, text=True)
    if result.stderr:
        raise OSError("Executing the command caused an error: %s" % result.stderr)
    return result.stdout

# similar doc_to_text_antiword()
The -w switch to catdoc turns off line wrapping, BTW.
extracting text from MS word files in python
You could make a subprocess call to antiword. Antiword is a Linux command-line utility for dumping text out of a Word doc. It works pretty well for simple documents (obviously it loses formatting). It's available through apt, and probably as an RPM, or you could compile it yourself.
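A minimal sketch of such a subprocess call, assuming antiword is installed and on the PATH. The helper name run_text_extractor is made up for this example; doc_to_text_antiword just hands it the antiword command line.

```python
import subprocess

def run_text_extractor(args):
    """Run an external command and return its standard output as text.

    run_text_extractor is a hypothetical helper name, not part of any library.
    """
    # A list of arguments avoids shell quoting problems with odd filenames.
    result = subprocess.run(args, capture_output=True, text=True)
    if result.returncode != 0:
        raise OSError("Command failed: %s" % result.stderr.strip())
    return result.stdout

def doc_to_text_antiword(filename):
    # Assumes the antiword binary is installed and on the PATH.
    return run_text_extractor(["antiword", filename])
```

If the binary is missing, subprocess.run raises FileNotFoundError, which is itself a subclass of OSError, so callers can catch both cases the same way.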
How to extract text from a word document in python? (and put the data in df)
You can use re.search(). If your document is of type str, try out the following code.
import re
value_match = re.search('Euro __ (.*)excl', document)
value = value_match.group(1).strip()
date_match = re.search('Date:(.*)', document)
date = date_match.group(1).strip()
print(f"Value: {value}, Date: {date}")
Output:
Value: 191,250.00, Date: 24th June 2021
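re.search() returns None when a pattern is not found, so calling .group() directly can fail on documents that lack a field. Below is a small sketch that guards against that and collects the fields into a dict. The sample text and the extract_field helper are invented for this example.

```python
import re

def extract_field(pattern, document):
    """Return the stripped first capture group, or None if the pattern is absent."""
    match = re.search(pattern, document)
    return match.group(1).strip() if match else None

# Sample text invented for this example; the real document will differ.
document = "Total: Euro __ 191,250.00 excl. VAT\nDate: 24th June 2021\n"

row = {
    "value": extract_field(r"Euro __ (.*)excl", document),
    "date": extract_field(r"Date:(.*)", document),
    "missing": extract_field(r"Invoice no:(.*)", document),  # not in the sample
}
print(row)
```

A list of such dicts, one per document, can then be passed to pandas.DataFrame to get the data into a df.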
python - extract text from microsoft word
You may try first reading all document/paragraph text into a single string, and then using re.findall to find all matching text in between the target tags:
import re

# `document` is assumed to be a python-docx Document that is already open
text = ""
for para in document.paragraphs:
    text += para.text + "\n"
matches = re.findall(r'-- ASN1START\s*(.*?)\s*-- ASN1STOP', text, flags=re.DOTALL)
Note that we use DOTALL mode with the regex to ensure that .* can match content in between the tags which occurs across newlines.
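To see what the flag changes, here is a small self-contained comparison on an invented sample string (the ASN.1 lines are made up). Without re.DOTALL the dot stops at newlines, so only blocks whose content fits on one line are found.

```python
import re

# Sample text invented for this example.
text = (
    "intro\n"
    "-- ASN1START\n"
    "Foo ::= INTEGER\n"
    "Bar ::= BOOLEAN\n"
    "-- ASN1STOP\n"
    "middle\n"
    "-- ASN1START\n"
    "Baz ::= NULL\n"
    "-- ASN1STOP\n"
)
pattern = r'-- ASN1START\s*(.*?)\s*-- ASN1STOP'

# With DOTALL, "." also matches "\n", so the two-line block is captured.
with_dotall = re.findall(pattern, text, flags=re.DOTALL)
# Without it, only the single-line block can match.
without_dotall = re.findall(pattern, text)

print(with_dotall)     # ['Foo ::= INTEGER\nBar ::= BOOLEAN', 'Baz ::= NULL']
print(without_dotall)  # ['Baz ::= NULL']
```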
Extraction of text page by page from MS word docx file using python
I found that the Tika library offers an xmlContent option when reading a file. I used it to get the XML output and then split that output into pages. Below is the Python code that worked for me.
from tika import parser

def extract_pages(file):  # wrapper name is illustrative; the return needs a function
    raw_xml = parser.from_file(file, xmlContent=True)
    body = raw_xml['content'].split('<body>')[1].split('</body>')[0]
    body_without_tag = body.replace("<p>", "").replace("</p>", "").replace("<div>", "").replace("</div>", "").replace("<p />", "")
    text_pages = body_without_tag.split("""<div class="page">""")[1:]
    num_pages = len(text_pages)
    # check if it worked correctly: compare with the page count in the metadata
    if num_pages == int(raw_xml['metadata']['xmpTPg:NPages']):
        return text_pages
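The string handling can be tried without Tika installed. The sketch below applies the same split-and-replace steps to a short hand-written XHTML string; the sample is a simplified assumption about what the xmlContent output looks like, and split_pages is a made-up name.

```python
def split_pages(xml_content):
    """Split Tika-style XHTML into a list of per-page text strings."""
    body = xml_content.split('<body>')[1].split('</body>')[0]
    # Strip the plain tags, leaving only the page markers in place.
    for tag in ("<p>", "</p>", "<div>", "</div>", "<p />"):
        body = body.replace(tag, "")
    # Everything before the first page marker is dropped by [1:].
    return body.split('<div class="page">')[1:]

# Hand-written stand-in for raw_xml['content']; real output is much longer.
sample = (
    '<html><body>'
    '<div class="page"><p>Page one</p></div>'
    '<div class="page"><p>Page two</p><p /></div>'
    '</body></html>'
)
pages = split_pages(sample)
print(pages)  # ['Page one', 'Page two']
```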
Extracting text from MS Word Document uploaded through FileUpload from ipyWidgets in Jupyter Notebook
Modern MS Word files (.docx) are actually zip files. The text (but not the page headers) is inside an XML document called word/document.xml in the zip file.
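As a rough illustration of that layout, the standard library alone can pull the text out. This is only a sketch: docx_paragraphs is a made-up name, and the tiny archive built at the end stands in for a real .docx file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(source):
    """Return the text of each <w:p> paragraph; source is a path or file-like object."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    # A paragraph's text is spread over one or more <w:t> elements.
    return ["".join(t.text or "" for t in p.iter(W_NS + "t"))
            for p in root.iter(W_NS + "p")]

# Tiny hand-built archive standing in for a real .docx file.
xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body><w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Second</w:t></w:r></w:p></w:body></w:document>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", xml)
buffer.seek(0)

paragraphs = docx_paragraphs(buffer)
print(paragraphs)  # ['Hello world', 'Second']
```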
The python-docx module can be used to extract text from these documents. It is mainly used for creating documents, but it can read existing ones. For example:
>>> import docx
>>> doc = docx.Document('grokonez.docx')
>>> fullText = []
>>> for paragraph in doc.paragraphs:
...     fullText.append(paragraph.text)
...
Note that this only extracts the text from paragraphs, not, e.g., the text from tables.
Edit:
I want to be able to upload the MS file through the FileUpload widget.
There are a couple of ways you can do that.
First, isolate the actual file data. upload.data is actually a dictionary. So do something like:
rawdata = upload.data[0]
(Note that this format has changed across different versions of ipywidgets. The above example is from the documentation of the latest version. Read the relevant version of the documentation, or investigate the data in IPython, and adjust accordingly.)
- Write rawdata to e.g. foo.docx and open that. That would certainly work, but it does seem somewhat inelegant.
- docx.Document can work with file-like objects. So you could create an io.BytesIO object and use that.
Like this:
import io
foo = io.BytesIO(rawdata)
doc = docx.Document(foo)
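For comparison, the first option (writing the data to a file) might look roughly like this. It is a sketch that assumes rawdata is a bytes object; save_upload is a hypothetical helper name, and the byte string used here is only a stand-in for a real upload.

```python
import os
import tempfile

def save_upload(rawdata, suffix=".docx"):
    """Write uploaded bytes to a temporary file and return its path."""
    # delete=False keeps the file on disk after the handle is closed.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as handle:
        handle.write(rawdata)
        return handle.name

path = save_upload(b"example bytes")  # stand-in for the real upload data
size = os.path.getsize(path)          # the file now exists on disk
os.remove(path)                       # clean up once the file has been read
```

The returned path can be handed to docx.Document in place of the BytesIO object. The extra cleanup step is the main reason the in-memory version is usually tidier.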
Extracting MS Word document formatting elements along with raw text information
The character level formatting ("font") properties are available at the run level. A paragraph is made up of runs. So you can get what you want by going down to that level, like:
for paragraph in document.paragraphs:
    for run in paragraph.runs:
        font = run.font
        is_bold = font.bold
        # etc.
The biggest problem you're likely to encounter with that is that the run only knows about formatting that's been directly applied to it. If it looks the way it does because a style has been applied to it, you would have to query the style (which also has a font object) to see what properties it has.
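A rough sketch of that fallback, using only attributes python-docx exposes (run.font, run.style, paragraph.style, style.font, style.base_style). The function name effective_bold is made up, and this is a simplified approximation rather than Word's full formatting resolution. In python-docx, font.bold is None when the property is inherited rather than set.

```python
def effective_bold(run, paragraph):
    """Best-effort lookup of whether a run is bold.

    Checks direct formatting first, then the run's character style,
    then the paragraph style, walking each style's base_style chain.
    Returns True, False, or None if nothing along the way sets it.
    """
    if run.font.bold is not None:
        return run.font.bold  # direct formatting wins
    for style in (run.style, paragraph.style):
        while style is not None:
            if style.font.bold is not None:
                return style.font.bold
            style = style.base_style  # fall back to the parent style
    return None  # not set anywhere we looked; document defaults apply
```

Inside the loop above it would be called as effective_bold(run, paragraph). The same pattern should carry over to other tri-state font properties such as italic.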
Note that the python-docx that Mike was talking about is the legacy version which was completely rewritten after v0.2.0 (now 0.8.6). Docs are here: http://python-docx.readthedocs.org/en/latest/