Read documents (Python)
You are trying to open a document with the extension .docx, which cannot be read as plain text with the built-in open() function, because a .docx file is a ZIP archive of XML files. Instead, you can try using the docx2txt library, as follows:
import docx2txt
my_text = docx2txt.process("test.docx")
print(my_text)
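To illustrate why open() alone is not enough, here is a minimal sketch that reads the paragraph text straight out of the ZIP archive using only the standard library. The function name docx_paragraphs is made up for this example, and this is not how docx2txt is implemented; it ignores headers, footers, footnotes and images, which a dedicated library handles for you.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx (path or file-like object)."""
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Text runs live in <w:t> elements inside each <w:p> paragraph
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return paragraphs
```

For everyday use, docx2txt.process() as shown above is the simpler choice; the sketch only shows what kind of file you are dealing with.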
Reading a .doc file in Python using antiword on Windows (also .docx)
You can use the antiword command-line utility to do this. Many of you will have tried it already, but I still wanted to share.
- Download antiword.
- Extract the antiword folder to C:\ and add the path C:\antiword to your PATH environment variable.
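Before running anything, it can help to confirm that the PATH change from the steps above took effect. A small sketch using the standard library's shutil.which; the helper name tool_on_path is hypothetical:

```python
import shutil

def tool_on_path(name="antiword"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)

if __name__ == "__main__":
    location = tool_on_path()
    if location is None:
        print("antiword was not found on PATH; check the environment variable.")
    else:
        print("antiword found at", location)
```

If this reports that antiword was not found, open a new terminal (so the updated PATH is picked up) and try again.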
import os
import docx2txt

def get_doc_text(filepath, file):
    """Return the text of a .docx or .doc file, or '' if it cannot be read."""
    full_path = os.path.join(filepath, file)
    if file.endswith('.docx'):
        return docx2txt.process(full_path)
    elif file.endswith('.doc'):
        # antiword prints plain text; redirect it to a temporary file
        # named like the .doc file with an extra 'x' appended
        docx_file = full_path + 'x'
        if not os.path.exists(docx_file):
            os.system('antiword "' + full_path + '" > "' + docx_file + '"')
            with open(docx_file) as f:
                text = f.read()
            os.remove(docx_file)  # temporary file was only needed for reading
        else:
            # a different file with the same name and a .docx extension
            # already exists, so it must not be overwritten
            print('Info: a .docx file with the same name as the .doc already exists, so the .doc cannot be read')
            text = ''
        return text
    return ''  # any other extension is skipped
Now call this function:
filepath = "D:\\input\\"
files = os.listdir(filepath)
for file in files:
    text = get_doc_text(filepath, file)
    print(text)
This could be a good alternative way to read a .doc file in Python on Windows. Hope it helps, thanks.
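As a possible variation, antiword's output can be captured directly instead of going through a temporary file, since the tool writes the extracted text to standard output. The sketch below assumes antiword is on your PATH; the function name doc_to_text and its command parameter are made up for this example (the parameter only exists so a different converter can be swapped in).

```python
import subprocess

def doc_to_text(doc_path, command=("antiword",)):
    """Run the converter on doc_path and return whatever it prints to stdout."""
    result = subprocess.run(
        [*command, doc_path],   # argument list, so paths with spaces are safe
        capture_output=True,    # collect stdout/stderr instead of printing them
        text=True,              # decode output to str using the locale encoding
        check=True,             # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the arguments as a list avoids building a shell command string, so there is no temporary file to clean up and no name clash with an existing .docx file.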