python convert microsoft office docs to plain text on linux
I'd go for the command-line solution (and then use the Python subprocess module to run the tools from Python).
Converters for MS Word (catdoc), Excel (xls2csv) and PowerPoint (catppt) can be found (in source form) here: http://vitus.wagner.pp.ru/software/catdoc/.
I can't really comment on the usefulness of catppt, but catdoc and xls2csv work great!
But be sure to search your distribution's repositories first. On Ubuntu, for example, catdoc is just one quick apt-get away.
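A minimal sketch of the subprocess approach, assuming catdoc, xls2csv and catppt are installed and on PATH. The names TOOLS, pick_tool and office_to_text are made up for this illustration, and the UTF-8 decoding is an assumption (the tools' output encoding can depend on your locale):

```python
import shutil
import subprocess
from pathlib import Path

# Tool per extension, as named in the answer above.
TOOLS = {".doc": "catdoc", ".xls": "xls2csv", ".ppt": "catppt"}


def pick_tool(path):
    """Return the catdoc-family tool name for a legacy Office file."""
    suffix = Path(path).suffix.lower()
    if suffix not in TOOLS:
        raise ValueError(f"no converter known for {suffix!r}")
    return TOOLS[suffix]


def office_to_text(path):
    """Run the matching tool and return what it prints to stdout."""
    tool = pick_tool(path)
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool} is not installed or not on PATH")
    result = subprocess.run([tool, str(path)], capture_output=True, check=True)
    # Assumption: the output is UTF-8 compatible; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling office_to_text("report.doc") would then return the document text as a string, or raise if the tool is missing or exits with an error.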
Python & MS Word: Convert .doc to .docx?
If you are working on Linux/Ubuntu, you can use LibreOffice's built-in converter.
SYNTAX
lowriter --convert-to docx *.doc
Example
lowriter --convert-to docx testdoc.doc
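The first command converts every .doc file in the current folder to .docx; the second converts only testdoc.doc. By default the converted files are written to the current working directory.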
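Since the question asks about Python, here is a rough sketch of driving the same command through subprocess. It assumes lowriter is on PATH; the function names are hypothetical, and --headless and --outdir are standard LibreOffice options that the answer above does not use:

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, outdir=None):
    """Build the lowriter argument list for a .doc -> .docx conversion."""
    cmd = ["lowriter", "--headless", "--convert-to", "docx"]
    if outdir is not None:
        cmd += ["--outdir", str(outdir)]
    cmd.append(str(doc_path))
    return cmd


def doc_to_docx(doc_path, outdir=None):
    """Run the conversion and return the path where the .docx is expected."""
    subprocess.run(build_convert_command(doc_path, outdir), check=True)
    # Without --outdir, LibreOffice writes to the current working directory.
    target_dir = Path(outdir) if outdir is not None else Path.cwd()
    return target_dir / (Path(doc_path).stem + ".docx")
```

Building the argument list in its own function keeps the command easy to inspect or log before anything is executed.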
extracting text from MS word files in python
You could make a subprocess call to antiword. Antiword is a Linux command-line utility for dumping the text out of a Word document. It works pretty well for simple documents (obviously it loses formatting). It's available through apt, and probably as an RPM, or you could compile it yourself.
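A small sketch of such a call, assuming antiword is installed. The function name, the executable and timeout parameters, and the choice to return None on failure are all illustrative, not part of antiword itself:

```python
import subprocess


def antiword_text(doc_path, executable="antiword", timeout=30):
    """Return the text antiword prints for doc_path, or None on failure."""
    try:
        result = subprocess.run(
            [executable, str(doc_path)],
            capture_output=True,
            check=True,
            timeout=timeout,
        )
    except (FileNotFoundError, subprocess.CalledProcessError,
            subprocess.TimeoutExpired):
        # Missing binary, non-zero exit (e.g. not a Word file), or a hang.
        return None
    # Assumption: the output is UTF-8 compatible; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Returning None instead of raising makes it easy to skip unreadable files when looping over a directory; raise instead if you need to know why a file failed.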
Convert doc to txt via commandline
You will have to use two different command-line tools, depending on whether you are working with the .doc or the .docx format.
For .doc use catdoc:
catdoc foo.doc > foo.txt
For .docx use docx2txt:
docx2txt foo.docx
The latter will produce a file called foo.txt in the same directory as the original.
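If you want to drive both tools from Python, the two cases could be wrapped roughly like this. This is a sketch that assumes catdoc and docx2txt are on PATH and behave as described above; plan_conversion and convert_to_txt are hypothetical helper names:

```python
import subprocess
from pathlib import Path


def plan_conversion(path):
    """Return (argv, stdout_target) for a .doc or .docx file.

    stdout_target is the file that should receive the tool's stdout,
    or None when the tool writes its own output file.
    """
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".doc":
        # catdoc prints to stdout, so we redirect it ourselves.
        return ["catdoc", str(path)], path.with_suffix(".txt")
    if suffix == ".docx":
        # docx2txt creates foo.txt next to foo.docx on its own.
        return ["docx2txt", str(path)], None
    raise ValueError(f"unsupported extension: {suffix!r}")


def convert_to_txt(path):
    """Convert path to text and return the expected .txt location."""
    argv, stdout_target = plan_conversion(path)
    if stdout_target is None:
        subprocess.run(argv, check=True)
    else:
        with open(stdout_target, "wb") as out:
            subprocess.run(argv, stdout=out, check=True)
    return Path(path).with_suffix(".txt")
```

Either way the result ends up in foo.txt beside the original, so the caller does not need to care which tool was used.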
I'm not sure which Linux distribution you are using, but both catdoc and docx2txt are available from the Ubuntu repositories, for example:
apt-get install docx2txt
Or with Homebrew on Mac:
brew install docx2txt