Python Convert Microsoft Office Docs to Plain Text on Linux

python convert microsoft office docs to plain text on linux

I'd go for the command-line solution (and then use the Python subprocess module to run the tools from Python).

Converters for MS Word (catdoc), Excel (xls2csv) and PowerPoint (catppt) can be found (in source form) here: http://vitus.wagner.pp.ru/software/catdoc/.

I can't really comment on the usefulness of catppt, but catdoc and xls2csv work great!

But be sure to search your distribution's repositories first. On Ubuntu, for example, catdoc is just one quick apt-get away.
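A minimal sketch of the subprocess approach, assuming the catdoc suite is installed and on the PATH. The function names and the extension-to-tool mapping are illustrative and not part of the original answer; each tool is simply invoked with the file name and its standard output is captured.

```python
import os
import subprocess

# Extension -> tool, following the catdoc suite named above (illustrative mapping).
TOOLS = {".doc": "catdoc", ".xls": "xls2csv", ".ppt": "catppt"}


def build_command(path):
    """Return the argument list for the tool that matches the file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in TOOLS:
        raise ValueError(f"no converter known for {ext!r}")
    return [TOOLS[ext], path]


def office_to_text(path):
    """Run the matching tool and return whatever it prints to stdout."""
    result = subprocess.run(
        build_command(path), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling office_to_text("report.doc") would then return the text catdoc prints. With check=True, a failing tool raises subprocess.CalledProcessError rather than silently returning an empty string.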

Python & MS Word: Convert .doc to .docx?

If you are working on Linux/Ubuntu, you can use LibreOffice’s built-in converter.

SYNTAX

lowriter --convert-to docx *.doc

Example

lowriter --convert-to docx testdoc.doc

The wildcard form converts every .doc file in the current directory to .docx and saves the results in that same folder; the example converts only testdoc.doc.
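The same conversion can be driven from Python. This is a rough sketch, assuming lowriter is on the PATH. The --headless and --outdir options are LibreOffice command-line options that the answer above does not use; they are added here so that no window opens and so that the output location is explicit. The helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def docx_path_for(doc_path, out_dir):
    """Where the converted file is expected to land (testdoc.doc -> testdoc.docx)."""
    return Path(out_dir) / (Path(doc_path).stem + ".docx")


def build_convert_command(doc_path, out_dir):
    """Argument list for a headless .doc -> .docx conversion."""
    return [
        "lowriter", "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def doc_to_docx(doc_path, out_dir=None):
    """Convert one .doc file and return the path of the resulting .docx."""
    out_dir = Path(out_dir) if out_dir else Path(doc_path).parent
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    return docx_path_for(doc_path, out_dir)
```

Since LibreOffice may exit with status 0 even when a conversion did not produce a file, it is worth checking that the returned path exists before using it.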

extracting text from MS word files in python

You could make a subprocess call to antiword. Antiword is a Linux command-line utility for dumping text out of a Word document. It works pretty well for simple documents (though it obviously loses formatting). It's available through apt, and probably as an RPM, or you could compile it yourself.
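One way such a call might look, as a sketch rather than a definitive implementation. The function name and the tool parameter are hypothetical; the only assumption about antiword is that it prints the extracted text to standard output when given a file name.

```python
import shutil
import subprocess


def extract_text(path, tool="antiword"):
    """Run `tool path` and return its stdout, with explicit error reporting."""
    # Fail early with a clear message if the tool is not installed.
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool} is not installed or not on PATH")
    result = subprocess.run([tool, path], capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"{tool} failed: {result.stderr.strip()}")
    return result.stdout
```

Checking for the executable up front makes the "not installed" case easy to tell apart from a document the tool could not read.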

Convert doc to txt via commandline

You will have to use two different command-line tools, depending on whether you are working with the .doc or the .docx format.

For .doc use catdoc:

catdoc foo.doc > foo.txt

For .docx use docx2txt:

docx2txt foo.docx

The latter will produce a file called foo.txt in the same directory as the original.

I'm not sure which Linux distribution you are using, but both catdoc and docx2txt are available from the Ubuntu repositories, for example:

apt-get install docx2txt

Or with Homebrew on Mac:

brew install docx2txt
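If you want to call both tools from Python, a sketch along these lines may help. It assumes catdoc and docx2txt are installed and relies on the behaviour described above: catdoc prints to standard output, while docx2txt writes foo.txt next to foo.docx. The function names are illustrative, and the UTF-8 decoding of the output file is an assumption.

```python
import subprocess
from pathlib import Path


def text_output_path(docx_path):
    """Path where docx2txt is expected to write its output (foo.docx -> foo.txt)."""
    return Path(docx_path).with_suffix(".txt")


def word_to_text(path):
    """Return the plain text of a .doc or .docx file."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".doc":
        # catdoc prints the extracted text to stdout.
        done = subprocess.run(
            ["catdoc", str(path)], capture_output=True, text=True, check=True
        )
        return done.stdout
    if suffix == ".docx":
        # docx2txt writes a .txt file beside the original; read it back.
        subprocess.run(["docx2txt", str(path)], check=True)
        # Assumption: the output is UTF-8; undecodable bytes are replaced.
        return text_output_path(path).read_text(encoding="utf-8", errors="replace")
    raise ValueError(f"unsupported extension: {suffix!r}")
```

Note that the .docx branch leaves the generated .txt file on disk, so delete it afterwards if you only need the string.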

