How to Write a Python Script That Can Read Doc/Docx Files and Convert Them to Txt

DOCX file to text file conversion using Python

Problem

Your code does this in its last for loop:

        for para in document.paragraphs:
            textFilename = path + d.split(".")[0] + ".txt"
            with io.open(textFilename,"w", encoding="utf-8") as textFile:
                x=unicode(para.text)
                textFile.write((x))

For each paragraph in the document, you open a file named textFilename. Say you have a file named MyFile.docx in /home/python/resumes/: textFilename is then /home/python/resumes/MyFile.txt on every iteration of the for loop. The problem is that you open that same file each time in "w" (write) mode, which truncates it, so every paragraph overwrites whatever was written before it.

Solution:

You must open the file once, outside that for loop, and then write the paragraphs to it one by one.
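A minimal sketch of that fix, assuming the third-party python-docx package is what provides document.paragraphs in the question. The function names write_paragraphs and docx_to_txt are hypothetical, and writing a newline after each paragraph is an added choice that the original loop did not make.

```python
import io
import os


def write_paragraphs(paragraph_texts, text_filename):
    """Write all paragraphs to one text file that is opened a single time."""
    # Open the output file once, before the loop, so "w" truncates it only once.
    with io.open(text_filename, "w", encoding="utf-8") as text_file:
        for text in paragraph_texts:
            # Assumption: one paragraph per line in the output file.
            text_file.write(text + u"\n")


def docx_to_txt(docx_path):
    """Convert one .docx file to a .txt file next to it (hypothetical helper)."""
    # Assumption: python-docx is installed (pip install python-docx).
    from docx import Document

    document = Document(docx_path)
    text_filename = os.path.splitext(docx_path)[0] + ".txt"
    write_paragraphs((para.text for para in document.paragraphs), text_filename)
    return text_filename
```

Calling docx_to_txt("/home/python/resumes/MyFile.docx") would then write /home/python/resumes/MyFile.txt with every paragraph in it. os.path.splitext is used here instead of d.split(".")[0] so that file names containing extra dots keep their full base name.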

Python - doc to docx file converter input, file path from a txt file

# "file_path" is a placeholder for the path of your text file
with open("file_path",'r') as file_content:
    content=file_content.read()
    content=content.split('\n')

You can read the contents of the file with the method above, then convert them into a list (or any other iterable data type) so that you can use them in a for loop. I used content=content.split('\n') to split the contents on '\n' (the newline character that is inserted every time you press the Enter key); you can split on any other character instead.

for i in content:
    # the code you want to execute
    pass
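Putting the two pieces together, a rough sketch might look like the following. The function name read_paths and the file name paths.txt are hypothetical. Skipping blank lines is an added step, included because splitting on '\n' leaves an empty string when the file ends with a newline.

```python
import io


def read_paths(list_filename):
    """Return the non-empty lines of a text file, one file path per line."""
    with io.open(list_filename, "r", encoding="utf-8") as list_file:
        lines = list_file.read().split("\n")
    # Drop surrounding whitespace and skip blank lines.
    return [line.strip() for line in lines if line.strip()]


# Example use, assuming a file named paths.txt exists:
# for path in read_paths("paths.txt"):
#     print(path)  # replace with the code you want to execute
```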

Note

Some useful links:

  • Split
  • File writing
  • File read and write

