How to Use Python-Docx to Replace Text in a Word Document and Save

How to use python-docx to replace text in a Word document and save

UPDATE: There are a couple of paragraph-level functions that do a good job of this and can be found on the GitHub site for python-docx.

  1. This one will replace a regex-match with a replacement str. The replacement string will appear formatted the same as the first character of the matched string.
  2. This one will isolate a run such that some formatting can be applied to that word or phrase, like highlighting each occurence of "foobar" in the text or perhaps making it bold or appear in a larger font.

The current version of python-docx does not have a search() function or a replace() function. These are requested fairly frequently, but an implementation for the general case is quite tricky and it hasn't risen to the top of the backlog yet.

Several folks have had success though, getting done what they need, using the facilities already present. Here's an example. It has nothing to do with sections by the way :)

for paragraph in document.paragraphs:
if 'sea' in paragraph.text:
print paragraph.text
paragraph.text = 'new text containing ocean'

To search in Tables as well, you would need to use something like:

for table in document.tables:
for row in table.rows:
for cell in row.cells:
for paragraph in cell.paragraphs:
if 'sea' in paragraph.text:
paragraph.text = paragraph.text.replace("sea", "ocean")

If you pursue this path, you'll probably discover pretty quickly what the complexities are. If you replace the entire text of a paragraph, that will remove any character-level formatting, like a word or phrase in bold or italic.

By the way, the code from @wnnmaw's answer is for the legacy version of python-docx and won't work at all with versions after 0.3.0.

Python-docx: Find and replace all placeholder numbers in Word doc with random numbers

In Word, runs break at arbitrary locations in the text:

  • What is Run-level content in python-docx?.

  • How to effectively replace sentences in word document with python

You might be interested in the links in this answer that demonstrate the (surprisingly complex) work required to do this sort of thing in the general case:

How to use python-docx to replace text in a Word document and save

There are a couple of paragraph-level functions that do a good job of this and can be found on the GitHub site for python-docx.

This one will replace a regex-match with a replacement str. The replacement string will appear formatted the same as the first character of the matched string.

This one will isolate a run such that some formatting can be applied to that word or phrase, like highlighting each occurence of "foobar" in the text or perhaps making it bold or appear in a larger font.

Fortunately it's usually copy-pastable with good results :)

Python docx Replace string in paragraph while keeping style

I posted this question (even though I saw a few identical ones on here), because none of those (to my knowledge) solved the issue. There was one using a oodocx library, which I tried, but did not work. So I found a workaround.

The code is very similar, but the logic is: when I find the paragraph that contains the string I wish to replace, add another loop using runs.
(this will only work if the string I wish to replace has the same formatting).

def replace_string(filename):
doc = Document(filename)
for p in doc.paragraphs:
if 'old text' in p.text:
inline = p.runs
# Loop added to work with runs (strings with same style)
for i in range(len(inline)):
if 'old text' in inline[i].text:
text = inline[i].text.replace('old text', 'new text')
inline[i].text = text
print p.text

doc.save('dest1.docx')
return 1

Replace string in paragraph while keeping style docx library

This should meet your needs. Requires docx2python 2.0.0+

from docx2python.utilities import replace_docx_text

replace_docx_text(
input_filename,
output_filename,
("Apples", "Bananas"), # replace Apples with Bananas
("Pears", "Apples"), # replace Pears with Apples
("Bananas", "Pears"), # replace Bananas with Pears
html=True,
)

You may have a problem if your replacement strings include tabs or symbols, but "regular" text replacement will work and preserve most[1] formatting.

To allow this, docx2python will not replace text strings where formatting changes, e.g., "part of this string is bold", unless you specify html=False, in which case strings will be replaced regardless of format, and some formatting will be lost.

[1] The following will be preserved:

  • italic
  • bold
  • underline
  • strike
  • superscript
  • subscript
  • small caps
  • all caps
  • highlighted
  • font size
  • colored text
  • (some others, but not guaranteed)

Edit for follow-up question, how do I replace marker text in tables?

My workflow for doing this is to keep all formatting in Word. That is, I create a template in Word, slice out the context I need, then put everything back together like a puzzle.

This github "project" is an example (one file) of how I replace text in tables (where the tables can be any size).

https://github.com/ShayHill/replace_docx_tables

How to replace multiple words in .docx file and save the docx file using python-docx

This following code works for me. This preserve the format as well. Hope this will help others.

def replace_string1(filename='test.docx'):
doc = Document(filename)
list= ['ABC','XYZ']
list2 = ['PQR','DEF']
for p in doc.paragraphs:
inline = p.runs
for j in range(0,len(inline)):
for i in range(0, len(list)):
inline[j].text = inline[j].text.replace(list[i], list2[i])
print(p.text)
print(inline[j].text)
doc.save('dest1.docx')
return 1

How to modify existing text within Header and Footer word document using Python docx

It is already present in the documentation of the python-docx library, you can find it Here. Otherwise, here is a simple script that does what you want (adapt it for your need):

from docx import Document
from docx.shared import Pt

# read doc
doc = Document('./word.docx')

# define your style font and size for the header
style_h = doc.styles['Header']
font_h = style_h.font
font_h.name = 'Arial'
font_h.bold = True
font_h.size = Pt(20)

# define your style font and size for the footer
style_f = doc.styles['Footer']
font_f = style_f.font
font_f.name = 'Aharoni'
font_f.bold = False
font_f.size = Pt(10)

# for the header:
header = doc.sections[0].header
paragraph_h = header.paragraphs[0]
paragraph_h.text = 'January 2021' # insert new value here.
paragraph_h.style = doc.styles['Header'] # this is what changes the style

# for the footer:
footer = doc.sections[0].footer
paragraph_f = footer.paragraphs[0]
paragraph_f.text = 'January 2021' # insert new value here.
paragraph_f.style = doc.styles['Footer']# this is what changes the style

doc.save('./word_doc2.docx')

N.B: This work on only one page docs, you can easily add a loop if you want to apply the same thing to all the pages (or use pre-set functions from the docx library.

Text-Replace in docx and save the changed file with python-docx

As it seems to be, Docx for Python is not meant to store a full Docx with images, headers, ... , but only contains the inner content of the document. So there's no simple way to do this.

Howewer, here is how you could do it:

First, have a look at the docx tag wiki:

It explains how the docx file can be unzipped: Here's how a typical file looks like:

+--docProps
| + app.xml
| \ core.xml
+ res.log
+--word //this folder contains most of the files that control the content of the document
| + document.xml //Is the actual content of the document
| + endnotes.xml
| + fontTable.xml
| + footer1.xml //Containst the elements in the footer of the document
| + footnotes.xml
| +--media //This folder contains all images embedded in the word
| | \ image1.jpeg
| + settings.xml
| + styles.xml
| + stylesWithEffects.xml
| +--theme
| | \ theme1.xml
| + webSettings.xml
| \--_rels
| \ document.xml.rels //this document tells word where the images are situated
+ [Content_Types].xml
\--_rels
\ .rels

Docx only gets one part of the document, in the method opendocx

def opendocx(file):
'''Open a docx file, return a document XML tree'''
mydoc = zipfile.ZipFile(file)
xmlcontent = mydoc.read('word/document.xml')
document = etree.fromstring(xmlcontent)
return document

It only gets the document.xml file.

What I recommend you to do is:

  1. get the content of the document with **opendocx*
  2. Replace the document.xml with the advReplace method
  3. Open the docx as a zip, and replace the document.xml content's by the new xml content.
  4. Close and output the zipped file (renaming it to output.docx)

If you have node.js installed, be informed that I have worked on DocxGenJS which is templating engine for docx documents, the library is in active development and will be released soon as a node module.



Related Topics



Leave a reply



Submit