How to use python-docx to replace text in a Word document and save
UPDATE: There are a couple of paragraph-level functions that do a good job of this and can be found on the GitHub site for python-docx
.
- This one will replace a regex-match with a replacement str. The replacement string will appear formatted the same as the first character of the matched string.
- This one will isolate a run such that some formatting can be applied to that word or phrase, like highlighting each occurence of "foobar" in the text or perhaps making it bold or appear in a larger font.
The current version of python-docx does not have a
search()
function or a replace()
function. These are requested fairly frequently, but an implementation for the general case is quite tricky and it hasn't risen to the top of the backlog yet.Several folks have had success though, getting done what they need, using the facilities already present. Here's an example. It has nothing to do with sections by the way :)
for paragraph in document.paragraphs:
if 'sea' in paragraph.text:
print paragraph.text
paragraph.text = 'new text containing ocean'
To search in Tables as well, you would need to use something like:for table in document.tables:
for row in table.rows:
for cell in row.cells:
for paragraph in cell.paragraphs:
if 'sea' in paragraph.text:
paragraph.text = paragraph.text.replace("sea", "ocean")
If you pursue this path, you'll probably discover pretty quickly what the complexities are. If you replace the entire text of a paragraph, that will remove any character-level formatting, like a word or phrase in bold or italic.By the way, the code from @wnnmaw's answer is for the legacy version of python-docx and won't work at all with versions after 0.3.0.
Python-docx: Find and replace all placeholder numbers in Word doc with random numbers
In Word, runs break at arbitrary locations in the text:
What is Run-level content in python-docx?.
How to effectively replace sentences in word document with python
How to use python-docx to replace text in a Word document and save
There are a couple of paragraph-level functions that do a good job of this and can be found on the GitHub site for python-docx.Fortunately it's usually copy-pastable with good results :)This one will replace a regex-match with a replacement str. The replacement string will appear formatted the same as the first character of the matched string.
This one will isolate a run such that some formatting can be applied to that word or phrase, like highlighting each occurence of "foobar" in the text or perhaps making it bold or appear in a larger font.
Python docx Replace string in paragraph while keeping style
I posted this question (even though I saw a few identical ones on here), because none of those (to my knowledge) solved the issue. There was one using a oodocx library, which I tried, but did not work. So I found a workaround.
The code is very similar, but the logic is: when I find the paragraph that contains the string I wish to replace, add another loop using runs.
(this will only work if the string I wish to replace has the same formatting).
def replace_string(filename):
doc = Document(filename)
for p in doc.paragraphs:
if 'old text' in p.text:
inline = p.runs
# Loop added to work with runs (strings with same style)
for i in range(len(inline)):
if 'old text' in inline[i].text:
text = inline[i].text.replace('old text', 'new text')
inline[i].text = text
print p.text
doc.save('dest1.docx')
return 1
Replace string in paragraph while keeping style docx library
This should meet your needs. Requires docx2python 2.0.0+
from docx2python.utilities import replace_docx_text
replace_docx_text(
input_filename,
output_filename,
("Apples", "Bananas"), # replace Apples with Bananas
("Pears", "Apples"), # replace Pears with Apples
("Bananas", "Pears"), # replace Bananas with Pears
html=True,
)
You may have a problem if your replacement strings include tabs or symbols, but "regular" text replacement will work and preserve most[1] formatting.To allow this, docx2python will not replace text strings where formatting changes, e.g., "part of this string is bold", unless you specify html=False
, in which case strings will be replaced regardless of format, and some formatting will be lost.
[1] The following will be preserved:
- italic
- bold
- underline
- strike
- superscript
- subscript
- small caps
- all caps
- highlighted
- font size
- colored text
- (some others, but not guaranteed)
My workflow for doing this is to keep all formatting in Word. That is, I create a template in Word, slice out the context I need, then put everything back together like a puzzle.
This github "project" is an example (one file) of how I replace text in tables (where the tables can be any size).
https://github.com/ShayHill/replace_docx_tables
How to replace multiple words in .docx file and save the docx file using python-docx
This following code works for me. This preserve the format as well. Hope this will help others.
def replace_string1(filename='test.docx'):
doc = Document(filename)
list= ['ABC','XYZ']
list2 = ['PQR','DEF']
for p in doc.paragraphs:
inline = p.runs
for j in range(0,len(inline)):
for i in range(0, len(list)):
inline[j].text = inline[j].text.replace(list[i], list2[i])
print(p.text)
print(inline[j].text)
doc.save('dest1.docx')
return 1
How to modify existing text within Header and Footer word document using Python docx
It is already present in the documentation of the python-docx
library, you can find it Here. Otherwise, here is a simple script that does what you want (adapt it for your need):
from docx import Document
from docx.shared import Pt
# read doc
doc = Document('./word.docx')
# define your style font and size for the header
style_h = doc.styles['Header']
font_h = style_h.font
font_h.name = 'Arial'
font_h.bold = True
font_h.size = Pt(20)
# define your style font and size for the footer
style_f = doc.styles['Footer']
font_f = style_f.font
font_f.name = 'Aharoni'
font_f.bold = False
font_f.size = Pt(10)
# for the header:
header = doc.sections[0].header
paragraph_h = header.paragraphs[0]
paragraph_h.text = 'January 2021' # insert new value here.
paragraph_h.style = doc.styles['Header'] # this is what changes the style
# for the footer:
footer = doc.sections[0].footer
paragraph_f = footer.paragraphs[0]
paragraph_f.text = 'January 2021' # insert new value here.
paragraph_f.style = doc.styles['Footer']# this is what changes the style
doc.save('./word_doc2.docx')
N.B: This work on only one page docs, you can easily add a loop if you want to apply the same thing to all the pages (or use pre-set functions from the docx
library. Text-Replace in docx and save the changed file with python-docx
As it seems to be, Docx for Python is not meant to store a full Docx with images, headers, ... , but only contains the inner content of the document. So there's no simple way to do this.
Howewer, here is how you could do it:
First, have a look at the docx tag wiki:
It explains how the docx file can be unzipped: Here's how a typical file looks like:
+--docProps
| + app.xml
| \ core.xml
+ res.log
+--word //this folder contains most of the files that control the content of the document
| + document.xml //Is the actual content of the document
| + endnotes.xml
| + fontTable.xml
| + footer1.xml //Containst the elements in the footer of the document
| + footnotes.xml
| +--media //This folder contains all images embedded in the word
| | \ image1.jpeg
| + settings.xml
| + styles.xml
| + stylesWithEffects.xml
| +--theme
| | \ theme1.xml
| + webSettings.xml
| \--_rels
| \ document.xml.rels //this document tells word where the images are situated
+ [Content_Types].xml
\--_rels
\ .rels
Docx only gets one part of the document, in the method opendocxdef opendocx(file):
'''Open a docx file, return a document XML tree'''
mydoc = zipfile.ZipFile(file)
xmlcontent = mydoc.read('word/document.xml')
document = etree.fromstring(xmlcontent)
return document
It only gets the document.xml file. What I recommend you to do is:
- get the content of the document with **opendocx*
- Replace the document.xml with the advReplace method
- Open the docx as a zip, and replace the document.xml content's by the new xml content.
- Close and output the zipped file (renaming it to output.docx)
Related Topics
Python Pandas Dataframe, Is It Pass-By-Value or Pass-By-Reference
Counting Each Letter's Frequency in a String
How to Detect If a File Is Binary (Non-Text) in Python
Cannot Redirect Output When I Run Python Script on Windows Using Just Script's Name
Pandas Latitude-Longitude to Distance Between Successive Rows
How to Reverse a Dictionary That Has Repeated Values
How to Calculate the Inverse of the Normal Cumulative Distribution Function in Python
How to Find the Maximum Value in a List of Tuples
Os.Path.Dirname(_File_) Returns Empty
How to Obscure a Line Behind a Surface Plot in Matplotlib
When Should an Attribute Be Private and Made a Read-Only Property
Generating PDF-Latex with Python Script
Matplotlib: How to Show a Figure That Has Been Closed
Logisticregression: Unknown Label Type: 'Continuous' Using Sklearn in Python
Numpy Version of "Exponential Weighted Moving Average", Equivalent to Pandas.Ewm().Mean()