Extract Text from XML Documents in Python

Extracting text from XML using Python

Python's standard library already includes an XML parser, ElementTree. For example:

>>> import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
>>> xmlstr = """
... <root>
... <page>
...   <title>Chapter 1</title>
...   <content>Welcome to Chapter 1</content>
... </page>
... <page>
...  <title>Chapter 2</title>
...  <content>Welcome to Chapter 2</content>
... </page>
... </root>
... """
>>> root = ET.fromstring(xmlstr)
>>> for page in list(root):
...     title = page.find('title').text
...     content = page.find('content').text
...     print('title: %s; content: %s' % (title, content))
...
title: Chapter 1; content: Welcome to Chapter 1
title: Chapter 2; content: Welcome to Chapter 2
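
One thing to watch for: if a page lacks a title or content element, page.find(...) returns None and the .text access raises AttributeError. Here is a minimal sketch that uses findtext with a default instead, assuming the same document shape as above. The helper name extract_pages and the second page without content are made up for illustration.

```python
import xml.etree.ElementTree as ET

def extract_pages(xml_text):
    """Return a list of (title, content) tuples, one per <page> element."""
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter('page'):
        # findtext returns the default instead of raising when the child is missing
        title = page.findtext('title', default='')
        content = page.findtext('content', default='')
        pages.append((title, content))
    return pages

# Hypothetical input: the second page has no <content> element
sample = """
<root>
<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
  <title>Chapter 2</title>
</page>
</root>
"""

for title, content in extract_pages(sample):
    print('title: %s; content: %s' % (title, content))
```

Whether an empty string is the right default depends on what you do with the result; None would work too if you want to tell "missing" apart from "empty".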

How to extract text from an XML file using Python

Your XML document declares a namespace, so every tag name has to be qualified with it. The lookup becomes something like:

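# tree is assumed to be the already-parsed document
for patient_role in tree.findall('.//{urn:hl7-org:v3}patientRole'):
    # Child tags need the same namespace prefix in braces
    number = patient_role.find('{urn:hl7-org:v3}telecom').attrib['value']
    print(number)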

Output:

tel:+1(303)-554-8889
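
Repeating the full URI in braces gets noisy quickly. ElementTree's find and findall also accept a prefix-to-URI mapping, which might read better. A rough sketch follows; the XML fragment is hypothetical and only shaped to match the tag names used above, and the prefix hl7 is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, shaped only to match the tag names used above
xml_text = """
<ClinicalDocument xmlns="urn:hl7-org:v3">
  <recordTarget>
    <patientRole>
      <telecom value="tel:+1(303)-554-8889"/>
    </patientRole>
  </recordTarget>
</ClinicalDocument>
"""

# The prefix 'hl7' is arbitrary; only the URI has to match the document
NS = {'hl7': 'urn:hl7-org:v3'}

root = ET.fromstring(xml_text)
numbers = []
for patient_role in root.findall('.//hl7:patientRole', NS):
    telecom = patient_role.find('hl7:telecom', NS)
    # Skip patientRole elements that have no telecom child
    if telecom is not None:
        numbers.append(telecom.get('value'))

print(numbers)
```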

Extract text from XML file

from xml.dom import minidom

doc = minidom.parse("yourxmlfile.xml")

print(doc.getElementsByTagName("alt1")[0].firstChild.data)
print(doc.getElementsByTagName("alt2")[0].firstChild.data)

Example of extracting the data using minidom.
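
Note that firstChild.data only gives you the first text node, and it fails with AttributeError when the element is empty (firstChild is None). If that matters for your file, a small recursive helper may be safer. This is only a sketch: node_text is a made-up name, and the inline document stands in for yourxmlfile.xml.

```python
from xml.dom import Node, minidom

def node_text(node):
    """Concatenate the data of all text and CDATA nodes below `node`."""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            parts.append(child.data)
        elif child.nodeType == Node.ELEMENT_NODE:
            # Recurse into nested elements so their text is included too
            parts.append(node_text(child))
    return ''.join(parts)

# Hypothetical document: alt1 has nested markup, alt2 is empty
doc = minidom.parseString('<root><alt1>first <b>bold</b> text</alt1><alt2/></root>')

alt1 = doc.getElementsByTagName('alt1')[0]
alt2 = doc.getElementsByTagName('alt2')[0]
print(repr(node_text(alt1)))
print(repr(node_text(alt2)))
```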

Extract text from XML documents in Python

You could simply strip out any tags:

>>> import re
>>> txt = """<bookstore>
...     <book category="COOKING">
...         <title lang="english">Everyday Italian</title>
...         <author>Giada De Laurentiis</author>
...         <year>2005</year>
...         <price>300.00</price>
...     </book>
...
...     <book category="CHILDREN">
...         <title lang="english">Harry Potter</title>
...         <author>J K. Rowling </author>
...         <year>2005</year>
...         <price>625.00</price>
...     </book>
... </bookstore>"""
>>> exp = re.compile(r'<.*?>')
>>> text_only = exp.sub('',txt).strip()
>>> text_only
'Everyday Italian\n        Giada De Laurentiis\n        2005\n        300.00\n    \n\n    \n        Harry Potter\n        J K. Rowling \n        2005\n        625.00'

But if you just want to search files for some text in Linux, you can use grep:

$ grep "Harry Potter" file.xml
        <title lang="english">Harry Potter</title>

To do the same search from Python instead of grep, read the file, strip the tags, and test for the text:

>>> import re
>>> exp = re.compile(r'<.*?>')
>>> with open('file.xml') as f:
...     text_only = exp.sub('', f.read()).strip()
...
>>> if 'Harry Potter' in text_only:
...     print('It exists')
... else:
...     print('It does not')
...
It exists
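
The regex approach is quick, but it leaves entities such as &amp; undecoded and can trip over a > inside an attribute value or a CDATA section. If that is a concern, a parser-based version along these lines might be more robust. It is a sketch under the assumption that the input is well-formed XML; the title containing &amp; is a made-up example to show the decoding.

```python
import xml.etree.ElementTree as ET

# Hypothetical input with an entity in the title
txt = """<bookstore>
    <book category="COOKING">
        <title lang="english">Fish &amp; Chips</title>
        <author>Giada De Laurentiis</author>
    </book>
</bookstore>"""

root = ET.fromstring(txt)

# itertext() yields every text fragment in document order, entities already decoded
fragments = (fragment.strip() for fragment in root.itertext())

# Drop the whitespace-only fragments that come from indentation between tags
text_only = '\n'.join(fragment for fragment in fragments if fragment)
print(text_only)
```

Unlike the regex, this raises a ParseError on malformed input, which may or may not be what you want when scanning arbitrary files.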

Extracting text from an xml doc with Python ElementTree

I think you're looking for the itertext method:

# Iterate over all the sample blocks
for sample in tree.xpath('//sample'):
    print(''.join(sample.itertext()))

Full code:

# Load module
import lxml.etree as etree

# Load data
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('data.xml', parser)

# Iterate over all the sample blocks
for sample in tree.xpath('//sample'):
    print(''.join(sample.itertext()))

# programmer l'
# enregistreur
# des
# oeuvres
# La Chevauchée de Virginia

Extract XML text when elements are in between the text

Using BeautifulSoup:

list_test.xml:

<do title='Example document' date='today'>
<db descr='First level'>
    <P>
        Some text here that
        <af d='reference 1'>continues</af>
        but then has some more stuff.
    </P>
</db>
</do>

and then:

from bs4 import BeautifulSoup

with open('list_test.xml', 'r') as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# html.parser lowercases tag names, so <P> is matched as 'p'
for line in soup.find_all('p'):
    print(line.text)

Output:

Some text here that
continues
but then has some more stuff.

EDIT:

Using ElementTree:

import xml.etree.ElementTree as ET
xml = '<p> Some text here that <af d="reference 1">continues</af> but then has some more stuff.</p>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))
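
It may help to see why itertext is needed here. ElementTree stores the text before the first child on the parent's .text, and the text after a child's closing tag on that child's .tail, so p.text alone would drop everything after the first nested element. A small sketch with the same string (the whitespace collapsing at the end is an optional extra, not part of the answer above):

```python
import xml.etree.ElementTree as ET

xml_text = '<p> Some text here that <af d="reference 1">continues</af> but then has some more stuff.</p>'
p = ET.fromstring(xml_text)
af = p.find('af')

print(repr(p.text))   # text before the first child element
print(repr(af.text))  # text inside <af>
print(repr(af.tail))  # text after </af>, stored on the child rather than on <p>

# itertext() walks text and tail in document order; split/join collapses whitespace
joined = ' '.join(''.join(p.itertext()).split())
print(joined)
```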