Extract Text from XML Documents in Python

Extracting text from XML using Python

Python's standard library already includes an XML parser, ElementTree. For example:

>>> import xml.etree.ElementTree as ET
>>> xmlstr = """
... <root>
... <page>
... <title>Chapter 1</title>
... <content>Welcome to Chapter 1</content>
... </page>
... <page>
... <title>Chapter 2</title>
... <content>Welcome to Chapter 2</content>
... </page>
... </root>
... """
>>> root = ET.fromstring(xmlstr)
>>> for page in root:
...     title = page.find('title').text
...     content = page.find('content').text
...     print('title: %s; content: %s' % (title, content))
...
title: Chapter 1; content: Welcome to Chapter 1
title: Chapter 2; content: Welcome to Chapter 2
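One thing to watch for: page.find('title').text raises AttributeError when the child element is missing, because find() returns None. A minimal sketch of a more forgiving loop, assuming a hypothetical input in which the second page has no content element, could use findtext() with a default:

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the second <page> has no <content> child.
xmlstr = """
<root>
  <page><title>Chapter 1</title><content>Welcome to Chapter 1</content></page>
  <page><title>Chapter 2</title></page>
</root>
"""

root = ET.fromstring(xmlstr)

# findtext() returns the default instead of failing when the child
# element is absent.
pages = [
    (page.findtext('title', default=''), page.findtext('content', default=''))
    for page in root
]

for title, content in pages:
    print('title: %s; content: %s' % (title, content))
```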

How to extract text from an XML file using Python

Your XML document declares a namespace, so each tag name in the search path has to be qualified with the namespace URI in braces. It becomes something like:

for patient_role in tree.findall('.//{urn:hl7-org:v3}patientRole'):
    number = patient_role.find('{urn:hl7-org:v3}telecom').attrib['value']
    print(number)

Output:

tel:+1(303)-554-8889
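Repeating the URI in braces gets noisy, so here is a rough self-contained sketch that passes a prefix map to find() and findall() instead. The document below is a made-up, heavily trimmed stand-in (only the namespace URI, the patientRole and telecom names, and the phone number come from the snippet above), and the prefix hl7 is an arbitrary choice:

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down document for illustration only.
xmlstr = """
<ClinicalDocument xmlns="urn:hl7-org:v3">
  <recordTarget>
    <patientRole>
      <telecom value="tel:+1(303)-554-8889"/>
    </patientRole>
  </recordTarget>
</ClinicalDocument>
"""

root = ET.fromstring(xmlstr)

# Map a prefix of our own choosing to the namespace URI.
ns = {'hl7': 'urn:hl7-org:v3'}

numbers = [
    role.find('hl7:telecom', ns).get('value')
    for role in root.findall('.//hl7:patientRole', ns)
]
print(numbers)
```

The prefix only has to match the key in the dictionary; it does not need to match any prefix used in the file itself.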

Extract text from XML file

from xml.dom import minidom

doc = minidom.parse("yourxmlfile.xml")

print(doc.getElementsByTagName("alt1")[0].firstChild.data)
print(doc.getElementsByTagName("alt2")[0].firstChild.data)

This example uses minidom to print the text of the first alt1 and alt2 elements in the file.
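Note that firstChild.data only works when the element's first child is a text node: an empty element has no firstChild, and nested markup splits the text across several nodes. A small sketch of a helper that collects all the text below a node, assuming a made-up document (only the tag names alt1 and alt2 come from the example above; node_text and alt3 are hypothetical):

```python
from xml.dom import minidom, Node

# Hypothetical document for illustration.
doc = minidom.parseString(
    "<root><alt1>first value</alt1>"
    "<alt2>second <b>nested</b> value</alt2><alt3/></root>"
)

def node_text(node):
    """Concatenate the data of every text node below `node`."""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            parts.append(child.data)
        elif child.nodeType == Node.ELEMENT_NODE:
            # Recurse into nested elements such as <b>.
            parts.append(node_text(child))
    return ''.join(parts)

alt1 = node_text(doc.getElementsByTagName("alt1")[0])
alt2 = node_text(doc.getElementsByTagName("alt2")[0])
alt3 = node_text(doc.getElementsByTagName("alt3")[0])  # empty element

print(alt1)
print(alt2)
```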

Extract text from XML documents in Python

For a quick-and-dirty result, you could simply strip out any tags with a regular expression:

>>> import re
>>> txt = """<bookstore>
... <book category="COOKING">
... <title lang="english">Everyday Italian</title>
... <author>Giada De Laurentiis</author>
... <year>2005</year>
... <price>300.00</price>
... </book>
...
... <book category="CHILDREN">
... <title lang="english">Harry Potter</title>
... <author>J K. Rowling </author>
... <year>2005</year>
... <price>625.00</price>
... </book>
... </bookstore>"""
>>> exp = re.compile(r'<.*?>')
>>> text_only = exp.sub('',txt).strip()
>>> text_only
'Everyday Italian\n Giada De Laurentiis\n 2005\n 300.00\n \n\n \n Harry Potter\n J K. Rowling \n 2005\n 625.00'

But if you just want to search files for some text in Linux, you can use grep:

burhan@sandbox:~$ grep "Harry Potter" file.xml
<title lang="english">Harry Potter</title>

If you want to search in a file, use the grep command above, or open the file and search for it in Python:

>>> import re
>>> exp = re.compile(r'<.*?>')
>>> with open('file.xml') as f:
...     text_only = exp.sub('', f.read()).strip()
...
>>> if 'Harry Potter' in text_only:
...     print('It exists')
... else:
...     print('It does not')
...
It exists
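Keep in mind that tag stripping is only a rough approximation. A short sketch comparing it with a real parser, using a made-up input that contains an entity and a ">" inside an attribute value:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: '>' inside an attribute value and an encoded '&'.
txt = '<book><title note="a > b">Fish &amp; Chips</title></book>'

# Tag stripping: the match stops at the first '>', so part of the
# attribute leaks into the text, and the entity stays encoded.
stripped = re.sub(r'<.*?>', '', txt).strip()

# Parsing: attributes are kept out of the text and entities are decoded.
parsed = ''.join(ET.fromstring(txt).itertext()).strip()

print(repr(stripped))
print(repr(parsed))
```

If the files are well-formed XML, the parser version is usually the safer choice; the regular expression is fine for a quick search where such edge cases do not matter.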

Extracting text from an xml doc with Python ElementTree

I think you're looking for the itertext method (shown here with lxml, which provides xpath):

# Iterate over all the sample blocks
for sample in tree.xpath('//sample'):
    print(''.join(sample.itertext()))

Full code:

# Load module
import lxml.etree as etree

# Load data
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('data.xml', parser)

# Iterate over all the sample blocks
for sample in tree.xpath('//sample'):
    print(''.join(sample.itertext()))

# programmer l'
# enregistreur
# des
# oeuvres
# La Chevauchée de Virginia
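If lxml is not installed, roughly the same thing can be done with the standard library, since its elements also have itertext(). This is only a sketch: the structure of data.xml is not shown above, so the input below (including the w element) is invented for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical data; the real data.xml may be structured differently.
xmlstr = """
<samples>
  <sample>programmer l'<w>enregistreur</w> des <w>oeuvres</w></sample>
  <sample>La Chevauchée de Virginia</sample>
</samples>
"""

root = ET.fromstring(xmlstr)

# iter('sample') walks the whole tree, much like the '//sample' XPath.
texts = [''.join(sample.itertext()) for sample in root.iter('sample')]

for text in texts:
    print(text)
```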

Extract xml text when elements in between text

Using BeautifulSoup:

list_test.xml:

<do title='Example document' date='today'>
  <db descr='First level'>
    <P>
      Some text here that
      <af d='reference 1'>continues</af>
      but then has some more stuff.
    </P>
  </db>
</do>

and then:

from bs4 import BeautifulSoup

with open('list_test.xml', 'r') as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# html.parser lowercases tag names, so <P> is matched by 'p'
for line in soup.find_all('p'):
    print(line.text)

OUTPUT:

Some text here that
continues
but then has some more stuff.

EDIT:

Using ElementTree:

import xml.etree.ElementTree as ET
xml = '<p> Some text here that <af d="reference 1">continues</af> but then has some more stuff.</p>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))

OUTPUT:

Some text here that continues but then has some more stuff.
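The reason itertext() is needed here is that ElementTree stores mixed content in two places: the text before the first child goes in the parent's text, and the text after a child goes in that child's tail. A small sketch with the same markup (the names pieces and own_text are just for illustration):

```python
import xml.etree.ElementTree as ET

xml = '<p>Some text here that <af d="reference 1">continues</af> but then has some more stuff.</p>'
p = ET.fromstring(xml)
af = p.find('af')

# The three fragments that itertext() yields, in document order.
pieces = [p.text, af.text, af.tail]

# Text that belongs to <p> itself, skipping the content of <af>.
own_text = (p.text or '') + ''.join(child.tail or '' for child in p)

print(pieces)
print(own_text)
```

So reading only p.text would give you just the first fragment, which is the usual cause of "missing" text with inline elements.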

