Extracting text from XML using Python
Python already ships with a built-in XML library, ElementTree (the older cElementTree alias was removed in Python 3.9, so import ElementTree directly). For example:
>>> from xml.etree import ElementTree as ET
>>> xmlstr = """
... <root>
... <page>
... <title>Chapter 1</title>
... <content>Welcome to Chapter 1</content>
... </page>
... <page>
... <title>Chapter 2</title>
... <content>Welcome to Chapter 2</content>
... </page>
... </root>
... """
>>> root = ET.fromstring(xmlstr)
>>> for page in root:
...     title = page.find('title').text
...     content = page.find('content').text
...     print('title: %s; content: %s' % (title, content))
...
title: Chapter 1; content: Welcome to Chapter 1
title: Chapter 2; content: Welcome to Chapter 2
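The loop above raises AttributeError if a page lacks one of the child elements, because find() returns None in that case. A minimal sketch of a more forgiving variant, assuming a made-up sample in which the second page has no content element (the sample and the helper name extract_pages are illustrative, not from the original answer):

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: the second <page> deliberately has no <content> child.
XML_SAMPLE = """
<root>
  <page>
    <title>Chapter 1</title>
    <content>Welcome to Chapter 1</content>
  </page>
  <page>
    <title>Chapter 2</title>
  </page>
</root>
"""

def extract_pages(xml_text):
    """Return a list of (title, content) tuples, one per <page>."""
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.findall('page'):
        # findtext() returns the default instead of raising when the child is absent
        title = page.findtext('title', default='')
        content = page.findtext('content', default='')
        pages.append((title, content))
    return pages

pages = extract_pages(XML_SAMPLE)
for title, content in pages:
    print('title: %s; content: %s' % (title, content))
```

Whether an empty string is the right default depends on the data; None would work just as well if missing values need to be told apart from empty ones.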
How to extract text from an XML file using Python
Your XML document declares a namespace, so every tag name in the search has to be qualified with it. It becomes something like:
for patient_role in tree.findall('.//{urn:hl7-org:v3}patientRole'):
    number = patient_role.find('{urn:hl7-org:v3}telecom').attrib['value']
    print(number)
Output:
tel:+1(303)-554-8889
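Repeating the full namespace URI in every path gets noisy. ElementTree's find and findall also accept a prefix-to-URI mapping, which can be sketched as follows. The XML fragment here is a hypothetical stand-in (only the namespace URI, the patientRole and telecom tag names, and the phone number come from the answer above), and the prefix hl7 is an arbitrary local alias:

```python
from xml.etree import ElementTree as ET

# Hypothetical fragment; the real document is not shown in the answer.
XML_SAMPLE = """
<ClinicalDocument xmlns="urn:hl7-org:v3">
  <recordTarget>
    <patientRole>
      <telecom value="tel:+1(303)-554-8889"/>
    </patientRole>
  </recordTarget>
</ClinicalDocument>
"""

# The prefix is a local alias; it does not have to appear in the document itself.
NS = {'hl7': 'urn:hl7-org:v3'}

root = ET.fromstring(XML_SAMPLE)
numbers = []
for patient_role in root.findall('.//hl7:patientRole', NS):
    telecom = patient_role.find('hl7:telecom', NS)
    if telecom is not None:
        # get() returns None rather than raising if the attribute is missing
        numbers.append(telecom.get('value'))

print(numbers)
```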
Extract text from XML file
Here is an example of extracting the data using minidom:
from xml.dom import minidom
doc = minidom.parse("yourxmlfile.xml")
print(doc.getElementsByTagName("alt1")[0].firstChild.data)
print(doc.getElementsByTagName("alt2")[0].firstChild.data)
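Note that firstChild.data only works when the element's first child is a text node; an empty element has firstChild set to None, and an element with nested markup yields only part of its text. A small sketch of a recursive helper that handles both cases, using a hypothetical inline document (the tag names alt1 and alt2 come from the snippet above; alt3, the sample text, and the helper name element_text are made up for illustration):

```python
from xml.dom import minidom, Node

# Hypothetical document; only the alt1/alt2 tag names come from the answer.
XML_SAMPLE = "<root><alt1>first</alt1><alt2>second <b>nested</b> text</alt2><alt3/></root>"

def element_text(node):
    """Concatenate every text node beneath `node`, in document order."""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            parts.append(child.data)
        elif child.nodeType == Node.ELEMENT_NODE:
            # Recurse into nested elements such as <b>
            parts.append(element_text(child))
    return ''.join(parts)

doc = minidom.parseString(XML_SAMPLE)
alt1 = element_text(doc.getElementsByTagName("alt1")[0])
alt2 = element_text(doc.getElementsByTagName("alt2")[0])
alt3 = element_text(doc.getElementsByTagName("alt3")[0])  # empty element

print(repr(alt1), repr(alt2), repr(alt3))
```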
Extract text from XML documents in Python
You could simply strip out any tags:
>>> import re
>>> txt = """<bookstore>
... <book category="COOKING">
... <title lang="english">Everyday Italian</title>
... <author>Giada De Laurentiis</author>
... <year>2005</year>
... <price>300.00</price>
... </book>
...
... <book category="CHILDREN">
... <title lang="english">Harry Potter</title>
... <author>J K. Rowling </author>
... <year>2005</year>
... <price>625.00</price>
... </book>
... </bookstore>"""
>>> exp = re.compile(r'<.*?>')
>>> text_only = exp.sub('',txt).strip()
>>> text_only
'Everyday Italian\n Giada De Laurentiis\n 2005\n 300.00\n \n\n \n Harry Potter\n J K. Rowling \n 2005\n 625.00'
But if you just want to search files for some text in Linux, you can use grep:
burhan@sandbox:~$ grep "Harry Potter" file.xml
<title lang="english">Harry Potter</title>
If you want to search in a file, use the grep command above, or open the file and search for it in Python:
>>> import re
>>> exp = re.compile(r'<.*?>')
>>> with open('file.xml') as f:
...     text_only = exp.sub('', f.read()).strip()
...
>>> if 'Harry Potter' in text_only:
...     print('It exists')
... else:
...     print('It does not')
...
It exists
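The regex shortcut has limits worth knowing: without re.DOTALL the pattern does not match across line breaks, so a tag or comment that spans two lines is left in place, and entities such as &amp; are not decoded. The comparison below is a rough sketch against a hypothetical sample (the sample text and both helper names are made up for illustration); it swaps in ElementTree's itertext as the parser-based alternative rather than anything the original answer proposed:

```python
import re
from xml.etree import ElementTree as ET

# Hypothetical sample with an entity and a comment spanning two lines.
XML_SAMPLE = """<bookstore>
  <!-- a comment that
       spans two lines -->
  <book category="CHILDREN">
    <title lang="english">Harry Potter &amp; Friends</title>
  </book>
</bookstore>"""

def text_via_regex(xml_text):
    """Strip anything that looks like a single-line tag."""
    return re.sub(r'<.*?>', '', xml_text).strip()

def text_via_parser(xml_text):
    """Parse the document and join all text, collapsing runs of whitespace."""
    root = ET.fromstring(xml_text)
    return ' '.join(''.join(root.itertext()).split())

regex_text = text_via_regex(XML_SAMPLE)
parsed_text = text_via_parser(XML_SAMPLE)

print(repr(regex_text))
print(repr(parsed_text))
```

For a quick substring search the regex is often good enough; the parser is the safer choice when the extracted text itself matters.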
Extracting text from an xml doc with Python ElementTree
I think you're looking for the itertext method:
# Iterate over all the sample blocks
for sample in tree.xpath('//sample'):
    print(''.join(sample.itertext()))
Full code (this uses lxml; the xpath method is not available in the standard-library ElementTree):
# Load module
import lxml.etree as etree
# Load data
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('data.xml', parser)
# Iterate over all the sample blocks
for sample in tree.xpath('//sample'):
    print(''.join(sample.itertext()))
# Output:
# programmer l'
# enregistreur
# des
# oeuvres
# La Chevauchée de Virginia
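If lxml is not installed, roughly the same result can be had from the standard library, since itertext exists there too and iter('sample') visits matching elements at any depth much like the XPath //sample. A minimal sketch, assuming a made-up stand-in for data.xml (the real file is not shown, so the structure and text below are purely illustrative):

```python
from xml.etree import ElementTree as ET

# Hypothetical stand-in for data.xml.
XML_SAMPLE = """
<data>
  <sample>programmer l'<i>enregistreur</i> des <b>oeuvres</b></sample>
  <group>
    <sample>a nested sample</sample>
  </group>
</data>
"""

root = ET.fromstring(XML_SAMPLE)
# iter('sample') walks the whole tree, so nested <sample> elements are found too
samples = [''.join(sample.itertext()) for sample in root.iter('sample')]
for text in samples:
    print(text)
```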
Extract xml text when elements in between text
Using BeautifulSoup:
list_test.xml:
<do title='Example document' date='today'>
<db descr='First level'>
<P>
Some text here that
<af d='reference 1'>continues</af>
but then has some more stuff.
</P>
</db>
</do>
and then:
from bs4 import BeautifulSoup
with open('list_test.xml', 'r') as f:
    soup = BeautifulSoup(f.read(), "html.parser")
# html.parser lowercases tag names, so 'p' matches <P>
for line in soup.find_all('p'):
    print(line.text)
OUTPUT:
Some text here that
continues
but then has some more stuff.
EDIT:
Using ElementTree:
import xml.etree.ElementTree as ET
xml = '<p> Some text here that <af d="reference 1">continues</af> but then has some more stuff.</p>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))
OUTPUT:
Some text here that continues but then has some more stuff.
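The reason itertext is needed here is that ElementTree splits mixed content across two attributes: text before the first child lives in the parent's text, while text following a child's closing tag lives in that child's tail. A short sketch that makes the split visible, reusing the sample string from above with the leading space removed (the variable names are illustrative):

```python
import xml.etree.ElementTree as ET

xml = '<p>Some text here that <af d="reference 1">continues</af> but then has some more stuff.</p>'
p = ET.fromstring(xml)
af = p.find('af')

before = p.text   # text before the first child element
inside = af.text  # text inside <af>
after = af.tail   # text after </af>; stored on the child, not on <p>

# itertext() stitches text and tail pieces back together in document order
full_text = ''.join(p.itertext())

print(repr(before), repr(inside), repr(after))
print(full_text)
```

Reading only p.text would silently drop everything from the af element onward, which is the usual cause of truncated output in this situation.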