Use lxml and iterparse() to parse a big (roughly 1 GB) XML file:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    for child in element:
        print(child.tag, child.text)
    element.clear()

The final clear() will stop you from using too much memory.
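For reference, here is a self-contained sketch of the same loop using an in-memory file (the <BlogPost> document content is made up for illustration):

```python
import io
from lxml import etree

# A tiny stand-in for the ~1 GB file; iterparse reads it incrementally.
xml = b"""<blog>
  <BlogPost><title>First</title><body>Hello</body></BlogPost>
  <BlogPost><title>Second</title><body>World</body></BlogPost>
</blog>"""

seen = []
for event, element in etree.iterparse(io.BytesIO(xml), tag="BlogPost"):
    for child in element:
        seen.append((child.tag, child.text))
    element.clear()  # release the subtree we just processed

print(seen)
```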
[Update:] To get "everything between ... as a string", I guess you want one of:
for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    print(etree.tostring(element))
    element.clear()
or
for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    print(''.join([etree.tostring(child) for child in element]))
    element.clear()
or perhaps even:
for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    print(''.join([child.text for child in element]))
    element.clear()
Parsing large XML file with lxml
You can use etree.iterparse to avoid loading the whole file into memory:

events = ("start", "end")
with open("dblp.xml", "rb") as fo:  # iterparse wants a binary-mode file
    context = etree.iterparse(fo, events=events)
    for action, elem in context:
        pass  # Do something

This allows you to extract only the entities you need while ignoring the others.
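As a quick illustration of how the start/end events interleave, on a toy document standing in for dblp.xml:

```python
import io
from lxml import etree

xml = (b"<dblp><article><title>A</title></article>"
       b"<article><title>B</title></article></dblp>")

actions = []
for action, elem in etree.iterparse(io.BytesIO(xml), events=("start", "end")):
    actions.append((action, elem.tag))
    if action == "end" and elem.tag == "article":
        elem.clear()  # free each article once it is fully parsed

print(actions)
```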
Using Python Iterparse For Large XML Files
Try Liza Daly's fast_iter. After processing an element, elem, it calls elem.clear() to remove descendants and also removes preceding siblings.
def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

def process_element(elem):
    print(elem.xpath('description/text()'))

context = etree.iterparse(MYFILE, tag='item')
fast_iter(context, process_element)
Daly's article is an excellent read, especially if you are processing large XML files.
Edit: The fast_iter posted above is a modified version of Daly's fast_iter. After processing an element, it is more aggressive at removing other elements that are no longer needed.
The script below shows the difference in behavior. Note in particular that orig_fast_iter does not delete the A1 element, while mod_fast_iter does delete it, thus saving more memory.
import lxml.etree as ET
import textwrap
import io

def setup_ABC():
    content = textwrap.dedent('''\
      <root>
        <A1>
          <B1></B1>
          <C>1<D1></D1></C>
          <E1></E1>
        </A1>
        <A2>
          <B2></B2>
          <C>2<D></D></C>
          <E2></E2>
        </A2>
      </root>
      ''')
    return content

def study_fast_iter():
    def orig_fast_iter(context, func, *args, **kwargs):
        for event, elem in context:
            print('Processing {e}'.format(e=ET.tostring(elem)))
            func(elem, *args, **kwargs)
            print('Clearing {e}'.format(e=ET.tostring(elem)))
            elem.clear()
            while elem.getprevious() is not None:
                print('Deleting {p}'.format(
                    p=(elem.getparent()[0]).tag))
                del elem.getparent()[0]
        del context

    def mod_fast_iter(context, func, *args, **kwargs):
        """
        http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        Author: Liza Daly
        See also http://effbot.org/zone/element-iterparse.htm
        """
        for event, elem in context:
            print('Processing {e}'.format(e=ET.tostring(elem)))
            func(elem, *args, **kwargs)
            # It's safe to call clear() here because no descendants will be
            # accessed
            print('Clearing {e}'.format(e=ET.tostring(elem)))
            elem.clear()
            # Also eliminate now-empty references from the root node to elem
            for ancestor in elem.xpath('ancestor-or-self::*'):
                print('Checking ancestor: {a}'.format(a=ancestor.tag))
                while ancestor.getprevious() is not None:
                    print(
                        'Deleting {p}'.format(p=(ancestor.getparent()[0]).tag))
                    del ancestor.getparent()[0]
        del context

    content = setup_ABC()
    # encode() so io.BytesIO receives bytes under Python 3
    context = ET.iterparse(io.BytesIO(content.encode('utf-8')),
                           events=('end',), tag='C')
    orig_fast_iter(context, lambda elem: None)
    # Processing <C>1<D1/></C>
    # Clearing <C>1<D1/></C>
    # Deleting B1
    # Processing <C>2<D/></C>
    # Clearing <C>2<D/></C>
    # Deleting B2

    print('-' * 80)
    """
    The improved fast_iter deletes A1. The original fast_iter does not.
    """
    content = setup_ABC()
    context = ET.iterparse(io.BytesIO(content.encode('utf-8')),
                           events=('end',), tag='C')
    mod_fast_iter(context, lambda elem: None)
    # Processing <C>1<D1/></C>
    # Clearing <C>1<D1/></C>
    # Checking ancestor: root
    # Checking ancestor: A1
    # Checking ancestor: C
    # Deleting B1
    # Processing <C>2<D/></C>
    # Clearing <C>2<D/></C>
    # Checking ancestor: root
    # Checking ancestor: A2
    # Deleting A1
    # Checking ancestor: C
    # Deleting B2

study_fast_iter()
Iterparse big XML, with low memory footprint, and get all, even nested, Sequence Elements
Comment: As it stands, the code only outputs results; outputting them is just for demonstration, tracing, and debugging. To write a record and its addresses into a SQL database, for example using sqlite3, do:
c.execute("INSERT INTO entity(id, name) VALUES(:id, :name)", record)
addresses = []
for addr in record['addresses']:
    addr[1].update({'id': record['id']})
    addresses.append(addr[1])
c.executemany("INSERT INTO addresses(id, address, city) VALUES(:id, :address, :city)", addresses)
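A complete, runnable version of that idea with sqlite3 (the table schemas and the sample record are assumptions, shaped like the dicts yielded by the Entity class below):

```python
import sqlite3

# Hypothetical record, shaped like the Entity output shown further down.
record = {
    'id': '1124353',
    'name': 'DAVID, Beckham',
    'addresses': [
        ('address', {'address': '35-37 Parkgate Road', 'city': 'London'}),
    ],
}

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute("CREATE TABLE entity(id TEXT, name TEXT)")
c.execute("CREATE TABLE addresses(id TEXT, address TEXT, city TEXT)")

c.execute("INSERT INTO entity(id, name) VALUES(:id, :name)",
          {'id': record['id'], 'name': record['name']})
addresses = []
for addr in record['addresses']:
    addr[1].update({'id': record['id']})  # stamp the parent id onto each row
    addresses.append(addr[1])
c.executemany("INSERT INTO addresses(id, address, city) VALUES(:id, :address, :city)",
              addresses)
conn.commit()

rows = c.execute("SELECT id, address, city FROM addresses").fetchall()
print(rows)
```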
To flatten for pandas:
Precondition outside the loop: df = pd.DataFrame()
from copy import copy
addresses = copy(record['addresses'])
del record['addresses']
df_records = []
for addr in addresses:
    row = dict(record)  # fresh dict per row, so the rows don't alias each other
    row.update(addr[1])
    df_records.append(row)
df = df.append(df_records, ignore_index=True)
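A self-contained sketch of that flattening step (the record is made up, shaped like the Entity output below):

```python
from copy import copy

# Hypothetical record, as yielded by the Entity class further down.
record = {
    'id': '1',
    'name': 'A',
    'addresses': [
        ('address', {'city': 'London'}),
        ('address', {'city': 'Paris'}),
    ],
}

addresses = copy(record['addresses'])
del record['addresses']
df_records = []
for addr in addresses:
    row = dict(record)  # copy per row so each address gets its own record
    row.update(addr[1])
    df_records.append(row)

print(df_records)
```

The resulting list can be handed to pd.DataFrame(df_records) in one step; note that DataFrame.append was removed in pandas 2.0, so pd.concat([df, pd.DataFrame(df_records)], ignore_index=True) is the modern replacement for the df.append call above.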
Question: Use etree.iterparse to include all nodes in an XML file

The class Entity below does the following:
- Parses the XML file using lxml.etree.iterparse.
- There is no file-size limit, as each <entity>...</entity> element tree is deleted after processing.
- Builds from every <entity>...</entity> tree a dict {tag: value, ...}.
- Uses a generator object to yield the dict.
- Sequence elements, e.g. <addresses>/<address>, become a list of tuples [(address, {tag: text}), ...].

ToDo:
- To flatten into many records, loop over record['addresses'].
- To equalize different tag names: address and address1.
- To flatten sequence tags, e.g. <titels>, <probs> and <dobs>.
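For the second ToDo item, one way to equalize tag names such as address and address1 is to strip a trailing counter (a sketch; the assumption that variant tags differ only by a numeric suffix is mine):

```python
import re

def normalize_tag(tag):
    # 'address1' -> 'address'; tags without a trailing counter are unchanged
    return re.sub(r'\d+$', '', tag)

print(normalize_tag('address1'), normalize_tag('address'), normalize_tag('titel'))
```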
from lxml import etree

class Entity:
    def __init__(self, fh):
        """
        Initialize 'iterparse' to only generate 'end' events on tag '<entity>'

        :param fh: File Handle of the XML File to parse
        """
        self.context = etree.iterparse(fh, events=("end",), tag=['entity'])

    def _parse(self):
        """
        Parse the XML File for all '<entity>...</entity>' Elements
        Clear/Delete the Element Tree after processing

        :return: Yield the current '<entity>...</entity>' Element Tree
        """
        for event, elem in self.context:
            yield elem
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]

    def sequence(self, elements):
        """
        Expand a Sequence Element, e.g. <titels>, to a Tuple ('titel', text).
        If a nested Sequence Element is found, e.g. <address>,
        expand it to a Tuple ('address', {tag: text})

        :param elements: The Sequence Element
        :return: List of Tuple [(tag1, value), (tag2, value), ..., (tagn, value)]
        """
        _elements = []
        for elem in elements:
            if len(elem):
                _elements.append((elem.tag, dict(self.sequence(elem))))
            else:
                _elements.append((elem.tag, elem.text))
        return _elements

    def __iter__(self):
        """
        Iterate all '<entity>...</entity>' Element Trees yielded from self._parse()

        :return: Dict 'entity' {tag1: value, tag2: value, ..., tagn: value}
        """
        for xml_entity in self._parse():
            entity = {'id': xml_entity.attrib['id']}
            for elem in xml_entity:
                # if elem is a Sequence
                if len(elem):
                    # Append tuple(tag, value)
                    entity[elem.tag] = self.sequence(elem)
                else:
                    entity[elem.tag] = elem.text
            yield entity

if __name__ == "__main__":
    with open('.\\FILE.XML', 'rb') as in_xml:
        for record in Entity(in_xml):
            print("record:{}".format(record))
            for key, value in record.items():
                if isinstance(value, list):
                    # print_list(key, value)
                    print("{}:{}".format(key, value))
                else:
                    print("{}:{}".format(key, value))
Output: Shows only the first record and only 4 fields.
Note: There is a pitfall with non-uniform tag names: address and address1.
record:{'id': '1124353', 'titles': {'title': 'Foot... (omitted for brevity)
id:1124353
name:DAVID, Beckham
titles:[('title', 'Football player')]
addresses:
address:{'city': 'London', 'address': None, 'post... (omitted for brevity)
address:{'city': 'London', 'address1': '35-37 Par... (omitted for brevity)
Tested with Python: 3.5 - lxml.etree: 3.7.1
Python lxml iterparse sort by attribute large xml file
import lxml.etree as ET
from copy import deepcopy

xml_source = 'ss_sky_sw_xmltv.xml'
xml_output = 'ss_sky_sw_xmltv_parsed.xml'

# icons with these dimensions (width, height) will be removed:
remove_dimensions = (
    (180, 135),
    (120, 180),
)

tree = ET.parse(xml_source)
root = tree.getroot()

for programme in root.iterfind('programme'):
    # Create a copy of all icons to reinsert them in the right order
    icons = deepcopy(sorted(programme.findall('icon'),
                            key=lambda x: int(x.attrib['height'])))
    # Remove all icons from programme
    for old_icon in programme.findall('icon'):
        programme.remove(old_icon)
    # Reinsert the items
    for new_icon in icons:
        # Build a (width, height) tuple to compare against the removal list
        dimensions = int(new_icon.attrib['width']), int(new_icon.attrib['height'])
        # Skip icons whose dimensions should be removed (not included again)
        if dimensions not in remove_dimensions:
            programme.append(new_icon)

# Save the file
tree.write(xml_output, xml_declaration=True, pretty_print=True)
Python LXML iterparse function: memory not getting freed while parsing a huge XML
I took the code from https://stackoverflow.com/a/7171543/131187, chopped out comments and print statements, and added a suitable func to get this. I wouldn't like to guess how much time it would take to process a 500 Mb file!
Even in writing func I have done nothing original, having adopted the original authors' use of the xpath expression 'ancestor-or-self::*' to provide the absolute path that you want.
However, since this code conforms more closely to the original scripts it might not leak memory.
import lxml.etree as ET

input_xml = 'temp.xml'
for line in open(input_xml).readlines():
    print(line[:-1])

def mod_fast_iter(context, func, *args, **kwargs):
    for event, elem in context:
        func(elem, *args, **kwargs)
        elem.clear()
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

def func(elem):
    content = '' if not elem.text else elem.text.strip()
    if content:
        ancestors = elem.xpath('ancestor-or-self::*')
        print('%s=%s' % ('.'.join([_.tag for _ in ancestors]), content))

print('\nResult:\n')
context = ET.iterparse(open(input_xml, 'rb'), events=('end',))
mod_fast_iter(context, func)
Output:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE A>
<A>
<B>
<C>
abc
</C>
<D>
abd
</D>
</B>
</A>
Result:
A.B.C=abc
A.B.D=abd
How to parse this huge XML file with nested elements using lxml the efficient way?
You might try something like this:

import MySQLdb
from lxml import etree
import config

def fast_iter(context, func, args=[], kwargs={}):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    for event, elem in context:
        func(elem, *args, **kwargs)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def extract_paper_elements(element, cursor):
    pub = {}
    pub['InventoryID'] = element.attrib['ID']
    try:
        pub['PublisherClassID'] = element.xpath('PublisherClass/@ID')[0]
    except IndexError:
        pub['PublisherClassID'] = None
    for key in ('Name', 'Type', 'ID'):
        try:
            pub[key] = element.xpath(
                'PublisherClass/Publisher/PublisherDetails/{k}/text()'.format(k=key))[0]
        except IndexError:
            pub[key] = None
    sql = '''INSERT INTO Publishers (InventoryID, PublisherClassID, Name, Type, ID)
             VALUES (%s, %s, %s, %s, %s)
          '''
    args = [pub.get(key) for key in
            ('InventoryID', 'PublisherClassID', 'Name', 'Type', 'ID')]
    print(args)
    # cursor.execute(sql, args)
    for bookdetail in element.xpath('descendant::BookList/Listing/Book/BookDetail'):
        pub['BookDetailID'] = bookdetail.attrib['ID']
        for key in ('BookName', 'Author', 'Pages', 'ISBN'):
            try:
                pub[key] = bookdetail.xpath('{k}/text()'.format(k=key))[0]
            except IndexError:
                pub[key] = None
        sql = '''INSERT INTO Books
                 (PublisherID, BookDetailID, Name, Author, Pages, ISBN)
                 VALUES (%s, %s, %s, %s, %s, %s)
              '''
        args = [pub.get(key) for key in
                ('ID', 'BookDetailID', 'BookName', 'Author', 'Pages', 'ISBN')]
        # cursor.execute(sql, args)
        print(args)

def main():
    context = etree.iterparse("book.xml", events=("end",), tag='Inventory')
    connection = MySQLdb.connect(
        host=config.HOST, user=config.USER,
        passwd=config.PASS, db=config.MYDB)
    cursor = connection.cursor()
    fast_iter(context, extract_paper_elements, args=(cursor,))
    cursor.close()
    connection.commit()
    connection.close()

if __name__ == '__main__':
    main()
- Don't use fast_iter2. The original fast_iter separates the reusable utility from the specific processing function (extract_paper_elements). fast_iter2 mixes the two together, leaving you with no reusable code.
- If you set the tag parameter, as in etree.iterparse("book.xml", events=("end",), tag='Inventory'), then your processing function extract_paper_elements will only see Inventory elements.
- Given an Inventory element, you can use the xpath method to burrow down and scrape the desired data.
- The args and kwargs parameters were added to fast_iter so cursor can be passed to extract_paper_elements.
Parse large XML with lxml
You are parsing a namespaced document, and there is no 'page' tag present, because that name only matches tags without a namespace. You are instead looking for the '{http://www.mediawiki.org/xml/export-0.8/}page' element, which contains a '{http://www.mediawiki.org/xml/export-0.8/}ns' element.
Many lxml methods do let you specify a namespace map to make matching easier, but the iterparse() method is unfortunately not one of them.
The following .iterparse() call certainly processes the right page tags:
context = etree.iterparse('test.xml', events=('end',), tag='{http://www.mediawiki.org/xml/export-0.8/}page')
but you'll need to use .find() to get the ns and title tags on the page element, or use xpath() calls to get the text directly:
def process_element(elem):
    if elem.xpath("./*[local-name()='ns']/text() = 0"):
        print(elem.xpath("./*[local-name()='title']/text()")[0])
which, for your input example, prints:
>>> fast_iter(context, process_element)
MediaWiki:Category
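If you would rather not repeat the long namespace URI everywhere, lxml's etree.QName can split a namespaced tag into its parts so you can match on the local name. A sketch on a made-up MediaWiki-style snippet:

```python
import io
from lxml import etree

xml = b'''<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><ns>0</ns><title>MediaWiki:Category</title></page>
  <page><ns>1</ns><title>Talk:Something</title></page>
</mediawiki>'''

titles = []
for event, elem in etree.iterparse(io.BytesIO(xml), events=('end',)):
    if etree.QName(elem).localname == 'page':
        # findtext() still needs the namespace; recover it from the element's own tag
        ns = etree.QName(elem).namespace
        if elem.findtext('{%s}ns' % ns) == '0':
            titles.append(elem.findtext('{%s}title' % ns))
        elem.clear()  # release the finished page subtree

print(titles)
```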