Using Lxml and Iterparse() to Parse a Big (+- 1Gb) Xml File

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    for child in element:
        print(child.tag, child.text)
    element.clear()

The final clear() will stop you from using too much memory, since processed elements are not kept around.

[Update:] To get "everything between ... as a string", I guess you want one of:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    print(etree.tostring(element))
    element.clear()

or

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    print(''.join([etree.tostring(child) for child in element]))
    element.clear()

or perhaps even:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    print(''.join([child.text for child in element]))
    element.clear()

Parsing large XML file with lxml

You can use etree.iterparse to avoid loading the whole file into memory:

from lxml import etree

events = ("start", "end")
with open("dblp.xml", "rb") as fo:  # lxml expects a binary file object
    context = etree.iterparse(fo, events=events)
    for action, elem in context:
        ...  # Do something

This will allow you to extract only the entities you need while ignoring the others.
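For instance, a sketch of pulling only one element type out of a dblp-style file, using the tag parameter (the tiny inline document and tag names here are made up):

```python
import io
from lxml import etree

# Made-up miniature document standing in for dblp.xml
data = io.BytesIO(
    b"<dblp>"
    b"<article key='a1'><title>X</title></article>"
    b"<phdthesis key='t1'><title>Y</title></phdthesis>"
    b"</dblp>")

titles = []
# tag='article' means only <article> end events reach the loop body
for action, elem in etree.iterparse(data, events=("end",), tag="article"):
    titles.append(elem.findtext("title"))
    elem.clear()  # free the processed subtree
```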

Using Python Iterparse For Large XML Files

Try Liza Daly's fast_iter. After processing an element, elem, it calls elem.clear() to remove descendants and also removes preceding siblings.

def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

def process_element(elem):
    print(elem.xpath('description/text()'))

context = etree.iterparse(MYFILE, tag='item')
fast_iter(context, process_element)

Daly's article is an excellent read, especially if you are processing large XML files.


Edit: The fast_iter posted above is a modified version of Daly's fast_iter. After processing an element, it is more aggressive at removing other elements that are no longer needed.

The script below shows the difference in behavior. Note in particular that orig_fast_iter does not delete the A1 element, while the mod_fast_iter does delete it, thus saving more memory.

import lxml.etree as ET
import textwrap
import io

def setup_ABC():
    content = textwrap.dedent('''\
        <root>
          <A1>
            <B1></B1>
            <C>1<D1></D1></C>
            <E1></E1>
          </A1>
          <A2>
            <B2></B2>
            <C>2<D></D></C>
            <E2></E2>
          </A2>
        </root>
        ''')
    return content

def study_fast_iter():
    def orig_fast_iter(context, func, *args, **kwargs):
        for event, elem in context:
            print('Processing {e}'.format(e=ET.tostring(elem)))
            func(elem, *args, **kwargs)
            print('Clearing {e}'.format(e=ET.tostring(elem)))
            elem.clear()
            while elem.getprevious() is not None:
                print('Deleting {p}'.format(
                    p=(elem.getparent()[0]).tag))
                del elem.getparent()[0]
        del context

    def mod_fast_iter(context, func, *args, **kwargs):
        """
        http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        Author: Liza Daly
        See also http://effbot.org/zone/element-iterparse.htm
        """
        for event, elem in context:
            print('Processing {e}'.format(e=ET.tostring(elem)))
            func(elem, *args, **kwargs)
            # It's safe to call clear() here because no descendants will be
            # accessed
            print('Clearing {e}'.format(e=ET.tostring(elem)))
            elem.clear()
            # Also eliminate now-empty references from the root node to elem
            for ancestor in elem.xpath('ancestor-or-self::*'):
                print('Checking ancestor: {a}'.format(a=ancestor.tag))
                while ancestor.getprevious() is not None:
                    print(
                        'Deleting {p}'.format(p=(ancestor.getparent()[0]).tag))
                    del ancestor.getparent()[0]
        del context

    content = setup_ABC()
    context = ET.iterparse(io.BytesIO(content.encode('utf-8')),
                           events=('end', ), tag='C')
    orig_fast_iter(context, lambda elem: None)
    # Processing <C>1<D1/></C>
    # Clearing <C>1<D1/></C>
    # Deleting B1
    # Processing <C>2<D/></C>
    # Clearing <C>2<D/></C>
    # Deleting B2

    print('-' * 80)
    """
    The improved fast_iter deletes A1. The original fast_iter does not.
    """
    content = setup_ABC()
    context = ET.iterparse(io.BytesIO(content.encode('utf-8')),
                           events=('end', ), tag='C')
    mod_fast_iter(context, lambda elem: None)
    # Processing <C>1<D1/></C>
    # Clearing <C>1<D1/></C>
    # Checking ancestor: root
    # Checking ancestor: A1
    # Checking ancestor: C
    # Deleting B1
    # Processing <C>2<D/></C>
    # Clearing <C>2<D/></C>
    # Checking ancestor: root
    # Checking ancestor: A2
    # Deleting A1
    # Checking ancestor: C
    # Deleting B2

study_fast_iter()

Iterparse big XML, with low memory footprint, and get all, even nested, Sequence Elements

Comment: As it now only outputs results

Outputting results is only for demonstration, tracing and debugging.

To write a record and its addresses into a SQL database, for example using sqlite3, do:

c.execute("INSERT INTO entity(id, name) VALUES(:id, :name)", record)
addresses = []
for addr in record['addresses']:
    addr[1].update({'id': record['id']})
    addresses.append(addr[1])
c.executemany("INSERT INTO addresses(id, address, city) VALUES(:id, :address, :city)", addresses)
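A self-contained sketch of the above, with assumed table layouts derived from the INSERT statements and made-up record data:

```python
import sqlite3

# Hypothetical schemas inferred from the INSERTs; adjust to your real tables
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE entity(id TEXT, name TEXT)")
c.execute("CREATE TABLE addresses(id TEXT, address TEXT, city TEXT)")

# Made-up record shaped like the Entity class output
record = {'id': '1124353', 'name': 'DAVID, Beckham',
          'addresses': [('address', {'address': '35-37 Parkgate Rd',
                                     'city': 'London'})]}

# Unused dict keys (like 'addresses') are ignored by named placeholders
c.execute("INSERT INTO entity(id, name) VALUES(:id, :name)", record)
addresses = []
for addr in record['addresses']:
    addr[1].update({'id': record['id']})  # carry the foreign key into each row
    addresses.append(addr[1])
c.executemany("INSERT INTO addresses(id, address, city) "
              "VALUES(:id, :address, :city)", addresses)
conn.commit()
```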

To flatten for pandas

Precondition outside the loop: df = pd.DataFrame()

from copy import copy

addresses = copy(record['addresses'])
del record['addresses']

df_records = []
for addr in addresses:
    # Build a fresh dict per row; updating 'record' in place would append
    # the same object repeatedly and every row would get the last address
    df_records.append({**record, **addr[1]})

df = df.append(df_records, ignore_index=True)
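Note that DataFrame.append was removed in pandas 2.0; an equivalent with pd.concat (using made-up record data) might look like:

```python
import pandas as pd

df = pd.DataFrame()  # precondition outside the loop, as above

# Made-up record/addresses data for illustration
record = {'id': '1124353', 'name': 'DAVID, Beckham'}
addresses = [('address', {'city': 'London'}),
             ('address', {'city': 'Manchester'})]

df_records = []
for addr in addresses:
    df_records.append({**record, **addr[1]})  # fresh dict per row

# DataFrame.append() was removed in pandas 2.0; concat is the replacement
df = pd.concat([df, pd.DataFrame(df_records)], ignore_index=True)
```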

Question: Use etree.iterparse to include all nodes in XML file

The class Entity below does the following:

  • Parses the XML file using lxml.etree.iterparse.
  • There is no file size limit, as each <entity>...</entity> element tree is deleted after processing.
  • Builds from every <entity>...</entity> tree a dict {tag: value, ...}.
  • Uses a generator object to yield the dicts.
  • Sequence elements, e.g. <addresses>/<address>, become a list of tuples [(address, {tag: text}), ...].

ToDo:

  • To flatten into many records, loop over record['addresses'].
  • To normalize different tag names: address and address1.
  • To flatten sequence tags, e.g. <titles>, <probs> and <dobs>.
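The first ToDo item could be sketched as a small generator (the record data here is made up, shaped like the sample output further down):

```python
def flatten(record):
    """Yield one flat dict per address; a sketch for the first ToDo item."""
    base = {k: v for k, v in record.items() if k != 'addresses'}
    for tag, fields in record.get('addresses', []):
        row = dict(base)   # copy so every yielded row is independent
        row.update(fields)
        yield row

# Made-up record for illustration
record = {'id': '1124353', 'name': 'DAVID, Beckham',
          'addresses': [('address', {'city': 'London'}),
                        ('address', {'city': 'Manchester'})]}
rows = list(flatten(record))
```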

from lxml import etree

class Entity:
    def __init__(self, fh):
        """
        Initialize 'iterparse' to only generate 'end' events on tag '<entity>'

        :param fh: File Handle of the XML File to parse
        """
        self.context = etree.iterparse(fh, events=("end",), tag=['entity'])

    def _parse(self):
        """
        Parse the XML File for all '<entity>...</entity>' Elements
        Clear/Delete the Element Tree after processing

        :return: Yield the current '<entity>...</entity>' Element Tree
        """
        for event, elem in self.context:
            yield elem

            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]

    def sequence(self, elements):
        """
        Expand a Sequence Element, e.g. <titles>, to a Tuple ('title', text).
        If a nested Sequence Element is found, e.g. <address>,
        expand it to a Tuple ('address', {tag: text})

        :param elements: The Sequence Element
        :return: List of Tuple [(tag1, value), (tag2, value), ..., (tagn, value)]
        """
        _elements = []
        for elem in elements:
            if len(elem):
                _elements.append((elem.tag, dict(self.sequence(elem))))
            else:
                _elements.append((elem.tag, elem.text))

        return _elements

    def __iter__(self):
        """
        Iterate all '<entity>...</entity>' Element Trees yielded from self._parse()

        :return: Dict var 'entity' {tag1: value, tag2: value, ..., tagn: value}
        """
        for xml_entity in self._parse():
            entity = {'id': xml_entity.attrib['id']}

            for elem in xml_entity:
                # if elem is a Sequence
                if len(elem):
                    # Append list of tuple(tag, value)
                    entity[elem.tag] = self.sequence(elem)
                else:
                    entity[elem.tag] = elem.text

            yield entity

if __name__ == "__main__":
    with open('.\\FILE.XML', 'rb') as in_xml:
        for record in Entity(in_xml):
            print("record:{}".format(record))

            for key, value in record.items():
                if isinstance(value, list):
                    #print_list(key, value)
                    print("{}:{}".format(key, value))
                else:
                    print("{}:{}".format(key, value))

Output: Shows only the first Record and only 4 fields.

Note: There is a pitfall with non-uniform tag names: address and address1

record:{'id': '1124353', 'titles': {'title': 'Foot... (omitted for brevity)
id:1124353
name:DAVID, Beckham
titles:[('title', 'Football player')]
addresses:
address:{'city': 'London', 'address': None, 'post... (omitted for brevity)
address:{'city': 'London', 'address1': '35-37 Par... (omitted for brevity)

Tested with Python: 3.5 - lxml.etree: 3.7.1

Python lxml iterparse sort by attribute large xml file

import lxml.etree as ET
from copy import deepcopy

xml_source = 'ss_sky_sw_xmltv.xml'
xml_output = 'ss_sky_sw_xmltv_parsed.xml'
# icons with these dimensions (width, height) will be removed:
remove_dimensions = (
    (180, 135),
    (120, 180),
)

tree = ET.parse(xml_source)
root = tree.getroot()
for programme in root.iterfind('programme'):
    # Create a copy of all icons to reinsert them in the right order
    icons = deepcopy(sorted(programme.findall('icon'),
                            key=lambda x: int(x.attrib['height'])))
    # Remove all icons from programme
    for old_icon in programme.findall('icon'):
        programme.remove(old_icon)

    # Reinsert the items
    for new_icon in icons:
        # Create a tuple of dimensions to compare
        dimensions = int(new_icon.attrib['width']), int(new_icon.attrib['height'])
        # Reinsert only icons whose dimensions are not flagged for removal
        if dimensions not in remove_dimensions:
            programme.append(new_icon)

# Save the file
tree.write(xml_output, xml_declaration=True, pretty_print=True)

Python LXML iterparse function: memory not getting freed while parsing a huge XML

I took the code from https://stackoverflow.com/a/7171543/131187, chopped out comments and print statements, and added a suitable func to get this. I wouldn't like to guess how much time it would take to process a 500 MB file!

Even in writing func I have done nothing original, having adopted the original author's use of the xpath expression 'ancestor-or-self::*' to provide the absolute path that you want.

However, since this code conforms more closely to the original scripts, it might not leak memory.

import lxml.etree as ET

input_xml = 'temp.xml'
for line in open(input_xml).readlines():
    print(line[:-1])

def mod_fast_iter(context, func, *args, **kwargs):
    for event, elem in context:
        func(elem, *args, **kwargs)
        elem.clear()
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

def func(elem):
    content = '' if not elem.text else elem.text.strip()
    if content:
        ancestors = elem.xpath('ancestor-or-self::*')
        print('%s=%s' % ('.'.join([_.tag for _ in ancestors]), content))

print('\nResult:\n')
context = ET.iterparse(open(input_xml, 'rb'), events=('end', ))
mod_fast_iter(context, func)

Output:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE A>
<A>
<B>
<C>
abc
</C>
<D>
abd
</D>
</B>
</A>

Result:

A.B.C=abc
A.B.D=abd

How to parse this huge XML file with nested elements using lxml the efficient way?

You might try something like this:

import MySQLdb
from lxml import etree
import config

def fast_iter(context, func, args=(), kwargs={}):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    for event, elem in context:
        func(elem, *args, **kwargs)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def extract_paper_elements(element, cursor):
    pub = {}
    pub['InventoryID'] = element.attrib['ID']
    try:
        pub['PublisherClassID'] = element.xpath('PublisherClass/@ID')[0]
    except IndexError:
        pub['PublisherClassID'] = None
    for key in ('Name', 'Type', 'ID'):
        try:
            pub[key] = element.xpath(
                'PublisherClass/Publisher/PublisherDetails/{k}/text()'.format(k=key))[0]
        except IndexError:
            pub[key] = None
    sql = '''INSERT INTO Publishers (InventoryID, PublisherClassID, Name, Type, ID)
             VALUES (%s, %s, %s, %s, %s)
          '''
    args = [pub.get(key) for key in
            ('InventoryID', 'PublisherClassID', 'Name', 'Type', 'ID')]
    print(args)
    # cursor.execute(sql, args)
    for bookdetail in element.xpath('descendant::BookList/Listing/Book/BookDetail'):
        pub['BookDetailID'] = bookdetail.attrib['ID']
        for key in ('BookName', 'Author', 'Pages', 'ISBN'):
            try:
                pub[key] = bookdetail.xpath('{k}/text()'.format(k=key))[0]
            except IndexError:
                pub[key] = None
        sql = '''INSERT INTO Books
                 (PublisherID, BookDetailID, Name, Author, Pages, ISBN)
                 VALUES (%s, %s, %s, %s, %s, %s)
              '''
        args = [pub.get(key) for key in
                ('ID', 'BookDetailID', 'BookName', 'Author', 'Pages', 'ISBN')]
        # cursor.execute(sql, args)
        print(args)

def main():
    context = etree.iterparse("book.xml", events=("end",), tag='Inventory')
    connection = MySQLdb.connect(
        host=config.HOST, user=config.USER,
        passwd=config.PASS, db=config.MYDB)
    cursor = connection.cursor()

    fast_iter(context, extract_paper_elements, args=(cursor,))

    cursor.close()
    connection.commit()
    connection.close()

if __name__ == '__main__':
    main()
  1. Don't use fast_iter2. The original fast_iter separates the useful
     utility from the specific processing function
     (extract_paper_elements). fast_iter2 mixes the two together,
     leaving you with no reusable code.
  2. If you set the tag parameter in etree.iterparse("book.xml",
     events=("end",), tag='Inventory'), then your processing function
     extract_paper_elements will only see Inventory elements.
  3. Given an Inventory element, you can use the xpath method to burrow
     down and scrape the desired data.
  4. args and kwargs parameters were added to fast_iter so cursor
     can be passed to extract_paper_elements.

Parse large XML with lxml

You are parsing a namespaced document, and there is no 'page' tag present, because that only applies to tags without a namespace.

You are instead looking for the '{http://www.mediawiki.org/xml/export-0.8/}page' element, which contains a '{http://www.mediawiki.org/xml/export-0.8/}ns' element.

Many lxml methods do let you specify a namespace map to make matching easier, but the iterparse() method is not one of them, unfortunately.

The following .iterparse() call certainly processes the right page tags:

context = etree.iterparse('test.xml', events=('end',), tag='{http://www.mediawiki.org/xml/export-0.8/}page')

but you'll need to use .find() to get the ns and title tags on the page element, or use xpath() calls to get the text directly:

def process_element(elem):
    if elem.xpath("./*[local-name()='ns']/text()=0"):
        print(elem.xpath("./*[local-name()='title']/text()")[0])

which, for your input example, prints:

>>> fast_iter(context, process_element)
MediaWiki:Category
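The .find()/.findtext() route with fully qualified tag names could be sketched like this (the inline miniature export document is made up, just to exercise the qualified-name lookups):

```python
import io
from lxml import etree

NS = '{http://www.mediawiki.org/xml/export-0.8/}'

# Made-up miniature export, standing in for the real dump
data = io.BytesIO(
    b"<mediawiki xmlns='http://www.mediawiki.org/xml/export-0.8/'>"
    b"<page><ns>0</ns><title>MediaWiki:Category</title></page>"
    b"<page><ns>1</ns><title>Talk:Something</title></page>"
    b"</mediawiki>")

titles = []
for event, elem in etree.iterparse(data, events=('end',), tag=NS + 'page'):
    # findtext() also needs the fully qualified child tag names
    if elem.findtext(NS + 'ns') == '0':
        titles.append(elem.findtext(NS + 'title'))
    elem.clear()
```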

