What Is the Fastest Way to Parse Large XML Docs in Python

What is the fastest way to parse large XML docs in Python?

It looks to me as if you do not need any DOM capabilities in your program. I would second the use of the (c)ElementTree library. If you use the iterparse function of the cElementTree module, you can work your way through the XML and deal with events as they occur.

Note, however, Fredrik Lundh's advice on using cElementTree's iterparse function:

to parse large files, you can get rid of elements as soon as you’ve processed them:

for event, elem in iterparse(source):
    if elem.tag == "record":
        ... process record elements ...
        elem.clear()

The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:

# get an iterable
context = iterparse(source, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
event, root = context.next()

for event, elem in context:
    if event == "end" and elem.tag == "record":
        ... process record elements ...
        root.clear()

Note that lxml's iterparse() does not allow this.
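With lxml you work around this differently. Here is a minimal sketch of the widely used fast_iter idiom (the function name and the "record" tag are illustrative), which clears each element and then deletes already-processed siblings so the root does not accumulate empty children:

from lxml import etree

def fast_iter(source, tag="record"):
    # lxml's iterparse can filter on a tag name directly
    for event, elem in etree.iterparse(source, events=("end",), tag=tag):
        # ... process the record element here ...
        elem.clear()
        # drop references to already-processed siblings kept by the root
        while elem.getprevious() is not None:
            del elem.getparent()[0]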

The previous snippet does not work on Python 3 (for example, 3.7): the iterator's .next() method is Python 2 only. Consider the following way to get the first element instead.

import xml.etree.ElementTree as ET

# Get an iterable.
context = ET.iterparse(source, events=("start", "end"))

for index, (event, elem) in enumerate(context):
    # Get the root element.
    if index == 0:
        root = elem
    if event == "end" and elem.tag == "record":
        # ... process record elements ...
        root.clear()

Large XML File Parsing in Python

Consider iterparse for fast streaming processing that builds the tree incrementally. In each iteration, append a dictionary to a list that you can then pass into the pandas.DataFrame constructor once, outside the loop. Adjust the code below to the name of the repeating nodes among the root's children:

from xml.etree.ElementTree import iterparse
# (on Python 3.3+ the C accelerator is used automatically; cElementTree is deprecated)
import pandas as pd

file_path = r"/path/to/Input.xml"
dict_list = []

for _, elem in iterparse(file_path, events=("end",)):
    if elem.tag == "row":
        dict_list.append({'rowId': elem.attrib['Id'],
                          'UserId': elem.attrib['UserId'],
                          'Name': elem.attrib['Name'],
                          'Date': elem.attrib['Date'],
                          'Class': elem.attrib['Class'],
                          'TagBased': elem.attrib['TagBased']})

        # dict_list.append(elem.attrib)  # alternatively, parse all attributes

        elem.clear()

df = pd.DataFrame(dict_list)
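If your pandas version is recent enough (1.5+), read_xml can do the streaming parse itself via its iterparse argument. A minimal sketch, assuming the same repeating "row" element and attribute names as above:

import pandas as pd

# pandas 1.5+ maps the repeating element name to the attributes/children to extract
df = pd.read_xml(
    r"/path/to/Input.xml",
    iterparse={"row": ["Id", "UserId", "Name", "Date", "Class", "TagBased"]},
)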

How to iteratively parse a large XML file in Python?

Iterating over a huge XML file is always painful.

I'll go over the whole process from start to finish, suggesting best practices for keeping memory usage low while maximizing parsing speed.

First, there is no need to store the result of ET.iterparse in a variable. Just iterate over it directly:

for event, elem in ET.iterparse(xml_file, events=("start", "end")):

This iterator is built for, well..., iteration, without storing anything else in memory except the current element. Also, you don't need root.clear() with this approach, and you can keep going for as long as your hard disk space allows, even for huge XML files.

Your code should look like:

from xml.etree import ElementTree as ET  # the C accelerator is used automatically on Python 3

def get_all_records(xml_file_path, record_category, name_types, name_components):
    all_records = []
    for event, elem in ET.iterparse(xml_file_path, events=("start", "end")):
        if event == 'end' and elem.tag == record_category and elem.attrib['action'] != 'del':
            # get_record() is the question's own helper, defined elsewhere
            record_contents = get_record(elem, name_types=name_types,
                                         name_components=name_components,
                                         record_id=elem.attrib['id'])
            if record_contents:
                all_records += record_contents
    return all_records

Also, please think carefully about why you need to store the whole all_records list. If it's only for writing a CSV file at the end of the process, that reason isn't good enough, and it can cause memory issues when scaling to even bigger XML files.

Make sure you write each new row to the CSV as that row is parsed, turning memory issues into a non-issue.
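For example, here is a minimal sketch of writing rows as they are parsed (the "record" tag, file names, and column names are illustrative):

import csv
import xml.etree.ElementTree as ET

with open("records.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name"])  # header row
    for event, elem in ET.iterparse("huge.xml", events=("end",)):
        if elem.tag == "record":
            # write the row immediately instead of accumulating it in a list
            writer.writerow([elem.attrib.get("id"), elem.findtext("name")])
            elem.clear()  # free the element once its row is on disk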

P.S.

If you need to store several tags before you find your main tag, in order to use that historic information as you go down the XML file, just store it locally in new variables. This comes in handy whenever data later in the XML file makes you look back to a specific tag you know has already occurred.
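For example, a minimal sketch of carrying such context in a local variable (the category and item tag names are made up for illustration):

import xml.etree.ElementTree as ET

current_category = None
for event, elem in ET.iterparse("huge.xml", events=("start", "end")):
    if event == "start" and elem.tag == "category":
        # remember the enclosing category for the items that follow
        current_category = elem.attrib.get("name")
    elif event == "end" and elem.tag == "item":
        print(current_category, elem.findtext("title"))
        elem.clear()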

Python: How to process a large XML file with a lot of children in one root

Consider iterparse, which allows you to work on elements as the tree is being built. The code below checks whether the name attribute equals the surname attribute. Use the if block for further processing, such as conditionally appending values to a list:

import xml.etree.ElementTree as et

data = []
path = "/path/to/source.xml"

# get an iterable
context = et.iterparse(path, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
ev, root = next(context)

for ev, el in context:
    if ev == 'start' and el.tag == 'detail':
        print(el.attrib['name'] == el.attrib['surname'])
        data.append([el.attrib['name'], el.attrib['surname']])
    root.clear()

print(data)
# False
# False
# False

# [['John', 'Smith'], ['Michael', 'Smith'], ['Nick', 'Smith']]

Parsing big XML files efficiently

Based on the links you posted in the comments, I came up with the following, which iterates and splits more efficiently and works fine:

from lxml import etree
import pandas as pd

file = 'some_huge_file.xml'

time = []
data1_element1_x = []
data1_element1_y = []
data1_element2_x = []
data1_element2_y = []
data2_element1_x = []
data2_element1_y = []
data2_element2_x = []
data2_element2_y = []

if file.lower().endswith('.xml'):
    for event, elem in etree.iterparse(file):
        if elem.tag == "subelement":
            time.append(elem.get('tc'))
            for child in elem:
                if child.tag == "element1":
                    split_data = child.text.split(" ")
                    data1_element1_x.append(float(split_data[0]))
                    data1_element1_y.append(float(split_data[1]))
                    data2_element1_x.append(float(split_data[2]))
                    data2_element1_y.append(float(split_data[3]))
                elif child.tag == "element2":
                    split_data = child.text.split(" ")
                    data1_element2_x.append(float(split_data[0]))
                    data1_element2_y.append(float(split_data[1]))
                    data2_element2_x.append(float(split_data[2]))
                    data2_element2_y.append(float(split_data[3]))
        elem.clear()

    df = pd.DataFrame({
        'Time': time,
        'Data1_element1_x': data1_element1_x,
        'Data1_element1_y': data1_element1_y,
        'Data1_element2_x': data1_element2_x,
        'Data1_element2_y': data1_element2_y,
        'Data2_element1_x': data2_element1_x,
        'Data2_element1_y': data2_element1_y,
        'Data2_element2_x': data2_element2_x,
        'Data2_element2_y': data2_element2_y
    })

    print(df)

