What is the fastest way to parse large XML docs in Python?
It looks to me as if you do not need any DOM capabilities in your program. I would second the use of the (c)ElementTree library. If you use the iterparse function of the cElementTree module, you can work your way through the XML and deal with the events as they occur. (On Python 3.3+, plain xml.etree.ElementTree uses the C implementation automatically, so it is just as fast.)
Note, however, Fredrik Lundh's advice on using the cElementTree iterparse function:
to parse large files, you can get rid of elements as soon as you’ve processed them:

for event, elem in iterparse(source):
    if elem.tag == "record":
        ...  # process record elements
        elem.clear()
The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:
# get an iterable
context = iterparse(source, events=("start", "end"))
# turn it into an iterator
context = iter(context)
# get the root element
event, root = context.next()

for event, elem in context:
    if event == "end" and elem.tag == "record":
        ...  # process record elements
        root.clear()
Note that lxml.etree.iterparse() does not support this pattern directly.
The snippet above uses Python 2's context.next(), which does not work on Python 3 (e.g. 3.7). On Python 3, consider the following way to get the first element:
import xml.etree.ElementTree as ET

# Get an iterable.
context = ET.iterparse(source, events=("start", "end"))

for index, (event, elem) in enumerate(context):
    # Get the root element.
    if index == 0:
        root = elem
    if event == "end" and elem.tag == "record":
        # ... process record elements ...
        root.clear()
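For lxml, where the root-clearing pattern above does not apply directly, a commonly used workaround is to clear each processed element and also delete its already-processed siblings from the parent, since lxml keeps sibling references alive. A minimal sketch (the `record` tag and inline sample document are hypothetical; assumes lxml is installed):

```python
from io import BytesIO
from lxml import etree

# Hypothetical sample input; in practice pass a file path to iterparse().
xml = b"<log><record id='1'/><record id='2'/><record id='3'/></log>"

ids = []
for event, elem in etree.iterparse(BytesIO(xml), events=("end",), tag="record"):
    ids.append(elem.get("id"))      # process the record
    elem.clear()                    # drop the element's own content
    # lxml parents keep references to earlier siblings; delete them too
    while elem.getprevious() is not None:
        del elem.getparent()[0]

print(ids)  # ['1', '2', '3']
```

The `tag="record"` filter is an lxml-specific convenience that limits the events to the repeating element you care about.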
Large XML File Parsing in Python
Consider iterparse for fast streaming processing that builds the tree incrementally. In each iteration, build a list of dictionaries that you can then pass to the pandas.DataFrame constructor once, outside the loop. Adjust the code below to the tag name of the repeating nodes among the root's children:
from xml.etree.ElementTree import iterparse
import pandas as pd

file_path = r"/path/to/Input.xml"
dict_list = []

for _, elem in iterparse(file_path, events=("end",)):
    if elem.tag == "row":
        dict_list.append({'rowId': elem.attrib['Id'],
                          'UserId': elem.attrib['UserId'],
                          'Name': elem.attrib['Name'],
                          'Date': elem.attrib['Date'],
                          'Class': elem.attrib['Class'],
                          'TagBased': elem.attrib['TagBased']})
        # dict_list.append(elem.attrib)  # alternatively, parse all attributes
    elem.clear()

df = pd.DataFrame(dict_list)
How to iteratively parse a large XML file in Python?
Iterating over a huge XML file is always painful. I'll go over the whole process from start to finish, suggesting best practices for keeping memory usage low while maximizing parsing speed.

First, there is no need to store ET.iterparse() in a variable. Just iterate over it directly:

for event, elem in ET.iterparse(xml_file, events=("start", "end")):

This iterator is created for, well, iteration, without storing anything else in memory except the current tag. Also, you don't need root.clear() with this new approach, and you can go as far as your hard disk space allows for huge XML files.
Your code should look like:
from xml.etree import ElementTree as ET  # cElementTree was removed in Python 3.9

def get_all_records(xml_file_path, record_category, name_types, name_components):
    all_records = []
    for event, elem in ET.iterparse(xml_file_path, events=("start", "end")):
        if event == 'end' and elem.tag == record_category and elem.attrib['action'] != 'del':
            # get_record() is the asker's own helper function
            record_contents = get_record(elem, name_types=name_types,
                                         name_components=name_components,
                                         record_id=elem.attrib['id'])
            if record_contents:
                all_records += record_contents
    return all_records
Also, please think carefully about whether you really need to store the whole all_records list. If it exists only so you can write a CSV file at the end of the process, that reason isn't good enough: it can cause memory issues when scaling to even bigger XML files. Write each new row to the CSV as it is produced, turning the memory issue into a non-issue.
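As a sketch of that streaming approach (the tag names, field names, and inline sample document here are hypothetical), open the CSV writer up front and emit one row per record instead of accumulating a list:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample input; in practice pass a file path to iterparse().
xml_source = io.StringIO(
    "<records>"
    "<record id='1' name='Ann'/>"
    "<record id='2' name='Bob'/>"
    "</records>"
)

with open("records.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name"])  # header row
    for event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "record":
            # write the row immediately instead of appending to a list
            writer.writerow([elem.get("id"), elem.get("name")])
            elem.clear()  # free the processed element
```

Memory usage now stays flat regardless of how many records the input contains.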
P.S.
If you need to remember values from tags that appear before your main tag, just store them locally in variables as you parse down the XML file. This comes in handy whenever data later in the file refers back to a tag you know has already occurred.
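For example (the `shelf`/`book` tag names and inline sample document are hypothetical), you can capture a value from an enclosing tag at its "start" event and use it when the inner records arrive:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample input; in practice pass a file path to iterparse().
xml_source = io.StringIO(
    "<library>"
    "<shelf label='A'><book title='X'/><book title='Y'/></shelf>"
    "<shelf label='B'><book title='Z'/></shelf>"
    "</library>"
)

current_shelf = None  # locally stored "historic" information
books = []
for event, elem in ET.iterparse(xml_source, events=("start", "end")):
    if event == "start" and elem.tag == "shelf":
        current_shelf = elem.get("label")  # remember the enclosing tag's value
    elif event == "end" and elem.tag == "book":
        books.append((current_shelf, elem.get("title")))
        elem.clear()

print(books)  # [('A', 'X'), ('A', 'Y'), ('B', 'Z')]
```

Attributes are guaranteed to be present at the "start" event, so the enclosing tag's value is safe to read there even though its children have not been parsed yet.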
Python: How to process a large XML file with a lot of children in one root
Consider iterparse, which allows you to work on elements as the tree is being built. The code below checks whether the name attribute is equal to the surname attribute. Use the if block to process further, e.g. to conditionally append values to a list:
import xml.etree.ElementTree as et

data = []
path = "/path/to/source.xml"

# get an iterable
context = et.iterparse(path, events=("start", "end"))
# turn it into an iterator
context = iter(context)
# get the root element
ev, root = next(context)

for ev, el in context:
    if ev == 'start' and el.tag == 'detail':
        print(el.attrib['name'] == el.attrib['surname'])
        data.append([el.attrib['name'], el.attrib['surname']])
        root.clear()

print(data)

# False
# False
# False
# [['John', 'Smith'], ['Michael', 'Smith'], ['Nick', 'Smith']]
Parsing big XML files efficiently
Based on the links you posted in the comments, I came up with the following, which iterates and splits more efficiently and works fine:
from lxml import etree
import pandas as pd

file = 'some_huge_file.xml'

time = []
data1_element1_x = []
data1_element1_y = []
data1_element2_x = []
data1_element2_y = []
data2_element1_x = []
data2_element1_y = []
data2_element2_x = []
data2_element2_y = []

if file.lower().endswith('.xml'):
    for event, elem in etree.iterparse(file):
        if elem.tag == "subelement":
            time.append(elem.get('tc'))
            for child in elem:
                if child.tag == "element1":
                    split_data = child.text.split(" ")
                    data1_element1_x.append(float(split_data[0]))
                    data1_element1_y.append(float(split_data[1]))
                    data2_element1_x.append(float(split_data[2]))
                    data2_element1_y.append(float(split_data[3]))
                elif child.tag == "element2":
                    split_data = child.text.split(" ")
                    data1_element2_x.append(float(split_data[0]))
                    data1_element2_y.append(float(split_data[1]))
                    data2_element2_x.append(float(split_data[2]))
                    data2_element2_y.append(float(split_data[3]))
        elem.clear()

df = pd.DataFrame({
    'Time': time,
    'Data1_element1_x': data1_element1_x,
    'Data1_element1_y': data1_element1_y,
    'Data1_element2_x': data1_element2_x,
    'Data1_element2_y': data1_element2_y,
    'Data2_element1_x': data2_element1_x,
    'Data2_element1_y': data2_element1_y,
    'Data2_element2_x': data2_element2_x,
    'Data2_element2_y': data2_element2_y
})

print(df)