Parsing Huge XML Files in PHP

There are only two PHP APIs that are really suited for processing large files. The first is the old expat API, and the second is the newer XMLReader functions. These APIs read continuous streams rather than loading the entire tree into memory (which is what SimpleXML and DOM do).

As an example, you might want to look at this partial parser for the DMOZ catalog:

<?php

class SimpleDMOZParser
{
    protected $_stack = array();
    protected $_file = "";
    protected $_parser = null;

    protected $_currentId = "";
    protected $_current = "";

    public function __construct($file)
    {
        $this->_file = $file;

        $this->_parser = xml_parser_create("UTF-8");
        xml_set_object($this->_parser, $this);
        xml_set_element_handler($this->_parser, "startTag", "endTag");
    }

    public function startTag($parser, $name, $attribs)
    {
        array_push($this->_stack, $this->_current);

        if ($name == "TOPIC" && count($attribs)) {
            $this->_currentId = $attribs["R:ID"];
        }

        if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) {
            echo $attribs["R:RESOURCE"] . "\n";
        }

        $this->_current = $name;
    }

    public function endTag($parser, $name)
    {
        $this->_current = array_pop($this->_stack);
    }

    public function parse()
    {
        $fh = fopen($this->_file, "r");
        if (!$fh) {
            die("Epic fail!\n");
        }

        // Feed the file to the parser in 4 KB chunks; the final
        // xml_parse() call is flagged once we hit end-of-file.
        while (!feof($fh)) {
            $data = fread($fh, 4096);
            xml_parse($this->_parser, $data, feof($fh));
        }

        fclose($fh);
        xml_parser_free($this->_parser);
    }
}

$parser = new SimpleDMOZParser("content.rdf.u8");
$parser->parse();
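For comparison, here is a rough sketch of the same scan using the newer XMLReader API. Note that XMLReader preserves the original case of names (expat upper-cases them by default), so the Topic/link/r:id/r:resource spellings below are assumptions about the dump's actual markup:

<?php

$reader = new XMLReader();
$reader->open('content.rdf.u8');

$currentId = '';
while ($reader->read()) {
    if ($reader->nodeType !== XMLReader::ELEMENT) {
        continue;
    }
    if ($reader->name === 'Topic') {
        // Remember which topic we are currently inside.
        $currentId = (string) $reader->getAttribute('r:id');
    } elseif ($reader->name === 'link'
        && strpos($currentId, 'Top/Home/Consumer_Information/Electronics/') === 0) {
        echo (string) $reader->getAttribute('r:resource'), "\n";
    }
}
$reader->close();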

Parse Large XML File in PHP Efficiently to Generate SQL

Alright, I have a working example for you with significant improvements in execution speed, memory usage, and database load:

<?php
define('INSERT_BATCH_SIZE', 500);
define('DRUG_XML_FILE', 'drugbank.xml');

$servername = "localhost";
$username = "root";
$password = "pass";
$dbname = "dbname";

function parseXml($mysql)
{
    $drugs = array();

    $xmlReader = new XMLReader();
    $xmlReader->open(DRUG_XML_FILE);

    // Move our pointer to the first <drug /> element.
    while ($xmlReader->read() && $xmlReader->name !== 'drug');

    $drugCount = 0;
    $totalDrugs = 0;

    // Iterate over the outer <drug /> elements.
    while ($xmlReader->name == 'drug')
    {
        // Convert the node into a SimpleXMLElement for ease of use.
        $item = new SimpleXMLElement($xmlReader->readOuterXML());

        // Cast to string and escape, since the values get interpolated into SQL.
        $name        = $mysql->real_escape_string((string) $item->name);
        $description = $mysql->real_escape_string((string) $item->description);
        $casNumber   = $mysql->real_escape_string((string) $item->{'cas-number'});
        $created     = $mysql->real_escape_string((string) $item['created']);
        $updated     = $mysql->real_escape_string((string) $item['updated']);
        $type        = $mysql->real_escape_string((string) $item['type']);

        $drugs[] = "('$name', '$description', '$casNumber', '$created', '$updated', '$type')";
        $drugCount++;
        $totalDrugs++;

        // Once we've reached the desired batch size, insert the batch and reset the counter.
        if ($drugCount >= INSERT_BATCH_SIZE)
        {
            batchInsertDrugs($mysql, $drugs);
            $drugCount = 0;
        }

        // Go to the next <drug />.
        $xmlReader->next('drug');
    }

    $xmlReader->close();

    // Insert the leftovers from the last batch.
    batchInsertDrugs($mysql, $drugs);

    echo "Inserted $totalDrugs total drugs.";
}

function batchInsertDrugs($mysql, &$drugs)
{
    // Nothing to do if the last batch was flushed exactly on the boundary.
    if (count($drugs) === 0)
    {
        return;
    }

    // Generate a batched INSERT statement.
    $statement = "INSERT INTO `drug` (name, description, cas_number, created, updated, type) VALUES";
    $statement = $statement . ' ' . implode(",\n", $drugs);

    // Debug: dump the statement being run.
    echo $statement, "\n";

    // Run the batch INSERT.
    if ($mysql->query($statement))
    {
        echo "Inserted " . count($drugs) . " drugs.";
    }
    else
    {
        echo "INSERT Error: " . $statement . "<br>" . $mysql->error . "<br>";
    }

    // Clear the buffer.
    $drugs = array();
}

// Create MySQL connection.
$mysql = new mysqli($servername, $username, $password, $dbname);
if ($mysql->connect_error)
{
    die("Connection failed: " . $mysql->connect_error);
}

parseXml($mysql);

I tested this example using the same dataset.
Using SimpleXML the way you are parses the entire document into memory, which is slow and memory-intensive. This approach uses XMLReader instead, which is a fast pull parser. You could probably make it faster still with PHP's SAX XML parser, but that follows a more complex callback-driven pattern, and the example above will already be noticeably better than what you started with.

The other significant change in my example is that we're using MySQL batched INSERTs, so we only actually hit the database once every 500 (configurable) items we process. You can tweak this number for better performance. After a certain point the query becomes too large for MySQL to process (the limit is governed by MySQL's max_allowed_packet setting), but you may be able to do a lot more than 500 at one time.
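As a side note on the escaping above: the same batching can also be done with a single prepared statement and placeholders, which sidesteps manual escaping entirely. A hedged sketch (the function name is mine, and it assumes each entry in $drugs is an array of six raw values rather than a pre-formatted string):

function batchInsertDrugsPrepared(mysqli $mysql, array &$drugs)
{
    if (count($drugs) === 0) {
        return;
    }

    // One (?, ?, ?, ?, ?, ?) group per row.
    $groups = implode(', ', array_fill(0, count($drugs), '(?, ?, ?, ?, ?, ?)'));
    $stmt = $mysql->prepare(
        "INSERT INTO `drug` (name, description, cas_number, created, updated, type) VALUES $groups"
    );

    // Flatten the rows into one flat argument list, binding everything as strings.
    $values = array_merge(...$drugs);
    $stmt->bind_param(str_repeat('s', count($values)), ...$values);
    $stmt->execute();
    $stmt->close();

    $drugs = array();
}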

If you'd like me to explain any part of this further, or if you have any problems with it, just let me know in the comments! :)

Parsing extremely large XML files in PHP

In PHP, you can read extremely large XML files with XMLReader:

$reader = new XMLReader();
$reader->open($xmlfile);

Extremely large XML files should be stored in a compressed format on disk, which makes sense given that XML has a high compression ratio; for example, gzipped as large.xml.gz.

PHP supports that quite well with XMLReader via the compression wrappers:

$xmlfile = 'compress.zlib://path/to/large.xml.gz';

$reader = new XMLReader();
$reader->open($xmlfile);

XMLReader allows you to operate on the current element "only". That means it's forward-only. If you need to keep parser state, you need to maintain it yourself.

I often find it helpful to wrap the basic movements into a set of iterators that know how to operate on XMLReader, e.g. iterating through elements or child elements only. You'll find this outlined in Parse XML with PHP and XMLReader.
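As a rough illustration of that iterator idea, here is a minimal generator-based sketch (the elements() helper and the 'item' element name are my own, not from the article linked above):

function elements(XMLReader $reader, string $name): Generator
{
    // Move to the first matching element.
    while ($reader->read() && $reader->name !== $name);

    // Yield each match as a SimpleXMLElement, then skip to the
    // next sibling with the same name.
    while ($reader->name === $name) {
        yield new SimpleXMLElement($reader->readOuterXml());
        $reader->next($name);
    }
}

$reader = new XMLReader();
$reader->open('compress.zlib://path/to/large.xml.gz');

foreach (elements($reader, 'item') as $item) {
    echo $item->title, "\n";
}

$reader->close();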

See also:

  • PHP open gzipped XML

Best way to process large XML in PHP

For a large file, you'll want to use a SAX parser rather than a DOM parser.

With a DOM parser, the whole file is read in and loaded into an object tree in memory. With a SAX parser, the file is read sequentially and your user-defined callback functions are called to handle the data (start tags, end tags, CDATA sections, etc.).

With a SAX parser you'll need to maintain state yourself (e.g. which tag you are currently in), which makes it a bit more complicated, but for a large file it will be far more memory-efficient.
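For instance, here is a minimal sketch of that state-keeping with PHP's expat-style API (the element names and file name are made up for illustration):

$path   = array();   // stack of currently open elements
$titles = array();   // collected <title> text

$parser = xml_parser_create('UTF-8');
xml_set_element_handler(
    $parser,
    function ($p, $name, $attrs) use (&$path) { $path[] = $name; },
    function ($p, $name) use (&$path) { array_pop($path); }
);
xml_set_character_data_handler($parser, function ($p, $data) use (&$path, &$titles) {
    // expat upper-cases names by default; this handler may also fire more
    // than once per text node, so real code should concatenate fragments.
    if (end($path) === 'TITLE') {
        $titles[] = $data;
    }
});

$fh = fopen('large.xml', 'rb');
while (!feof($fh)) {
    xml_parse($parser, fread($fh, 8192), feof($fh));
}
fclose($fh);
xml_parser_free($parser);

print_r($titles);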

How to parse a large XML file

For parsing large documents like this I suggest using a streaming parser like XMLReader, which allows you to parse XML without loading the entire file into memory at once. Its expand() method makes it easy to use hand in hand with the DOM API.

Tree-based parsers like the DOM are very fast, but use more memory since the entire document must be loaded. Streaming parsers like XMLReader keep memory use down because you only hold a bit of the document at a time, but the trade-off is longer processing time.

By using both, you can adjust how much of each you use in tandem to stay under hard bounds like memory limits while minimizing processing time.


Example:

$dom    = new DOMDocument();
$xpath  = new DOMXPath($dom);
$reader = new XMLReader();
$reader->open('file.xml');

while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'Book') {
        $node = $dom->importNode($reader->expand(), true);
        $result = $xpath->evaluate(
            'string(self::Book[BookCode = "AD0WNR"]/Subject)',
            $node
        );
        if ($result) {
            echo $result;
            $reader->close();
            break;
        }
    }
}

What this does is iterate through the nodes in the XML. Whenever it hits a <Book> element, we:

  1. Import that into the DOM.
  2. Evaluate the XPath expression*.

If the XPath expression found what we're looking for:

  1. Print the result.
  2. Close the file.
  3. Break the read loop.

We do #2 and #3 because we're only looking for a single result. If you have more results to find, remove those steps and keep on trucking.


(* I've replaced the initial double forward slash in the XPath expression with self:: to act on the context node passed as the second parameter to evaluate() - thanks, @ThW)

PHP Parse huge XML file

Rather than using SimpleXML to fetch all of the nodes within <UpdatedProducts>, you can nest the same code so that it reads inside this node for the <ProductId> nodes. This means the inner loop gets one node at a time...

while ($xml->name == 'UpdatedProducts') {
    while ($xml->read() && $xml->name !== 'ProductId');
    while ($xml->name == 'ProductId') {
        echo $xml->readOuterXml() . PHP_EOL;
        $xml->next('ProductId');
    }
    $xml->next('UpdatedProducts');
}

For both types (updated and removed products), I've tried to reduce it to one loop. It's not ideal, but it seems to work...

$xml = new \XMLReader();
$xml->open(__DIR__ . '/../../var/tmp/out.xml');
while ($xml->read() && $xml->name != 'UpdatedProducts');
$type = "update";
while ($xml->read() && $xml->name != 'ProductId');
while ($xml->name == 'ProductId') {
    $id = $xml->readInnerXml();
    if (!empty($id)) {
        $this->saveToDb($id, $type);
    }
    while ($xml->read() && $xml->name != 'ProductId'
        && $xml->name != 'RemovedProducts');
    if ($xml->name == 'RemovedProducts') {
        $type = "remove";
        while ($xml->read() && $xml->name != 'ProductId');
    }
}

There is an alternative, using a library I've written that wraps around XMLReader (at https://github.com/NigelRel3/XMLReaderReg). You will have to download it, as there is no Composer version yet. Copy the XMLReaderReg.php script into your project, add

require_once "XMLReaderReg.php";

then you can use...

$reader = new XMLReaderReg();
$reader->open(__DIR__ . "/../../var/tmp/out.xml");

$reader->process([
    '.*/UpdatedProducts/ProductId' => function (SimpleXMLElement $data): void {
        $this->saveToDb((string)$data, "update");
    },
    '.*/RemovedProducts/ProductId' => function (SimpleXMLElement $data): void {
        $this->saveToDb((string)$data, "remove");
    },
]);

$reader->close();


PHP: Parsing huge XML without memory

Instead of XMLReader, use XML Parser. It allows you to parse XML in chunks, so it is very memory-efficient. Here is a working example that looks for <ATTACHMENT> tags and decodes their contents into files. Dealing with base64 is easy; just remember that it turns every 3 bytes into a 4-character encoded string, so as long as you supply chunks whose length is divisible by 4, you can concatenate the decoded results.

<?php

class ExtractAttachments {

    private $parser;
    private $tmpFile;
    private $tmpHandle;
    private $buffer;

    private $files = array();

    public function __construct($xml) {
        $this->parser = xml_parser_create('UTF-8');
        xml_set_object($this->parser, $this);
        xml_set_element_handler($this->parser, 'tag_start', 'tag_end');
        xml_set_character_data_handler($this->parser, 'cdata');

        $handle = fopen($xml, 'rb');
        while ($string = fread($handle, 4096)) {
            xml_parse($this->parser, $string, false);
        }
        xml_parse($this->parser, '', true);
        fclose($handle);
        xml_parser_free($this->parser);
    }

    public function tag_start($parser, $tag, $attr) {
        if ($tag == 'ATTACHMENT') {
            // Open a temporary file to receive the decoded attachment.
            $this->tmpFile = tempnam(__DIR__, 'xml');
            $this->tmpHandle = fopen($this->tmpFile, 'wb');
        }
    }

    public function tag_end($parser, $tag) {
        if ($this->tmpHandle) {
            // Flush any remaining buffered base64 data.
            if ($this->buffer) {
                fwrite($this->tmpHandle, base64_decode($this->buffer));
                $this->buffer = '';
            }
            fclose($this->tmpHandle);
            $this->tmpHandle = null;
            $this->files[] = $this->tmpFile;
        }
    }

    public function cdata($parser, $data) {
        if ($this->tmpHandle) {
            $data = trim($data);
            if ($this->buffer) {
                $data = $this->buffer . $data;
                $this->buffer = '';
            }
            // Hold back any trailing bytes that don't fill a complete
            // 4-character base64 group, so decoded chunks concatenate cleanly.
            if (0 != ($modulo = strlen($data) % 4)) {
                $this->buffer = substr($data, -$modulo);
                $data = substr($data, 0, -$modulo);
            }
            fwrite($this->tmpHandle, base64_decode($data));
        }
    }

    public function getFiles() {
        return $this->files;
    }
}

$xml = new ExtractAttachments('large.xml');
print_r($xml->getFiles());

