Parsing Huge XML Files in PHP
There are only two PHP APIs that are really suited for processing large files. The first is the old expat API, and the second is the newer XMLReader functions. These APIs read continuous streams rather than loading the entire tree into memory (which is what SimpleXML and DOM do).
For an example, you might want to look at this partial parser of the DMOZ catalog:
<?php
class SimpleDMOZParser
{
    protected $_stack = array();
    protected $_file = "";
    protected $_parser = null;
    protected $_currentId = "";
    protected $_current = "";

    public function __construct($file)
    {
        $this->_file = $file;
        $this->_parser = xml_parser_create("UTF-8");
        // Passing array callables directly avoids xml_set_object(),
        // which is deprecated as of PHP 8.0.
        xml_set_element_handler(
            $this->_parser,
            array($this, "startTag"),
            array($this, "endTag")
        );
    }

    public function startTag($parser, $name, $attribs)
    {
        array_push($this->_stack, $this->_current);
        if ($name == "TOPIC" && count($attribs)) {
            $this->_currentId = $attribs["R:ID"];
        }
        if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) {
            echo $attribs["R:RESOURCE"] . "\n";
        }
        $this->_current = $name;
    }

    public function endTag($parser, $name)
    {
        $this->_current = array_pop($this->_stack);
    }

    public function parse()
    {
        $fh = fopen($this->_file, "r");
        if (!$fh) {
            die("Could not open " . $this->_file . "\n");
        }
        while (!feof($fh)) {
            $data = fread($fh, 4096);
            xml_parse($this->_parser, $data, feof($fh));
        }
        fclose($fh);
    }
}

$parser = new SimpleDMOZParser("content.rdf.u8");
$parser->parse();
Parsing extremely large XML files in PHP
In PHP, you can read extremely large XML files with XMLReader:
$reader = new XMLReader();
$reader->open($xmlfile);
Extremely large XML files should be stored in a compressed format on disk. This makes sense, as XML files have a high compression ratio; for example, gzipped as large.xml.gz.
PHP supports that quite well with XMLReader via the compression wrappers:
$xmlfile = 'compress.zlib://path/to/large.xml.gz';
$reader = new XMLReader();
$reader->open($xmlfile);
XMLReader only lets you operate on the current element; it is forward-only. If you need to keep parser state, you have to build it yourself.
I often find it helpful to wrap the basic movements into a set of iterators that know how to operate on XMLReader, such as iterating through elements or child elements only. You'll find this outlined in Parse XML with PHP and XMLReader.
See also:
- PHP open gzipped XML
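One way to sketch that iterator idea is a generator that positions the XMLReader on each element of a given name and yields it as a SimpleXMLElement, so only one element is materialized at a time. The `<item>` element name and the feed structure below are illustrative assumptions, not from the original; the same generator works whether the reader was opened on a plain file, a compress.zlib:// wrapped .gz file, or (as here, for the demo) an in-memory string.

```php
<?php
// Generator wrapping XMLReader: yields each <$name> element as a
// SimpleXMLElement, skipping its subtree afterwards with next().
function elements(XMLReader $reader, string $name): Generator
{
    // Move to the first matching element...
    while ($reader->read() && $reader->name !== $name);
    // ...then yield each sibling of that name in turn.
    while ($reader->name === $name) {
        yield new SimpleXMLElement($reader->readOuterXML());
        $reader->next($name);
    }
}

$reader = new XMLReader();
$reader->XML('<feed><item><title>first</title></item><item><title>second</title></item></feed>');

foreach (elements($reader, 'item') as $item) {
    echo $item->title, "\n";
}
$reader->close();
```

With a real file you would call `$reader->open('compress.zlib://path/to/large.xml.gz')` instead of `$reader->XML(...)`; the consuming loop stays the same.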
Parse Large XML File in PHP Efficiently to Generate SQL
Alright, here is a working example with significant improvements in execution speed, memory usage, and database load:
<?php
define('INSERT_BATCH_SIZE', 500);
define('DRUG_XML_FILE', 'drugbank.xml');

$servername = "localhost";
$username = "root";
$password = "pass";
$dbname = "dbname";

function parseXml($mysql)
{
    $drugs = array();
    $xmlReader = new XMLReader();
    $xmlReader->open(DRUG_XML_FILE);

    // Move our pointer to the first <drug /> element.
    while ($xmlReader->read() && $xmlReader->name !== 'drug');

    $drugCount = 0;
    $totalDrugs = 0;

    // Iterate over the outer <drug /> elements.
    while ($xmlReader->name == 'drug') {
        // Convert the node into a SimpleXMLElement for ease of use.
        $item = new SimpleXMLElement($xmlReader->readOuterXML());

        // Escape every value before interpolating it into the SQL string.
        $name        = $mysql->real_escape_string((string) $item->name);
        $description = $mysql->real_escape_string((string) $item->description);
        $casNumber   = $mysql->real_escape_string((string) $item->{'cas-number'});
        $created     = $mysql->real_escape_string((string) $item['created']);
        $updated     = $mysql->real_escape_string((string) $item['updated']);
        $type        = $mysql->real_escape_string((string) $item['type']);

        $drugs[] = "('$name', '$description', '$casNumber', '$created', '$updated', '$type')";
        $drugCount++;
        $totalDrugs++;

        // Once we've reached the desired batch size, insert the batch and reset the counter.
        if ($drugCount >= INSERT_BATCH_SIZE) {
            batchInsertDrugs($mysql, $drugs);
            $drugCount = 0;
        }

        // Go to the next <drug />.
        $xmlReader->next('drug');
    }

    $xmlReader->close();

    // Insert the leftovers from the last batch.
    batchInsertDrugs($mysql, $drugs);

    echo "Inserted $totalDrugs total drugs.";
}

function batchInsertDrugs($mysql, &$drugs)
{
    // Nothing left to insert (e.g. the total was an exact multiple of the batch size).
    if (empty($drugs)) {
        return;
    }

    // Generate a batched INSERT statement.
    $statement = "INSERT INTO `drug` (name, description, cas_number, created, updated, type) VALUES";
    $statement = $statement . ' ' . implode(",\n", $drugs);

    // Run the batch INSERT.
    if ($mysql->query($statement)) {
        echo "Inserted " . count($drugs) . " drugs.";
    } else {
        echo "INSERT Error: " . $statement . "<br>" . $mysql->error . "<br>";
    }

    // Clear the buffer.
    $drugs = array();
}

// Create MySQL connection.
$mysql = new mysqli($servername, $username, $password, $dbname);
if ($mysql->connect_error) {
    die("Connection failed: " . $mysql->connect_error);
}

parseXml($mysql);
I tested this example using the same dataset.
Using SimpleXML the way you are parses the entire document into memory, which is slow and memory-intensive. This approach uses XMLReader, a fast pull parser. You could probably make it faster still with the SAX-style XML Parser extension, but that is a more complex pattern, and the example above will already be noticeably better than what you started with.
The other significant change is the use of MySQL batched inserts: we only actually hit the database once every 500 (configurable) items we process. You can tweak this number for better performance; after a certain point the query becomes too large for MySQL to process, but you may well be able to do far more than 500 at a time.
If you'd like me to explain any part of this further, or if you have any problems with it, just let me know in the comments! :)
How to parse a large XML file
For parsing large documents like this I suggest using a streaming parser like XMLReader, which lets you parse XML without loading the entire file into memory at once. Its expand() method also makes it easy to use hand in hand with the DOM API.
Tree-based parsers like the DOM are very fast, but use more memory, as the entire document must be loaded. Streaming parsers like XMLReader keep memory use down because you only hold a small piece of the document at a time, but the trade-off is longer processing time.
By using both you can adjust how you use each in tandem in order to get under any hard bounds like memory limits while minimizing processing time.
Example:
$dom = new DOMDocument();
$xpath = new DOMXPath($dom);

$reader = new XMLReader();
$reader->open('file.xml');

while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'Book') {
        $node = $dom->importNode($reader->expand(), true);
        $result = $xpath->evaluate(
            'string(self::Book[BookCode = "AD0WNR"]/Subject)',
            $node
        );
        if ($result) {
            echo $result;
            $reader->close();
            break;
        }
    }
}
What this does is iterate through the nodes in the XML. Whenever it hits a <Book> element, we:
- Import that into the DOM.
- Evaluate the XPath expression*.
If the XPath expression finds what we're looking for:
- Print the result.
- Close the file.
- Break the read loop.
We do #2 and #3 because we're only looking for a single result. If you have more you want to find, remove those and keep on trucking.
(* I've replaced the initial double forward slash in the XPath expression with self:: so it acts on the context node passed as the second parameter to evaluate() - thanks, @ThW)
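To show the "keep on trucking" variant: if you want every matching <Book> rather than just the first, drop the close()/break and collect results as you go. The inline XML below stands in for file.xml and is an illustrative assumption; the element names mirror the example above.

```php
<?php
$xml = '<Books>'
     . '<Book><BookCode>AD0WNR</BookCode><Subject>PHP</Subject></Book>'
     . '<Book><BookCode>XY123</BookCode><Subject>XML</Subject></Book>'
     . '</Books>';

$dom = new DOMDocument();
$xpath = new DOMXPath($dom);

$reader = new XMLReader();
$reader->XML($xml);

$subjects = [];
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'Book') {
        $node = $dom->importNode($reader->expand(), true);
        // Collect every Book's Subject instead of stopping at the first hit.
        $result = $xpath->evaluate('string(self::Book/Subject)', $node);
        if ($result) {
            $subjects[] = $result;
        }
    }
}
$reader->close();

print_r($subjects);
```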
Best way to process large XML in PHP
For a large file, you'll want to use a SAX parser rather than a DOM parser.
With a DOM parser, the whole file is read and loaded into an object tree in memory. With a SAX parser, the file is read sequentially and your user-defined callback functions are called to handle the data (start tags, end tags, CDATA, etc.).
With a SAX parser you'll need to maintain state yourself (e.g. which tag you are currently in), which makes it a bit more complicated, but for a large file it will be much more memory-efficient.
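A minimal sketch of that state-keeping, using PHP's expat-based xml_parser functions: a flag records whether we are currently inside a <title> element, and the character-data callback consults it. The <book>/<title> element names and inline XML are illustrative assumptions.

```php
<?php
$titles = [];
$inTitle = false;

$parser = xml_parser_create('UTF-8');
// Keep element names as-is instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($p, $name, $attribs) use (&$inTitle) {
        // State: remember that we just entered a <title> element.
        if ($name === 'title') {
            $inTitle = true;
        }
    },
    function ($p, $name) use (&$inTitle) {
        if ($name === 'title') {
            $inTitle = false;
        }
    }
);

xml_set_character_data_handler($parser, function ($p, $data) use (&$inTitle, &$titles) {
    // Character data only matters when the state says we're inside <title>.
    if ($inTitle) {
        $titles[] = $data;
    }
});

xml_parse($parser, '<books><book><title>A</title></book><book><title>B</title></book></books>', true);
xml_parser_free($parser);

print_r($titles);
```

With a large file you would feed xml_parse() chunks from fread() in a loop, as in the DMOZ example above, rather than one string.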
How to edit large XML files in PHP based on a record in the XML Node
Goal
Desired result: I want to create a new XML file with only the records where the child "ShowOnWebsite" is true.
Given
test.xml
<Items>
<Item>
<Barcode>...</Barcode>
<BrandCode>...</BrandCode>
<Title>...</Title>
<Content>...</Content>
<ShowOnWebsite>false</ShowOnWebsite>
</Item>
<Item>
<Barcode>...</Barcode>
<BrandCode>...</BrandCode>
<Title>...</Title>
<Content>...</Content>
<ShowOnWebsite>true</ShowOnWebsite>
</Item>
<Item>
<Barcode>...</Barcode>
<BrandCode>...</BrandCode>
<Title>...</Title>
<Content>...</Content>
<ShowOnWebsite>false</ShowOnWebsite>
</Item>
</Items>
Code
This is the implementation I wrote. The getItems function yields the child items without loading the whole XML into memory at once.
function getItems($fileName) {
    if ($file = fopen($fileName, "r")) {
        $buffer = "";
        $active = false;
        while (!feof($file)) {
            $line = fgets($file);
            $line = trim(str_replace(["\r", "\n"], "", $line));
            if ($line == "<Item>") {
                $buffer .= $line;
                $active = true;
            } elseif ($line == "</Item>") {
                $buffer .= $line;
                $active = false;
                yield new SimpleXMLElement($buffer);
                $buffer = "";
            } elseif ($active == true) {
                $buffer .= $line;
            }
        }
        fclose($file);
    }
}
$output = new SimpleXMLElement('<?xml version="1.0" encoding="utf-8"?><Items></Items>');

foreach (getItems("test.xml") as $element) {
    if ($element->ShowOnWebsite == "true") {
        $item = $output->addChild('Item');
        $item->addChild('Barcode', (string) $element->Barcode);
        $item->addChild('BrandCode', (string) $element->BrandCode);
        $item->addChild('Title', (string) $element->Title);
        $item->addChild('Content', (string) $element->Content);
        $item->addChild('ShowOnWebsite', (string) $element->ShowOnWebsite);
    }
}

$fileName = __DIR__ . "/test_" . rand(100, 999999) . ".xml";
$output->asXML($fileName);
Output
<?xml version="1.0" encoding="utf-8"?>
<Items><Item><Barcode>...</Barcode><BrandCode>...</BrandCode><Title>...</Title><Content>...</Content><ShowOnWebsite>true</ShowOnWebsite></Item></Items>
How can I parse large XML files in PHP?
Switch from SimpleXML to XMLReader when working with large XML files. It is a pull parser that does not load the entire file into memory to process it.
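The basic shape of that switch looks like this. For the demo a tiny sample file is written first; in practice large.xml would be your multi-gigabyte input, and the <record>/<id> element names are illustrative assumptions.

```php
<?php
file_put_contents('large.xml',
    '<records><record><id>1</id></record><record><id>2</id></record></records>');

$reader = new XMLReader();
$reader->open('large.xml');

$ids = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'record') {
        // Only this one element's XML is materialized, never the whole file.
        $record = new SimpleXMLElement($reader->readOuterXML());
        $ids[] = (string) $record->id;
    }
}
$reader->close();

print_r($ids);
```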