Best Way to Process Large XML in PHP

Best way to process large XML in PHP

For a large file, you'll want to use a SAX parser rather than a DOM parser.

A DOM parser reads the whole file and loads it into an object tree in memory. A SAX parser reads the file sequentially and calls your user-defined callback functions to handle the data (start tags, end tags, CDATA, etc.).

With a SAX parser you'll need to maintain state yourself (e.g. which tag you are currently in), which makes it a bit more complicated, but for a large file it will be far more memory-efficient.
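That state-keeping can be as small as remembering the current element name. Below is a minimal sketch using PHP's expat-based xml_parser functions; the `<items>`/`<title>` document is a made-up example, not from the question:

```php
<?php
// Minimal SAX-style parse that tracks the current element name as state.
// The element names here are hypothetical; adapt them to your document.
$currentTag = '';
$titles = [];

$parser = xml_parser_create('UTF-8');

xml_set_element_handler(
    $parser,
    function ($parser, $name, $attribs) use (&$currentTag) {
        $currentTag = $name; // remember which tag we are currently in
    },
    function ($parser, $name) use (&$currentTag) {
        $currentTag = '';
    }
);

xml_set_character_data_handler($parser, function ($parser, $data) use (&$currentTag, &$titles) {
    if ($currentTag === 'TITLE') { // element names are uppercased by default
        $titles[] = trim($data);   // note: character data may arrive in chunks
    }
});

$xml = '<items><item><title>First</title></item><item><title>Second</title></item></items>';
xml_parse($parser, $xml, true);
xml_parser_free($parser);

print_r($titles); // First, then Second
```

In a real parser you would feed `xml_parse()` chunks from `fread()` instead of one string, exactly as the DMOZ example below does.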

Parsing Huge XML Files in PHP

There are only two PHP APIs that are really suited for processing large files. The first is the old expat API, and the second is the newer XMLReader functions. These APIs read continuous streams rather than loading the entire tree into memory (which is what SimpleXML and DOM do).

As an example, you might want to look at this partial parser of the DMOZ catalog:

<?php

class SimpleDMOZParser
{
    protected $_stack = array();
    protected $_file = "";
    protected $_parser = null;

    protected $_currentId = "";
    protected $_current = "";

    public function __construct($file)
    {
        $this->_file = $file;

        $this->_parser = xml_parser_create("UTF-8");
        xml_set_object($this->_parser, $this);
        xml_set_element_handler($this->_parser, "startTag", "endTag");
    }

    public function startTag($parser, $name, $attribs)
    {
        array_push($this->_stack, $this->_current);

        if ($name == "TOPIC" && count($attribs)) {
            $this->_currentId = $attribs["R:ID"];
        }

        if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) {
            echo $attribs["R:RESOURCE"] . "\n";
        }

        $this->_current = $name;
    }

    public function endTag($parser, $name)
    {
        $this->_current = array_pop($this->_stack);
    }

    public function parse()
    {
        $fh = fopen($this->_file, "r");
        if (!$fh) {
            die("Epic fail!\n");
        }

        while (!feof($fh)) {
            $data = fread($fh, 4096);
            xml_parse($this->_parser, $data, feof($fh));
        }

        fclose($fh);
    }
}

$parser = new SimpleDMOZParser("content.rdf.u8");
$parser->parse();

Parse Large XML File in PHP Efficiently to Generate SQL

Alright, I have a working example for you with big improvements in execution speed, memory usage, and database load:

<?php
define('INSERT_BATCH_SIZE', 500);
define('DRUG_XML_FILE', 'drugbank.xml');

$servername = "localhost"; // e.g. localhost
$username = "root";
$password = "pass";
$dbname = "dbname";

function parseXml($mysql)
{
    $drugs = array();

    $xmlReader = new XMLReader();
    $xmlReader->open(DRUG_XML_FILE);

    // Move our pointer to the first <drug /> element.
    while ($xmlReader->read() && $xmlReader->name !== 'drug');

    $drugCount = 0;
    $totalDrugs = 0;

    // Iterate over the outer <drug /> elements.
    while ($xmlReader->name == 'drug')
    {
        // Convert the node into a SimpleXMLElement for ease of use.
        $item = new SimpleXMLElement($xmlReader->readOuterXML());

        // Escape each value before interpolating it into the SQL string.
        $name = $mysql->real_escape_string((string) $item->name);
        $description = $mysql->real_escape_string((string) $item->description);
        $casNumber = $mysql->real_escape_string((string) $item->{'cas-number'});
        $created = $mysql->real_escape_string((string) $item['created']);
        $updated = $mysql->real_escape_string((string) $item['updated']);
        $type = $mysql->real_escape_string((string) $item['type']);

        $drugs[] = "('$name', '$description', '$casNumber', '$created', '$updated', '$type')";
        $drugCount++;
        $totalDrugs++;

        // Once we've reached the desired batch size, insert the batch and reset the counter.
        if ($drugCount >= INSERT_BATCH_SIZE)
        {
            batchInsertDrugs($mysql, $drugs);
            $drugCount = 0;
        }

        // Go to the next <drug />.
        $xmlReader->next('drug');
    }

    $xmlReader->close();

    // Insert the leftovers from the last batch.
    batchInsertDrugs($mysql, $drugs);

    echo "Inserted $totalDrugs total drugs.";
}

function batchInsertDrugs($mysql, &$drugs)
{
    // Nothing to do if the last batch came out empty.
    if (empty($drugs))
    {
        return;
    }

    // Generate a batched INSERT statement.
    $statement = "INSERT INTO `drug` (name, description, cas_number, created, updated, type) VALUES";
    $statement = $statement . ' ' . implode(",\n", $drugs);

    echo $statement, "\n";

    // Run the batch INSERT.
    if ($mysql->query($statement))
    {
        echo "Inserted " . count($drugs) . " drugs.";
    }
    else
    {
        echo "INSERT Error: " . $statement . "<br>" . $mysql->error . "<br>";
    }

    // Clear the buffer.
    $drugs = array();
}

// Create the MySQL connection.
$mysql = new mysqli($servername, $username, $password, $dbname);
if ($mysql->connect_error)
{
    die("Connection failed: " . $mysql->connect_error);
}

parseXml($mysql);

I tested this example using the same dataset.
Using SimpleXML the way you are parses the entire document into memory, which is slow and memory-intensive. This approach uses XMLReader, which is a fast pull parser. You could probably make it faster still using PHP's SAX-style XML Parser extension, but that's a more complex pattern, and the example above will already be noticeably better than what you started with.

The other significant change in my example is that we're using batched MySQL inserts, so we only actually hit the database once every 500 (configurable) items we process. You can tweak this number for better performance. After a certain point the query will become too large for MySQL to process, but you may be able to do a lot more than 500 at a time.
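The same batching idea can also be done with a prepared statement, which sidesteps escaping entirely. This is a hedged sketch, not the answer's original code: `buildBatchInsert()` is a hypothetical helper, and the table/column names are taken from the example above.

```php
<?php
// Sketch: one multi-row INSERT via a prepared statement.
// buildBatchInsert() is a hypothetical helper that emits one (?,?,?,?,?,?)
// group per row for the `drug` table used in the example.
function buildBatchInsert(int $rowCount): string
{
    $placeholders = implode(',', array_fill(0, $rowCount, '(?,?,?,?,?,?)'));
    return "INSERT INTO `drug` (name, description, cas_number, created, updated, type) VALUES $placeholders";
}

// $rows is a list of six-element string arrays instead of pre-built SQL fragments.
function batchInsertDrugsPrepared(mysqli $mysql, array $rows): void
{
    if (!$rows) {
        return;
    }

    $stmt = $mysql->prepare(buildBatchInsert(count($rows)));

    // Flatten [[name, desc, cas, created, updated, type], ...] into one list.
    $params = array_merge(...$rows);
    $stmt->bind_param(str_repeat('s', count($params)), ...$params);

    $stmt->execute();
    $stmt->close();
}
```

With this variant the parse loop would push `array($name, $description, ...)` onto the buffer instead of a quoted SQL tuple.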

If you'd like me to explain any part of this further, or if you have any problems with it, just let me know in the comments! :)

Parsing extremely large XML files in php

In PHP, you can read extremely large XML files with XMLReader:

$reader = new XMLReader();
$reader->open($xmlfile);

Extremely large XML files should be stored in a compressed format on disk. At least that makes sense, as XML files have a high compression ratio, for example gzipped as large.xml.gz.

PHP supports that quite well with XMLReader via the compression stream wrappers:

$xmlfile = 'compress.zlib://path/to/large.xml.gz';

$reader = new XMLReader();
$reader->open($xmlfile);

XMLReader lets you operate on the current element "only". That means it's forward-only. If you need to keep parser state, you need to build it yourself.

I often find it helpful to wrap the basic movements into a set of iterators that know how to operate on XMLReader, like iterating through elements or child elements only. You'll find this outlined in Parse XML with PHP and XMLReader.
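A minimal version of such a wrapper could be a generator that yields each matching element as a SimpleXMLElement. This is a sketch under my own naming (`streamElements()` is hypothetical), following the standard position-then-hop XMLReader pattern:

```php
<?php
// Hypothetical wrapper: yield each matching element of a large file,
// one at a time, as a SimpleXMLElement.
function streamElements(string $xmlfile, string $element): Generator
{
    $reader = new XMLReader();
    $reader->open($xmlfile);

    // Move to the first matching element...
    while ($reader->read() && $reader->name !== $element);

    // ...then hop from sibling to sibling, expanding only one at a time.
    while ($reader->name === $element) {
        yield new SimpleXMLElement($reader->readOuterXML());
        $reader->next($element);
    }

    $reader->close();
}

// Demo with a small sample file standing in for a large one.
file_put_contents('sample.xml',
    '<items><item><title>First</title></item><item><title>Second</title></item></items>');

foreach (streamElements('sample.xml', 'item') as $item) {
    echo $item->title, "\n"; // First, then Second
}
```

The caller gets the convenience of SimpleXML per element while only one element's subtree is ever in memory.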

See as well:

  • PHP open gzipped XML

How to edit large XML files in PHP based on a record in the XML Node

Goal

Desired result: I want to create a new XML file with only the records where the child "ShowOnWebsite" is true.

Given

test.xml

<Items>
    <Item>
        <Barcode>...</Barcode>
        <BrandCode>...</BrandCode>
        <Title>...</Title>
        <Content>...</Content>
        <ShowOnWebsite>false</ShowOnWebsite>
    </Item>
    <Item>
        <Barcode>...</Barcode>
        <BrandCode>...</BrandCode>
        <Title>...</Title>
        <Content>...</Content>
        <ShowOnWebsite>true</ShowOnWebsite>
    </Item>
    <Item>
        <Barcode>...</Barcode>
        <BrandCode>...</BrandCode>
        <Title>...</Title>
        <Content>...</Content>
        <ShowOnWebsite>false</ShowOnWebsite>
    </Item>
</Items>

Code

This is the implementation I wrote. The getItems() generator yields the items one at a time without loading the whole XML into memory at once.

function getItems($fileName) {
    if ($file = fopen($fileName, "r")) {
        $buffer = "";
        $active = false;
        while (!feof($file)) {
            $line = fgets($file);
            $line = trim(str_replace(["\r", "\n"], "", $line));
            if ($line == "<Item>") {
                $buffer .= $line;
                $active = true;
            } elseif ($line == "</Item>") {
                $buffer .= $line;
                $active = false;
                yield new SimpleXMLElement($buffer);
                $buffer = "";
            } elseif ($active == true) {
                $buffer .= $line;
            }
        }
        fclose($file);
    }
}

$output = new SimpleXMLElement('<?xml version="1.0" encoding="utf-8"?><Items></Items>');
foreach (getItems("test.xml") as $element) {
    if ($element->ShowOnWebsite == "true") {
        $item = $output->addChild('Item');
        $item->addChild('Barcode', (string) $element->Barcode);
        $item->addChild('BrandCode', (string) $element->BrandCode);
        $item->addChild('Title', (string) $element->Title);
        $item->addChild('Content', (string) $element->Content);
        $item->addChild('ShowOnWebsite', (string) $element->ShowOnWebsite);
    }
}

$fileName = __DIR__ . "/test_" . rand(100, 999999) . ".xml";
$output->asXML($fileName);

Output

<?xml version="1.0" encoding="utf-8"?>
<Items><Item><Barcode>...</Barcode><BrandCode>...</BrandCode><Title>...</Title><Content>...</Content><ShowOnWebsite>true</ShowOnWebsite></Item></Items>
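If even the filtered result tree is too large to hold in a SimpleXMLElement, a hedged alternative (not part of the original answer) is to stream with XMLReader and write matching items straight to disk with XMLWriter, so no more than one `<Item>` is ever in memory:

```php
<?php
// Sketch: stream-filter test.xml into filtered.xml, keeping only <Item>
// elements whose <ShowOnWebsite> is "true". The sample file here is a
// shortened stand-in for the question's test.xml.
file_put_contents('test.xml', '<Items>'
    . '<Item><Barcode>1</Barcode><ShowOnWebsite>false</ShowOnWebsite></Item>'
    . '<Item><Barcode>2</Barcode><ShowOnWebsite>true</ShowOnWebsite></Item>'
    . '</Items>');

$reader = new XMLReader();
$reader->open('test.xml');

$writer = new XMLWriter();
$writer->openUri('filtered.xml');
$writer->startDocument('1.0', 'utf-8');
$writer->startElement('Items');

// Position on the first <Item>, then hop from <Item> to <Item>.
while ($reader->read() && $reader->name !== 'Item');
while ($reader->name === 'Item') {
    $fragment = $reader->readOuterXML();
    if ((string) (new SimpleXMLElement($fragment))->ShowOnWebsite === 'true') {
        $writer->writeRaw($fragment); // copy the matching <Item> verbatim
    }
    $reader->next('Item');
}

$writer->endElement();
$writer->endDocument();
$writer->flush();
$reader->close();
```

Because the output is written incrementally, this approach scales to files far larger than available memory.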

Processing large XML files using PHP

Using a while loop to go through the entire $db_xml document each time you need to search for a product is inefficient. Importing the second XML file to an SQL table is not a bad idea, but it seems a bit annoying when you can actually use a PHP array indexed by product_id.

I've prepared some code to illustrate my point:

<?php

$zasoby_xml = file_get_contents('zasoby.xml');

$sxe0 = new SimpleXMLElement($zasoby_xml);
$sxe0->registerXPathNamespace('lStk', 'http://www.stormware.cz/schema/version_2/list_stock.xsd');
$lStkStock = $sxe0->xpath('//lStk:stock');
$cnt = count($lStkStock);

$sxe = new SimpleXMLElement($zasoby_xml);
$sxe->registerXPathNamespace('stk', 'http://www.stormware.cz/schema/version_2/stock.xsd');
$stkCode  = $sxe->xpath('//stk:code');  // product code
$stkName  = $sxe->xpath('//stk:name');  // product name
$stkCount = $sxe->xpath('//stk:count'); // count in the stock

$db_xml = simplexml_load_file('db.xml');

// Loop through record elements of db.xml to build an array that can be accessed by product_id.
$records = array();

foreach ($db_xml->record as $record) {
    $records[(string) $record->product_id] = $record;
}

// Loop through all products to display their information.
for ($i = 0; $i < $cnt; $i++) {

    // Display only products in stock.
    if ((float) $stkCount[$i] > 0) {

        // Access this record directly by product_id (code) instead of looping through all records in db.xml.
        if (isset($records[(string) $stkCode[$i]])) {
            echo sprintf(
                "<b>Code</b> %s <b>Category</b> %s",
                $stkCode[$i], $records[(string) $stkCode[$i]]->category_path
            );
        }
    }
}

?>

zasoby.xml

<?xml version="1.0" encoding="Windows-1250"?>
<rsp:responsePack version="2.0" id="Usr01" state="ok" note="46895680" programVersion="10608.3 E1 (13.3.2014)" xmlns:rsp="http://www.stormware.cz/schema/version_2/response.xsd" xmlns:lStk="http://www.stormware.cz/schema/version_2/list_stock.xsd" xmlns:stk="http://www.stormware.cz/schema/version_2/stock.xsd">
    <rsp:responsePackItem version="2.0" id="Usr01" state="ok">
        <lStk:listStock version="2.0" dateTimeStamp="2014-04-08T14:18:14" dateValidFrom="2014-04-08" state="ok">
            <lStk:stock version="2.0">
                <stk:code>90000000</stk:code>
                <stk:count>975.0</stk:count>
                <stk:name>Product name</stk:name>
            </lStk:stock>
        </lStk:listStock>
    </rsp:responsePackItem>
</rsp:responsePack>

db.xml

<?xml version="1.0" encoding="utf-8" ?>
<data>
    <record>
        <product_id><![CDATA[90000000]]></product_id>
        <name><![CDATA[productname]]></name>
        <Deeplink><![CDATA[product url]]></Deeplink>
        <Img_url><![CDATA[product img_url]]></Img_url>
        <category_path><![CDATA[product category]]></category_path>
        <Price><![CDATA[product price]]></Price>
    </record>
</data>

With these XML files I'm getting the following output:

Code 90000000 Category product category

A problem with this implementation is the memory consumption of the $records array: if the second XML file gets too big, you end up with an array of thousands of elements. If this problem arises you could solve it by building an SQLite database file on disk instead of an array, or by not storing the full SimpleXMLElement $record object in the array under each product_id key.
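As a sketch of that second option, indexing only the one field the loop actually reads (category_path) already shrinks the array considerably. The inline XML below is a cut-down copy of the db.xml sample:

```php
<?php
// Sketch: store only the category_path string per product_id instead of
// the whole SimpleXMLElement record, reducing the array's memory footprint.
$db_xml = new SimpleXMLElement(<<<XML
<data>
    <record>
        <product_id><![CDATA[90000000]]></product_id>
        <category_path><![CDATA[product category]]></category_path>
    </record>
</data>
XML);

$records = [];
foreach ($db_xml->record as $record) {
    // Cast both key and value to plain strings; no SimpleXMLElement survives.
    $records[(string) $record->product_id] = (string) $record->category_path;
}

echo $records['90000000']; // product category
```

The lookup in the display loop stays the same; only the stored value changes from an object to a short string.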


How to parse a large XML file

For parsing large documents like this I suggest using a streaming parser like XMLReader, which lets you parse XML without loading the entire file into memory at once. Its expand() method also makes it easy to use in tandem with the DOM API.

Tree-based parsers like the DOM are very fast but take up more memory, as the entire document must be loaded. Streaming parsers like XMLReader keep memory use down because you only hold a small piece of the document at a time, but the trade-off is longer processing time.

By using both you can adjust how you use each in tandem in order to get under any hard bounds like memory limits while minimizing processing time.


Example:

$dom    = new DOMDocument();
$xpath  = new DOMXPath($dom);
$reader = new XMLReader();
$reader->open('file.xml');

while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'Book') {
        $node = $dom->importNode($reader->expand(), true);
        $result = $xpath->evaluate(
            'string(self::Book[BookCode = "AD0WNR"]/Subject)',
            $node
        );
        if ($result) {
            echo $result;
            $reader->close();
            break;
        }
    }
}

This iterates through the nodes in the XML. Whenever it hits a <Book> element we:

  1. Import it into the DOM.
  2. Evaluate the XPath expression*.

If the XPath expression found what we're looking for:

  1. Print the result.
  2. Close the file.
  3. Break the read loop.

We do #2 and #3 because we're only looking for a single result. If you have more you want to find, remove those steps and keep on trucking.
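For that multi-result case, here is a hedged variant (my sketch, not the original answer's code) that collects the Subject of every <Book> instead of stopping at the first. A tiny generated file stands in for the real document:

```php
<?php
// Sketch: collect every <Book>'s Subject rather than breaking on the first
// match. The sample file and its contents are made up for the demo.
file_put_contents('books.xml', '<Books>'
    . '<Book><BookCode>AD0WNR</BookCode><Subject>History</Subject></Book>'
    . '<Book><BookCode>XY1234</BookCode><Subject>Science</Subject></Book>'
    . '</Books>');

$dom    = new DOMDocument();
$xpath  = new DOMXPath($dom);
$reader = new XMLReader();
$reader->open('books.xml');

$subjects = [];
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'Book') {
        // Expand just this <Book> subtree into the DOM and query it.
        $node = $dom->importNode($reader->expand(), true);
        $subjects[] = $xpath->evaluate('string(self::Book/Subject)', $node);
    }
}
$reader->close();

print_r($subjects); // History, then Science
```

Each <Book> is expanded and discarded in turn, so memory stays bounded no matter how many results you collect, as long as the collected values themselves stay small.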


(* I've replaced the initial double forward slash in the XPath expression with self:: to act on the context node passed as the second parameter to evaluate() - thanks, @ThW)


