Best way to process large XML in PHP
For a large file, you'll want to use a SAX parser rather than a DOM parser.
A DOM parser reads the whole file and loads it into an object tree in memory. A SAX parser reads the file sequentially and calls your user-defined callback functions to handle the data (start tags, end tags, CDATA, etc.).
With a SAX parser you'll need to maintain state yourself (e.g. which tag you are currently inside), which makes it a bit more complicated, but for a large file it will be much more memory-efficient.
Parsing Huge XML Files in PHP
There are only two PHP APIs that are really suited for processing large files. The first is the old expat API, and the second is the newer XMLReader functions. These APIs read continuous streams rather than loading the entire tree into memory (which is what SimpleXML and DOM do).
For an example, you might want to look at this partial parser of the DMOZ catalog:
&lt;?php
class SimpleDMOZParser
{
    protected $_stack = array();
    protected $_file = "";
    protected $_parser = null;
    protected $_currentId = "";
    protected $_current = "";

    public function __construct($file)
    {
        $this->_file = $file;
        $this->_parser = xml_parser_create("UTF-8");
        // Register the handlers as array callables (avoids the deprecated xml_set_object()).
        xml_set_element_handler($this->_parser, array($this, "startTag"), array($this, "endTag"));
    }

    public function startTag($parser, $name, $attribs)
    {
        array_push($this->_stack, $this->_current);
        if ($name == "TOPIC" && count($attribs)) {
            $this->_currentId = $attribs["R:ID"];
        }
        if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) {
            echo $attribs["R:RESOURCE"] . "\n";
        }
        $this->_current = $name;
    }

    public function endTag($parser, $name)
    {
        $this->_current = array_pop($this->_stack);
    }

    public function parse()
    {
        $fh = fopen($this->_file, "r");
        if (!$fh) {
            die("Could not open " . $this->_file . "\n");
        }
        // Feed the file to the parser in 4 KB chunks; the final call flags EOF.
        while (!feof($fh)) {
            $data = fread($fh, 4096);
            xml_parse($this->_parser, $data, feof($fh));
        }
        fclose($fh);
        xml_parser_free($this->_parser);
    }
}

$parser = new SimpleDMOZParser("content.rdf.u8");
$parser->parse();
Parse Large XML File in PHP Efficiently to Generate SQL
Alright, here's a working example with big improvements in execution speed, memory usage, and database load:
&lt;?php
define('INSERT_BATCH_SIZE', 500);
define('DRUG_XML_FILE', 'drugbank.xml');

$servername = "localhost";
$username = "root";
$password = "pass";
$dbname = "dbname";

function parseXml($mysql)
{
    $drugs = array();
    $xmlReader = new XMLReader();
    $xmlReader->open(DRUG_XML_FILE);

    // Move our pointer to the first <drug /> element.
    while ($xmlReader->read() && $xmlReader->name !== 'drug');

    $drugCount = 0;
    $totalDrugs = 0;

    // Iterate over the outer <drug /> elements.
    while ($xmlReader->name == 'drug') {
        // Convert the node into a SimpleXMLElement for ease of use.
        $item = new SimpleXMLElement($xmlReader->readOuterXML());

        // Escape every value before it goes into the SQL string.
        $name        = $mysql->real_escape_string($item->name);
        $description = $mysql->real_escape_string($item->description);
        $casNumber   = $mysql->real_escape_string($item->{'cas-number'});
        $created     = $mysql->real_escape_string($item['created']);
        $updated     = $mysql->real_escape_string($item['updated']);
        $type        = $mysql->real_escape_string($item['type']);

        $drugs[] = "('$name', '$description', '$casNumber', '$created', '$updated', '$type')";
        $drugCount++;
        $totalDrugs++;

        // Once we've reached the desired batch size, insert the batch and reset the counter.
        if ($drugCount >= INSERT_BATCH_SIZE) {
            batchInsertDrugs($mysql, $drugs);
            $drugCount = 0;
        }

        // Go to the next <drug />.
        $xmlReader->next('drug');
    }

    $xmlReader->close();

    // Insert the leftovers from the last batch.
    batchInsertDrugs($mysql, $drugs);

    echo "Inserted $totalDrugs total drugs.";
}

function batchInsertDrugs($mysql, &$drugs)
{
    if (empty($drugs)) {
        return;
    }

    // Generate a batched INSERT statement.
    $statement = "INSERT INTO `drug` (name, description, cas_number, created, updated, type) VALUES";
    $statement = $statement . ' ' . implode(",\n", $drugs);
    // echo $statement, "\n"; // Uncomment to debug the generated SQL.

    // Run the batch INSERT.
    if ($mysql->query($statement)) {
        echo "Inserted " . count($drugs) . " drugs.";
    } else {
        echo "INSERT Error: " . $statement . "<br>" . $mysql->error . "<br>";
    }

    // Clear the buffer.
    $drugs = array();
}

// Create MySQL connection.
$mysql = new mysqli($servername, $username, $password, $dbname);
if ($mysql->connect_error) {
    die("Connection failed: " . $mysql->connect_error);
}

parseXml($mysql);
I tested this example using the same dataset.
Using SimpleXML in the way that you are leads to parsing the entire document in memory, which is slow and memory-intensive. This approach uses XMLReader, which is a fast pull-parser. You can probably make this faster still using the PHP SAX XML Parser, but it's a bit more complex of a pattern, and the above example will be noticeably better than what you started with.
The other significant change in my example is that we're using MySQL batched inserts, so we only actually hit the database every 500 (configurable) items we process. You can tweak this number for better performance. After a certain point, the query will become too large for MySQL to process, but you may be able to do a lot more than 500 at a time.
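One caveat with the batch above: the values are interpolated straight into the SQL string, so each one has to be escaped. A sketch of a safer variant using a single prepared statement per batch (assuming the same six-column `drug` table; `buildBatchInsertSql` and `batchInsertDrugsPrepared` are helper names introduced here for illustration):

```php
<?php
// Build "INSERT ... VALUES (?, ?, ?, ?, ?, ?), (?, ?, ?, ?, ?, ?), ..."
// with one placeholder group per row.
function buildBatchInsertSql(int $rows): string
{
    $group = '(?, ?, ?, ?, ?, ?)';
    return 'INSERT INTO `drug` (name, description, cas_number, created, updated, type) VALUES '
        . implode(', ', array_fill(0, $rows, $group));
}

// Insert a batch of rows, each row being a 6-element array of values.
function batchInsertDrugsPrepared(mysqli $mysql, array $rows): void
{
    if (!$rows) {
        return;
    }
    $stmt = $mysql->prepare(buildBatchInsertSql(count($rows)));

    // Flatten the rows into one argument list and bind everything as strings.
    $flat  = array_merge(...$rows);
    $types = str_repeat('s', count($flat));
    $stmt->bind_param($types, ...$flat);

    $stmt->execute();
    $stmt->close();
}
```

Argument unpacking into bind_param() works here because its variadic parameter takes references, so the array elements are bound directly.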
If you'd like me to explain any part of this further, or if you have any problems with it, just let me know in the comments! :)
Parsing extremely large XML files in php
In PHP, you can read extremely large XML files with XMLReader:
$reader = new XMLReader();
$reader->open($xmlfile);
Extremely large XML files should be stored in a compressed format on disk. This makes sense, as XML files have a high compression ratio; for example, gzipped as large.xml.gz.
PHP supports that quite well with XMLReader via the compression wrappers:
$xmlfile = 'compress.zlib://path/to/large.xml.gz';
$reader = new XMLReader();
$reader->open($xmlfile);
XMLReader only lets you operate on the current element; it is forward-only. If you need to keep parser state, you need to build it yourself.
I often find it helpful to wrap the basic movements into a set of iterators that know how to operate on XMLReader, like iterating through elements or child elements only. You'll find this outlined in Parse XML with PHP and XMLReader.
See as well:
- PHP open gzipped XML
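The iterator idea can be sketched as a generator that hops from sibling to sibling and hands each element over as a SimpleXMLElement (the element name 'item' is just an example):

```php
<?php
// Sketch: walk same-named sibling elements with XMLReader and yield each
// one as a SimpleXMLElement. Works on compressed files too when given a
// compress.zlib:// path.
function elements(string $xmlfile, string $element): Generator
{
    $reader = new XMLReader();
    $reader->open($xmlfile);

    // Advance to the first matching element.
    while ($reader->read() && $reader->name !== $element);

    while ($reader->name === $element) {
        yield new SimpleXMLElement($reader->readOuterXml());
        $reader->next($element); // jump to the next sibling, skipping its subtree
    }

    $reader->close();
}
```

Usage would then be `foreach (elements('compress.zlib://path/to/large.xml.gz', 'item') as $item) { ... }`; only one element is in memory at a time.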
How to edit large XML files in PHP based on a record in the XML Node
Goal
Desired result: I want to create a new XML file with only the records where the child "ShowOnWebsite" is true.
Given
test.xml
<Items>
<Item>
<Barcode>...</Barcode>
<BrandCode>...</BrandCode>
<Title>...</Title>
<Content>...</Content>
<ShowOnWebsite>false</ShowOnWebsite>
</Item>
<Item>
<Barcode>...</Barcode>
<BrandCode>...</BrandCode>
<Title>...</Title>
<Content>...</Content>
<ShowOnWebsite>true</ShowOnWebsite>
</Item>
<Item>
<Barcode>...</Barcode>
<BrandCode>...</BrandCode>
<Title>...</Title>
<Content>...</Content>
<ShowOnWebsite>false</ShowOnWebsite>
</Item>
</Items>
Code
This is the implementation I wrote. The getItems function yields the child elements without loading the whole XML into memory at once.
function getItems($fileName) {
    if ($file = fopen($fileName, "r")) {
        $buffer = "";
        $active = false;
        // Note: this assumes the opening and closing <Item> tags each sit on their own line.
        while (!feof($file)) {
            $line = fgets($file);
            $line = trim(str_replace(["\r", "\n"], "", $line));
            if ($line == "<Item>") {
                $buffer .= $line;
                $active = true;
            } elseif ($line == "</Item>") {
                $buffer .= $line;
                $active = false;
                yield new SimpleXMLElement($buffer);
                $buffer = "";
            } elseif ($active == true) {
                $buffer .= $line;
            }
        }
        fclose($file);
    }
}
$output = new SimpleXMLElement('&lt;?xml version="1.0" encoding="utf-8"?&gt;&lt;Items&gt;&lt;/Items&gt;');

foreach (getItems("test.xml") as $element)
{
    if ($element->ShowOnWebsite == "true") {
        $item = $output->addChild('Item');
        $item->addChild('Barcode', (string) $element->Barcode);
        $item->addChild('BrandCode', (string) $element->BrandCode);
        $item->addChild('Title', (string) $element->Title);
        $item->addChild('Content', (string) $element->Content);
        $item->addChild('ShowOnWebsite', (string) $element->ShowOnWebsite);
    }
}

$fileName = __DIR__ . "/test_" . rand(100, 999999) . ".xml";
$output->asXML($fileName);
Output
<?xml version="1.0" encoding="utf-8"?>
<Items><Item><Barcode>...</Barcode><BrandCode>...</BrandCode><Title>...</Title><Content>...</Content><ShowOnWebsite>true</ShowOnWebsite></Item></Items>
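If the filtered output itself gets large, the write side can be streamed too. A sketch with XMLReader plus XMLWriter, so neither the input nor the output has to fit in memory (filterItems is an illustrative name; the file names are examples):

```php
<?php
// Sketch: stream-filter <Item> records with XMLReader on the read side
// and XMLWriter on the write side.
function filterItems(string $inFile, string $outFile): void
{
    $reader = new XMLReader();
    $reader->open($inFile);

    $writer = new XMLWriter();
    $writer->openUri($outFile);
    $writer->startDocument('1.0', 'utf-8');
    $writer->startElement('Items');

    // Advance to the first <Item>, then hop from sibling to sibling.
    while ($reader->read() && $reader->name !== 'Item');
    while ($reader->name === 'Item') {
        $xml  = $reader->readOuterXml();
        $item = new SimpleXMLElement($xml);
        if ((string) $item->ShowOnWebsite === 'true') {
            $writer->writeRaw($xml); // copy the matching node verbatim
        }
        $reader->next('Item');
    }

    $writer->endElement(); // </Items>
    $writer->endDocument();
    $writer->flush();
    $reader->close();
}

// filterItems('test.xml', 'test_filtered.xml');
```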
Processing large XML files using PHP
Using a while loop to go through the entire $db_xml document each time you need to search for a product is inefficient. Importing the second XML file into an SQL table is not a bad idea, but it seems a bit annoying when you can actually use a PHP array indexed by product_id.
I've prepared some code to illustrate my point:
&lt;?php
$zasoby_xml = file_get_contents('zasoby.xml');

$sxe0 = new SimpleXMLElement($zasoby_xml);
$sxe0->registerXPathNamespace('lStk', 'http://www.stormware.cz/schema/version_2/list_stock.xsd');
$lStkStock = $sxe0->xpath('//lStk:stock');
$cnt = count($lStkStock);

$sxe = new SimpleXMLElement($zasoby_xml);
$sxe->registerXPathNamespace('stk', 'http://www.stormware.cz/schema/version_2/stock.xsd');
$stkCode  = $sxe->xpath('//stk:code');  // product code
$stkName  = $sxe->xpath('//stk:name');  // product name
$stkCount = $sxe->xpath('//stk:count'); // count in the stock

$db_xml = simplexml_load_file('db.xml');

// Loop through the record elements of db.xml to build an array that can be accessed by product_id.
$records = array();
foreach ($db_xml->record as $record) {
    $records[(string)$record->product_id] = $record;
}

// Loop through all products to display their information.
for ($i = 0; $i < $cnt; $i++) {
    // Display only products in stock.
    if ((float)$stkCount[$i] > 0) {
        // Access this record directly by product_id (code) instead of looping through all records in db.xml.
        if (isset($records[(string)$stkCode[$i]])) {
            echo sprintf(
                "<b>Code</b> %s <b>Category</b> %s",
                $stkCode[$i],
                $records[(string)$stkCode[$i]]->category_path
            );
        }
    }
}
?&gt;
zasoby.xml
<?xml version="1.0" encoding="Windows-1250"?>
<rsp:responsePack version="2.0" id="Usr01" state="ok" note="46895680" programVersion="10608.3 E1 (13.3.2014)" xmlns:rsp="http://www.stormware.cz/schema/version_2/response.xsd" xmlns:lStk="http://www.stormware.cz/schema/version_2/list_stock.xsd" xmlns:stk="http://www.stormware.cz/schema/version_2/stock.xsd">
<rsp:responsePackItem version="2.0" id="Usr01" state="ok">
<lStk:listStock version="2.0" dateTimeStamp="2014-04-08T14:18:14" dateValidFrom="2014-04-08" state="ok">
<lStk:stock version="2.0">
<stk:code>90000000</stk:code>
<stk:count>975.0</stk:count>
<stk:name>Product name</stk:name>
</lStk:stock>
</lStk:listStock></rsp:responsePackItem></rsp:responsePack>
db.xml
<?xml version="1.0" encoding="utf-8" ?>
<data>
<record>
<product_id><![CDATA[90000000]]></product_id>
<name><![CDATA[productname]]></name>
<Deeplink><![CDATA[product url]]></Deeplink>
<Img_url><![CDATA[product img_url]]></Img_url>
<category_path><![CDATA[product category]]></category_path>
<Price><![CDATA[product price]]></Price>
</record>
</data>
With these XML files I'm getting the following output:
Code 90000000 Category product category
A problem with this implementation is the memory consumption of the $records array. If the second XML file gets too big you are going to end up with an array of thousands of elements. If this problem arises you could solve it by building an SQLite database file on disk instead of an array, or by not storing the full SimpleXMLElement $record object in the array under each product_id key.
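The SQLite route might look like this; a sketch assuming PDO with the sqlite driver, with illustrative table and column names:

```php
<?php
// Sketch: index records on disk by product_id so the lookup table
// doesn't have to live in a PHP array.
function buildIndex(PDO $db, iterable $records): void
{
    $db->exec('CREATE TABLE IF NOT EXISTS record (product_id TEXT PRIMARY KEY, category_path TEXT)');
    $insert = $db->prepare('INSERT OR REPLACE INTO record VALUES (?, ?)');

    // One transaction around the whole load keeps the inserts fast.
    $db->beginTransaction();
    foreach ($records as $r) {
        $insert->execute([(string) $r['product_id'], (string) $r['category_path']]);
    }
    $db->commit();
}

function lookupCategory(PDO $db, string $productId): ?string
{
    $q = $db->prepare('SELECT category_path FROM record WHERE product_id = ?');
    $q->execute([$productId]);
    $row = $q->fetchColumn();
    return $row === false ? null : $row;
}
```

Pointing PDO at a file path (e.g. `new PDO('sqlite:/path/to/index.db')`) keeps the index on disk; memory use then stays flat no matter how many records db.xml contains.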
How to parse a large XML file
For parsing large documents like this I suggest using a streaming parser like XMLReader, which will allow you to parse the XML without loading the entire file into memory at once. By using its expand() method it's easy to use it hand in hand with the DOM API.
Tree-based parsers like the DOM are very fast, but take up more memory as the entire document must be loaded up. Streaming parsers like XMLReader keep the memory use down as you're only grabbing a bit of the document at a time, but the trade off is longer processing time.
By using both you can adjust how you use each in tandem in order to get under any hard bounds like memory limits while minimizing processing time.
Example:
$dom = new DOMDocument();
$xpath = new DOMXPath($dom);

$reader = new XMLReader();
$reader->open('file.xml');

while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'Book') {
        $node = $dom->importNode($reader->expand(), true);
        $result = $xpath->evaluate(
            'string(self::Book[BookCode = "AD0WNR"]/Subject)',
            $node
        );
        if ($result) {
            echo $result;
            $reader->close();
            break;
        }
    }
}
What this is doing is iterating through the nodes in the XML. Whenever it hits a &lt;Book&gt; element we:
- Import that into the DOM.
- Evaluate the XPath expression*.
If the XPath expression found what we're looking for:
- Print the result.
- Close the file.
- Break the read loop.
We do #2 and #3 because we're only looking for a single result. If there are more you want to find, remove those steps and keep on trucking.
(* I've replaced the initial double forward slash in the XPath expression with self:: so that it acts on the context node passed as the second parameter to evaluate(). Thanks, @ThW.)
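If you need every match rather than just the first, the same pattern can collect results instead of breaking out of the loop. A sketch reusing the element names above (findSubjects is an illustrative helper):

```php
<?php
// Sketch: XMLReader + DOM, collecting the <Subject> of every <Book>
// whose <BookCode> matches, instead of stopping at the first hit.
function findSubjects(string $file, string $bookCode): array
{
    $dom    = new DOMDocument();
    $xpath  = new DOMXPath($dom);
    $reader = new XMLReader();
    $reader->open($file);

    $results = [];
    while ($reader->read()) {
        if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'Book') {
            $node   = $dom->importNode($reader->expand(), true);
            $result = $xpath->evaluate(
                sprintf('string(self::Book[BookCode = "%s"]/Subject)', $bookCode),
                $node
            );
            if ($result !== '') {
                $results[] = $result;
            }
        }
    }
    $reader->close();

    return $results;
}
```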