Parsing Huge XML Files in PHP
There are only two PHP APIs that are really suited for processing large files. The first is the old expat API, and the second is the newer XMLReader functions. These APIs read continuous streams rather than loading the entire tree into memory (which is what SimpleXML and DOM do).
For an example, you might want to look at this partial parser of the DMOZ catalog:
<?php
class SimpleDMOZParser
{
    protected $_stack = array();
    protected $_file = "";
    protected $_parser = null;
    protected $_currentId = "";
    protected $_current = "";

    public function __construct($file)
    {
        $this->_file = $file;
        $this->_parser = xml_parser_create("UTF-8");
        // Passing array callables directly avoids xml_set_object(),
        // which is deprecated as of PHP 8.0.
        xml_set_element_handler(
            $this->_parser,
            array($this, "startTag"),
            array($this, "endTag")
        );
    }

    public function startTag($parser, $name, $attribs)
    {
        array_push($this->_stack, $this->_current);
        if ($name == "TOPIC" && count($attribs)) {
            $this->_currentId = $attribs["R:ID"];
        }
        if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) {
            echo $attribs["R:RESOURCE"] . "\n";
        }
        $this->_current = $name;
    }

    public function endTag($parser, $name)
    {
        $this->_current = array_pop($this->_stack);
    }

    public function parse()
    {
        $fh = fopen($this->_file, "r");
        if (!$fh) {
            die("Could not open " . $this->_file . "\n");
        }
        while (!feof($fh)) {
            $data = fread($fh, 4096);
            xml_parse($this->_parser, $data, feof($fh));
        }
        fclose($fh);
    }
}

$parser = new SimpleDMOZParser("content.rdf.u8");
$parser->parse();
Parsing extremely large XML files in PHP
In PHP, you can read extremely large XML files with XMLReader:
$reader = new XMLReader();
$reader->open($xmlfile);
Extremely large XML files should be stored in a compressed format on disk. This makes sense, as XML files have a high compression ratio; for example, gzipped as large.xml.gz.
PHP supports that quite well with XMLReader via the compression wrappers:
$xmlfile = 'compress.zlib://path/to/large.xml.gz';
$reader = new XMLReader();
$reader->open($xmlfile);
XMLReader only lets you operate on the current element; it is forward-only. If you need to keep parser state, you have to build it yourself.
I often find it helpful to wrap the basic movements into a set of iterators that know how to operate on XMLReader, such as iterating through elements or child elements only. You'll find this outlined in Parse XML with PHP and XMLReader.
See also:
- PHP open gzipped XML
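One way to sketch that iterator idea is a generator that positions the XMLReader on each element of a given name and yields it as a SimpleXMLElement, so only one element is materialized at a time. The `<item>` element name and the feed structure below are illustrative assumptions, not from the original; the same generator works whether the reader was opened on a plain file, a compress.zlib:// wrapped .gz file, or (as here, for the demo) an in-memory string.

```php
<?php
// Generator wrapping XMLReader: yields each <$name> element as a
// SimpleXMLElement, skipping its subtree afterwards with next().
function elements(XMLReader $reader, string $name): Generator
{
    // Move to the first matching element...
    while ($reader->read() && $reader->name !== $name);
    // ...then yield each sibling of that name in turn.
    while ($reader->name === $name) {
        yield new SimpleXMLElement($reader->readOuterXML());
        $reader->next($name);
    }
}

$reader = new XMLReader();
$reader->XML('<feed><item><title>first</title></item><item><title>second</title></item></feed>');

foreach (elements($reader, 'item') as $item) {
    echo $item->title, "\n";
}
$reader->close();
```

With a real file you would call `$reader->open('compress.zlib://path/to/large.xml.gz')` instead of `$reader->XML(...)`; the consuming loop stays the same.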
Parse Large XML File in PHP Efficiently to Generate SQL
Alright, here is a working example with significant improvements in execution speed, memory usage, and database load:
<?php
define('INSERT_BATCH_SIZE', 500);
define('DRUG_XML_FILE', 'drugbank.xml');

$servername = "localhost";
$username = "root";
$password = "pass";
$dbname = "dbname";

function parseXml($mysql)
{
    $drugs = array();
    $xmlReader = new XMLReader();
    $xmlReader->open(DRUG_XML_FILE);

    // Move our pointer to the first <drug /> element.
    while ($xmlReader->read() && $xmlReader->name !== 'drug');

    $drugCount = 0;
    $totalDrugs = 0;

    // Iterate over the outer <drug /> elements.
    while ($xmlReader->name == 'drug') {
        // Convert the node into a SimpleXMLElement for ease of use.
        $item = new SimpleXMLElement($xmlReader->readOuterXML());

        // Escape every value before interpolating it into the SQL string.
        $name        = $mysql->real_escape_string((string) $item->name);
        $description = $mysql->real_escape_string((string) $item->description);
        $casNumber   = $mysql->real_escape_string((string) $item->{'cas-number'});
        $created     = $mysql->real_escape_string((string) $item['created']);
        $updated     = $mysql->real_escape_string((string) $item['updated']);
        $type        = $mysql->real_escape_string((string) $item['type']);

        $drugs[] = "('$name', '$description', '$casNumber', '$created', '$updated', '$type')";
        $drugCount++;
        $totalDrugs++;

        // Once we've reached the desired batch size, insert the batch and reset the counter.
        if ($drugCount >= INSERT_BATCH_SIZE) {
            batchInsertDrugs($mysql, $drugs);
            $drugCount = 0;
        }

        // Go to the next <drug />.
        $xmlReader->next('drug');
    }

    $xmlReader->close();

    // Insert the leftovers from the last batch.
    batchInsertDrugs($mysql, $drugs);

    echo "Inserted $totalDrugs total drugs.";
}

function batchInsertDrugs($mysql, &$drugs)
{
    // Nothing left to insert (e.g. the total was an exact multiple of the batch size).
    if (empty($drugs)) {
        return;
    }

    // Generate a batched INSERT statement.
    $statement = "INSERT INTO `drug` (name, description, cas_number, created, updated, type) VALUES";
    $statement = $statement . ' ' . implode(",\n", $drugs);

    // Run the batch INSERT.
    if ($mysql->query($statement)) {
        echo "Inserted " . count($drugs) . " drugs.";
    } else {
        echo "INSERT Error: " . $statement . "<br>" . $mysql->error . "<br>";
    }

    // Clear the buffer.
    $drugs = array();
}

// Create MySQL connection.
$mysql = new mysqli($servername, $username, $password, $dbname);
if ($mysql->connect_error) {
    die("Connection failed: " . $mysql->connect_error);
}

parseXml($mysql);
I tested this example using the same dataset.
Using SimpleXML the way you are parses the entire document into memory, which is slow and memory-intensive. This approach uses XMLReader, a fast pull parser. You could probably make it faster still with the SAX-style XML Parser extension, but that is a more complex pattern, and the example above will already be noticeably better than what you started with.
The other significant change is the use of MySQL batched inserts: we only actually hit the database once every 500 (configurable) items we process. You can tweak this number for better performance; after a certain point the query becomes too large for MySQL to process, but you may well be able to do far more than 500 at a time.
If you'd like me to explain any part of this further, or if you have any problems with it, just let me know in the comments! :)
How to parse a large XML file
For parsing large documents like this I suggest using a streaming parser like XMLReader, which lets you parse XML without loading the entire file into memory at once. Its expand() method also makes it easy to use hand in hand with the DOM API.
Tree-based parsers like the DOM are very fast, but use more memory, as the entire document must be loaded. Streaming parsers like XMLReader keep memory use down because you only hold a small piece of the document at a time, but the trade-off is longer processing time.
By using both you can adjust how you use each in tandem in order to get under any hard bounds like memory limits while minimizing processing time.
Example:
$dom = new DOMDocument();
$xpath = new DOMXPath($dom);

$reader = new XMLReader();
$reader->open('file.xml');

while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'Book') {
        $node = $dom->importNode($reader->expand(), true);
        $result = $xpath->evaluate(
            'string(self::Book[BookCode = "AD0WNR"]/Subject)',
            $node
        );
        if ($result) {
            echo $result;
            $reader->close();
            break;
        }
    }
}
What this does is iterate through the nodes in the XML. Whenever it hits a <Book> element, we:
- Import that into the DOM.
- Evaluate the XPath expression*.
If the XPath expression finds what we're looking for:
- Print the result.
- Close the file.
- Break the read loop.
We do #2 and #3 because we're only looking for a single result. If you have more you want to find, remove those and keep on trucking.
(* I've replaced the initial double forward slash in the XPath expression with self:: so it acts on the context node passed as the second parameter to evaluate() - thanks, @ThW)
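To show the "keep on trucking" variant: if you want every matching <Book> rather than just the first, drop the close()/break and collect results as you go. The inline XML below stands in for file.xml and is an illustrative assumption; the element names mirror the example above.

```php
<?php
$xml = '<Books>'
     . '<Book><BookCode>AD0WNR</BookCode><Subject>PHP</Subject></Book>'
     . '<Book><BookCode>XY123</BookCode><Subject>XML</Subject></Book>'
     . '</Books>';

$dom = new DOMDocument();
$xpath = new DOMXPath($dom);

$reader = new XMLReader();
$reader->XML($xml);

$subjects = [];
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'Book') {
        $node = $dom->importNode($reader->expand(), true);
        // Collect every Book's Subject instead of stopping at the first hit.
        $result = $xpath->evaluate('string(self::Book/Subject)', $node);
        if ($result) {
            $subjects[] = $result;
        }
    }
}
$reader->close();

print_r($subjects);
```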
Best way to process large XML in PHP
For a large file, you'll want to use a SAX parser rather than a DOM parser.
With a DOM parser, the whole file is read and loaded into an object tree in memory. With a SAX parser, the file is read sequentially and your user-defined callback functions are called to handle the data (start tags, end tags, CDATA, etc.).
With a SAX parser you'll need to maintain state yourself (e.g. which tag you are currently in), which makes it a bit more complicated, but for a large file it will be much more memory-efficient.
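A minimal sketch of that state-keeping, using PHP's expat-based xml_parser functions: a flag records whether we are currently inside a <title> element, and the character-data callback consults it. The <book>/<title> element names and inline XML are illustrative assumptions.

```php
<?php
$titles = [];
$inTitle = false;

$parser = xml_parser_create('UTF-8');
// Keep element names as-is instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($p, $name, $attribs) use (&$inTitle) {
        // State: remember that we just entered a <title> element.
        if ($name === 'title') {
            $inTitle = true;
        }
    },
    function ($p, $name) use (&$inTitle) {
        if ($name === 'title') {
            $inTitle = false;
        }
    }
);

xml_set_character_data_handler($parser, function ($p, $data) use (&$inTitle, &$titles) {
    // Character data only matters when the state says we're inside <title>.
    if ($inTitle) {
        $titles[] = $data;
    }
});

xml_parse($parser, '<books><book><title>A</title></book><book><title>B</title></book></books>', true);
xml_parser_free($parser);

print_r($titles);
```

With a large file you would feed xml_parse() chunks from fread() in a loop, as in the DMOZ example above, rather than one string.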
How to edit large XML files in PHP based on a record in the XML Node
Goal
Desired result: I want to create a new XML file with only the records where the child "ShowOnWebsite" is true.
Given
test.xml
<Items>
<Item>
<Barcode>...</Barcode>
<BrandCode>...</BrandCode>
<Title>...</Title>
<Content>...</Content>
<ShowOnWebsite>false</ShowOnWebsite>
</Item>
<Item>
<Barcode>...</Barcode>
<BrandCode>...</BrandCode>
<Title>...</Title>
<Content>...</Content>
<ShowOnWebsite>true</ShowOnWebsite>
</Item>
<Item>
<Barcode>...</Barcode>
<BrandCode>...</BrandCode>
<Title>...</Title>
<Content>...</Content>
<ShowOnWebsite>false</ShowOnWebsite>
</Item>
</Items>
Code
This is the implementation I wrote. The getItems function yields the child items without loading the whole XML into memory at once.
function getItems($fileName) {
    if ($file = fopen($fileName, "r")) {
        $buffer = "";
        $active = false;
        while (!feof($file)) {
            $line = fgets($file);
            $line = trim(str_replace(["\r", "\n"], "", $line));
            if ($line == "<Item>") {
                $buffer .= $line;
                $active = true;
            } elseif ($line == "</Item>") {
                $buffer .= $line;
                $active = false;
                yield new SimpleXMLElement($buffer);
                $buffer = "";
            } elseif ($active == true) {
                $buffer .= $line;
            }
        }
        fclose($file);
    }
}
$output = new SimpleXMLElement('<?xml version="1.0" encoding="utf-8"?><Items></Items>');

foreach (getItems("test.xml") as $element) {
    if ($element->ShowOnWebsite == "true") {
        $item = $output->addChild('Item');
        $item->addChild('Barcode', (string) $element->Barcode);
        $item->addChild('BrandCode', (string) $element->BrandCode);
        $item->addChild('Title', (string) $element->Title);
        $item->addChild('Content', (string) $element->Content);
        $item->addChild('ShowOnWebsite', (string) $element->ShowOnWebsite);
    }
}

$fileName = __DIR__ . "/test_" . rand(100, 999999) . ".xml";
$output->asXML($fileName);
Output
<?xml version="1.0" encoding="utf-8"?>
<Items><Item><Barcode>...</Barcode><BrandCode>...</BrandCode><Title>...</Title><Content>...</Content><ShowOnWebsite>true</ShowOnWebsite></Item></Items>
How can I parse large XML files in PHP?
Switch from SimpleXML to XMLReader when working with large XML files. It is a pull parser that does not load the entire file into memory to process it.
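The basic shape of that switch looks like this. For the demo a tiny sample file is written first; in practice large.xml would be your multi-gigabyte input, and the <record>/<id> element names are illustrative assumptions.

```php
<?php
file_put_contents('large.xml',
    '<records><record><id>1</id></record><record><id>2</id></record></records>');

$reader = new XMLReader();
$reader->open('large.xml');

$ids = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'record') {
        // Only this one element's XML is materialized, never the whole file.
        $record = new SimpleXMLElement($reader->readOuterXML());
        $ids[] = (string) $record->id;
    }
}
$reader->close();

print_r($ids);
```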