Simplify PHP DOM XML parsing - how?

Solving Problem 1:

The W3C xml:id specification "defines the meaning of the attribute xml:id as an ID attribute in XML documents and defines processing of this attribute to identify IDs in the absence of validation, without fetching external resources, and without relying on an internal subset."

In other words, when you use

$element->setAttribute('xml:id', 'test');

you do not need to call setIdAttribute, nor specify a DTD or Schema. DOM will recognize the xml:id attribute when used with getElementById, without you having to validate the document. This is the least-effort approach. Note, though, that depending on your OS and version of libxml, getElementById may not work at all.

Solving Problem 2:

Even with IDs not being fetchable with getElementById, you can still very much fetch them with XPath:

$xpath->query('/pages/page[@id=1]');

would definitely work. And you can also fetch the product children for a specific page directly:

$xpath->query('//pages/page[@id=1]/products');

Apart from this, there is very little you can do to make DOM code look less verbose, because it really is a verbose interface. It has to be, because DOM is a language agnostic interface, again defined by the W3C.


EDIT after comment below

It works as I explained above. Here is a full test case for you. The first part is for writing new XML files with DOM. That is where you need to set the xml:id attribute. You use this instead of the regular, non-namespaced id attribute.

// Setup
$dom = new DOMDocument;
$dom->formatOutput = TRUE;
$dom->preserveWhiteSpace = FALSE;
$dom->loadXML('<pages/>');

// How to set a valid id attribute when not using a DTD or Schema
$page1 = $dom->createElement('page');
$page1->setAttribute('xml:id', 'p1');
$page1->appendChild($dom->createElement('product', 'foo1'));
$page1->appendChild($dom->createElement('product', 'foo2'));

// How to set an ID attribute that requires a DTD or Schema when reloaded
$page2 = $dom->createElement('page');
$page2->setAttribute('id', 'p2');
$page2->setIdAttribute('id', TRUE);
$page2->appendChild($dom->createElement('product', 'bar1'));
$page2->appendChild($dom->createElement('product', 'bar2'));

// Appending pages and saving XML
$dom->documentElement->appendChild($page1);
$dom->documentElement->appendChild($page2);
$xml = $dom->saveXML();
unset($dom, $page1, $page2);
echo $xml;

This will create an XML file like this:

<?xml version="1.0"?>
<pages>
  <page xml:id="p1">
    <product>foo1</product>
    <product>foo2</product>
  </page>
  <page id="p2">
    <product>bar1</product>
    <product>bar2</product>
  </page>
</pages>

When you read the XML in again, the new DOM instance no longer knows that you declared the non-namespaced id attribute as an ID attribute with setIdAttribute. It will still be in the XML, but the id attribute will be just a regular attribute. Be aware that ID attributes are special in XML.

// Load the XML we created above
$dom = new DOMDocument;
$dom->loadXML($xml);

Now for some tests:

echo "\n\n GETELEMENTBYID RETURNS ELEMENT WITH XML:ID \n\n";
foreach( $dom->getElementById('p1')->childNodes as $product) {
    echo $product->nodeValue; // Will output foo1 and foo2 with whitespace
}

The above works because a DOM-compliant parser has to recognize that xml:id is an ID attribute, regardless of any DTD or Schema. This is explained in the specs linked above.
It outputs whitespace because, due to the formatted output, there are DOMText nodes between the opening tag, the two product tags and the closing tag, so we are iterating over five nodes. The node concept is crucial to understand when working with XML.
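
To make the five-node count concrete, here is a minimal sketch. The sample XML string and the variable names are my own assumptions, not part of the test case above. It counts all child nodes of a formatted page element, then keeps only the element children:

```php
<?php
// Hypothetical sample: a formatted <page> with two <product> children
$xml = "<page>\n  <product>foo1</product>\n  <product>foo2</product>\n</page>";

$dom = new DOMDocument;
$dom->loadXML($xml); // preserveWhiteSpace is TRUE by default

// All children: text, product, text, product, text
$all = $dom->documentElement->childNodes;
echo $all->length, "\n";

// Keep only element nodes and skip the whitespace-only DOMText nodes
$elements = array();
foreach ($all as $node) {
    if ($node->nodeType === XML_ELEMENT_NODE) {
        $elements[] = $node->nodeValue;
    }
}
echo implode(',', $elements), "\n";
```

Checking nodeType works regardless of how the document was loaded. Setting preserveWhiteSpace to FALSE before loading should also drop the whitespace-only text nodes, but I would still check the node type when the input is not under your control.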

echo "\n\n GETELEMENTBYID CANNOT FETCH NORMAL ID \n\n";
foreach( $dom->getElementById('p2')->childNodes as $product) {
    echo $product->nodeValue; // Not reached: getElementById returns NULL, so PHP raises a NOTICE and a WARNING
}

The above will not work, because id is not an ID attribute. For the DOM parser to recognize it as such, you need a DTD or Schema and the XML must be validated against it.
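
For completeness, a minimal sketch of the DTD route. The inline DTD and the sample document are assumptions of mine, written to mirror the page/product structure used here. It declares id as type ID in an internal subset and validates while loading:

```php
<?php
// Hypothetical document with an internal DTD subset declaring id as type ID
$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE pages [
  <!ELEMENT pages (page*)>
  <!ELEMENT page (product*)>
  <!ELEMENT product (#PCDATA)>
  <!ATTLIST page id ID #REQUIRED>
]>
<pages>
  <page id="p2">
    <product>bar1</product>
    <product>bar2</product>
  </page>
</pages>
XML;

$dom = new DOMDocument;
$dom->validateOnParse = TRUE; // validate against the DTD while loading
$dom->loadXML($xml);

// The plain id attribute is now a real ID, so the lookup succeeds
$page = $dom->getElementById('p2');
echo $page->getAttribute('id'), "\n";
```

This is more setup than xml:id, which is why I called xml:id the least-effort approach.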

echo "\n\n XPATH CAN FETCH NORMAL ID \n\n";
$xPath = new DOMXPath($dom);
$page2 = $xPath->query('/pages/page[@id="p2"]')->item(0);
foreach( $page2->childNodes as $product) {
    echo $product->nodeValue; // Will output bar1 and bar2 with whitespace
}

XPath on the other hand is literal about the attributes, which means you can query the DOM for the page element with attribute id if getElementById is not available. Note that to query the page with ID p1, you'd have to include the namespace, e.g. @xml:id="p1".
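
A minimal sketch of that xml:id query, using a shortened sample document of my own. It assumes the reserved xml prefix can be used in the expression without registering a namespace first, which is how it behaves with libxml as far as I know:

```php
<?php
// Hypothetical sample mirroring the first page of the XML created above
$dom = new DOMDocument;
$dom->loadXML('<pages><page xml:id="p1"><product>foo1</product></page></pages>');

$xPath = new DOMXPath($dom);
// Match the page by its namespaced xml:id attribute
$page1 = $xPath->query('/pages/page[@xml:id="p1"]')->item(0);
echo $page1->nodeValue, "\n";
```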

echo "\n\n XPATH CAN FETCH PRODUCTS FOR PAGE WITH ID \n\n";
$xPath = new DOMXPath($dom);
foreach( $xPath->query('/pages/page[@id="p2"]/product') as $product ) {
    echo $product->nodeValue; // Will output bar1 and bar2 without whitespace
}

As mentioned, you can also use XPath to query anything else in the document. This one will not output whitespace, because it only returns the product elements below the page with id p2.

You can also traverse the entire DOM from a node. It's a tree structure. Since DOMNode is the most important class in DOM, you want to familiarize yourself with it.

echo "\n\n TRAVERSING UP AND DOWN \n\n";
$product = $dom->getElementsByTagName('product')->item(2);
echo $product->tagName; // 'product'
echo $dom->saveXML($product); // '<product>bar1</product>'

// Going from bar1 to foo1
$product = $product->parentNode // Page Node
    ->parentNode                // Pages Node
    ->childNodes->item(1)       // Page p1
    ->childNodes->item(1);      // 1st Product

echo $product->nodeValue; // 'foo1'

// from foo1 to foo2 it is two(!) nodes because the XML is formatted
echo $product->nextSibling->nodeName; // '#text' with whitespace and linebreak
echo $product->nextSibling->nextSibling->nodeName; // 'product'
echo $product->nextSibling->nextSibling->nodeValue; // 'foo2'
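
If counting text nodes by hand gets tedious, a small helper can skip them. This is only a sketch: nextElement is a hypothetical function of mine, not part of the DOM API, and the sample XML is made up for the example.

```php
<?php
// Hypothetical helper (not part of DOM): the next sibling that is an element
function nextElement(DOMNode $node)
{
    $sibling = $node->nextSibling;
    while ($sibling !== null && $sibling->nodeType !== XML_ELEMENT_NODE) {
        $sibling = $sibling->nextSibling;
    }
    return $sibling; // a DOMElement, or NULL when there is none
}

$dom = new DOMDocument;
$dom->loadXML("<page>\n  <product>foo1</product>\n  <product>foo2</product>\n</page>");

$first = $dom->getElementsByTagName('product')->item(0);
echo $first->nextSibling->nodeName, "\n";  // '#text'
echo nextElement($first)->nodeValue, "\n"; // 'foo2'
```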

On a sidenote, yes, I do have a typo in the original code above. It's product not products. But I find it hardly justified to claim the code does not work when all you have to change is an s. That just feels too much like wanting to be spoonfed.

How do you parse and process HTML/XML in PHP?


Native XML Extensions

I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the 3rd party libs and give me all the control I need over the markup.

DOM

The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.

DOM is capable of parsing and modifying real world (broken) HTML and it can do XPath queries. It is based on libxml.
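
A minimal sketch of what that looks like in practice. The broken markup is a made-up sample. It loads HTML with unclosed tags through libxml's HTML parser and then runs an XPath query on the repaired tree:

```php
<?php
// Hypothetical broken markup: unclosed <p> tags, no html or body wrapper
$html = '<div><p>First<p>Second</div>';

libxml_use_internal_errors(TRUE); // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);            // uses libxml's HTML parser
libxml_clear_errors();

$xPath = new DOMXPath($dom);
$paragraphs = $xPath->query('//p');
echo $paragraphs->length, "\n";
foreach ($paragraphs as $p) {
    echo trim($p->textContent), "\n";
}
```

How exactly the parser repairs a given piece of markup depends on the libxml version, so inspect the resulting tree before relying on its shape.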

It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API then.

How to use the DOM extension has been covered extensively on Stack Overflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching or browsing Stack Overflow.

A basic usage example and a general conceptual overview are available in other answers.

XMLReader

The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.
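
The cursor idea can be sketched like this. The sample document is my own. The loop advances node by node and collects the text of each product element:

```php
<?php
// Hypothetical sample document
$xml = '<pages><page id="p1"><product>foo1</product><product>foo2</product></page></pages>';

$reader = new XMLReader;
$reader->XML($xml);

$products = array();
// read() moves the cursor to the next node until the stream ends
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'product') {
        $products[] = $reader->readString(); // text content of the current element
    }
}
$reader->close();

echo implode(',', $products), "\n";
```

Because only the current node is held at any time, this style suits large files better than loading the whole tree.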

XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so chances are using XMLReader for parsing broken HTML might be less robust than using DOM where you can explicitly tell it to use libxml's HTML Parser Module.

A basic usage example is available in another answer.

XML Parser

This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.

The XML Parser library is also based on libxml, and implements a SAX style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but will be more difficult to work with than the pull parser implemented by XMLReader.
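
A minimal sketch of the push style, with a made-up sample document and handler closures of my own. The parser calls the handlers as it meets start tags, end tags and character data, so you have to track state yourself:

```php
<?php
// Hypothetical sample document
$xml = '<pages><page id="p1"><product>foo1</product><product>foo2</product></page></pages>';

$products = array();
$inProduct = FALSE; // state we have to track ourselves

$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($parser, $name, $attributes) use (&$inProduct) {
        $inProduct = ($name === 'product'); // start tag
    },
    function ($parser, $name) use (&$inProduct) {
        $inProduct = FALSE;                 // end tag
    }
);
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$products, &$inProduct) {
        // Longer text may arrive in several chunks; this short sample does not
        if ($inProduct) {
            $products[] = $data;
        }
    }
);

xml_parse($parser, $xml, TRUE);
echo implode(',', $products), "\n";
```

The manual state tracking is what makes this harder to work with than XMLReader.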

SimpleXML

The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.

SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXML because it will choke.
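
A minimal sketch of the property and array access, using a well-formed sample of my own:

```php
<?php
// Hypothetical well-formed sample
$pages = simplexml_load_string(
    '<pages><page id="p2"><product>bar1</product><product>bar2</product></page></pages>'
);

$products = array();
foreach ($pages->page as $page) {
    echo (string) $page['id'], "\n";     // attributes via array access
    foreach ($page->product as $product) {
        $products[] = (string) $product; // child elements via properties
    }
}
echo implode(',', $products), "\n";
```

The string casts matter: without them you are holding SimpleXMLElement objects, not strings.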

A basic usage example is available, and there are lots of additional examples in the PHP Manual.



3rd Party Libraries (libxml based)

If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing.

FluentDom

FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current versions extend the DOM implementing standard interfaces and add features from the DOM Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and others. Can be installed via Composer.

HtmlPageDom

Wa72\HtmlPageDom is a PHP library for easy manipulation of HTML documents using DOM. It requires DomCrawler from Symfony2 components for traversing the DOM tree and extends it by adding methods for manipulating the DOM tree of HTML documents.

phpQuery

phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library.
The library is written in PHP5 and provides additional Command Line Interface (CLI).

This is described as "abandonware and buggy: use at your own risk" but does appear to be minimally maintained.

laminas-dom

The Laminas\Dom component (formerly Zend_DOM) provides tools for working with DOM documents and structures. Currently, we offer Laminas\Dom\Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.

This package is considered feature-complete, and is now in security-only maintenance mode.

fDOMDocument

fDOMDocument extends the standard DOM to use exceptions at all occasions of errors instead of PHP warnings or notices. They also add various custom methods and shortcuts for convenience and to simplify the usage of DOM.

sabre/xml

sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple "xml to object/array" mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require low memory on large xml files.

FluidXML

FluidXML is a PHP library for manipulating XML with a concise and fluent API.
It leverages XPath and the fluent programming pattern to be fun and effective.



3rd-Party (not libxml-based)

The benefit of building upon DOM/libxml is that you get good performance out of the box because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them are listed below.

PHP Simple HTML DOM Parser

  • An HTML DOM parser written in PHP5+ lets you manipulate HTML in a very easy way!
  • Require PHP 5+.
  • Supports invalid HTML.
  • Find tags on an HTML page with selectors just like jQuery.
  • Extract contents from HTML in a single line.

I generally do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Not all jQuery Selectors (such as child selectors) are possible. Any of the libxml based libraries should outperform this easily.

PHP Html Parser

PHPHtmlParser is a simple, flexible, html parser which allows you to select tags using any css selector, like jQuery. The goal is to assiste in the development of tools which require a quick, easy way to scrape html, whether it's valid or not! This project was original supported by sunra/php-simple-html-dom-parser but the support seems to have stopped so this project is my adaptation of his previous work.

Again, I would not recommend this parser. It is rather slow with high CPU usage. There is also no function to clear memory of created DOM objects. These problems scale particularly with nested loops. The documentation itself is inaccurate and misspelled, with no responses to fixes since 14 Apr 16.



HTML 5

You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you may want to consider using a dedicated parser. Note that these are written in PHP, so suffer from slower performance and increased memory usage compared to a compiled extension in a lower-level language.

HTML5DomDocument

HTML5DOMDocument extends the native DOMDocument library. It fixes some bugs and adds some new functionality.

  • Preserves html entities (DOMDocument does not)
  • Preserves void tags (DOMDocument does not)
  • Allows inserting HTML code that moves the correct parts to their proper places (head elements are inserted in the head, body elements in the body)
  • Allows querying the DOM with CSS selectors (currently available: *, tagname, tagname#id, #id, tagname.classname, .classname, tagname.classname.classname2, .classname.classname2, tagname[attribute-selector], [attribute-selector], div, p, div p, div > p, div + p, and p ~ ul.)
  • Adds support for element->classList.
  • Adds support for element->innerHTML.
  • Adds support for element->outerHTML.

HTML5

HTML5 is a standards-compliant HTML5 parser and writer written entirely in PHP. It is stable and used in many production websites, and has well over five million downloads.

HTML5 provides the following features.

  • An HTML5 serializer
  • Support for PHP namespaces
  • Composer support
  • Event-based (SAX-like) parser
  • A DOM tree builder
  • Interoperability with QueryPath
  • Runs on PHP 5.3.0 or newer


Regular Expressions

Last and least recommended, you can extract data from HTML with regular expressions. In general using Regular Expressions on HTML is discouraged.

Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for a very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding or changing attributes in a tag, can make the RegEx fail when it's not properly written. You should know what you are doing before using RegEx on HTML.
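
To illustrate the brittleness, here is a small sketch with made-up markup and a made-up pattern of the kind you find in such snippets. The same link is written two ways. The RegEx only catches one of them, while DOM finds both:

```php
<?php
// Hypothetical markup: two links with different attribute order and quoting
$html = '<a href="/one" class="nav">One</a> <a class="nav" href=\'/two\'>Two</a>';

// A typical snippet: assumes href comes first and uses double quotes
preg_match_all('/<a href="([^"]*)"/', $html, $matches);
echo count($matches[1]), "\n"; // only the first link matches

// The DOM parser is not affected by attribute order or quoting style
libxml_use_internal_errors(TRUE);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$hrefs = array();
foreach ($dom->getElementsByTagName('a') as $a) {
    $hrefs[] = $a->getAttribute('href');
}
echo implode(',', $hrefs), "\n";
```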

HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught those rules anew for each RegEx you write. RegEx are fine in some cases, but it really depends on your use case.

You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job of this.

Also see Parsing Html The Cthulhu Way



Books

If you want to spend some money, have a look at

  • PHP Architect's Guide to Webscraping with PHP

I am not affiliated with PHP Architect or the authors.

How can I parse XML while keeping the original order of the word in php?

With DOMDocument, you should be able to easily get the value you want. Check out this example:

$xmlString = '<root>
<span class="full_collocation">
the<strong class="tilde">Bank</strong> for International Settlements
</span>
<span class="full_collocation">
[<span class="or"><acronym title="or">or</acronym></span> BIZ]
</span>
</root>';

$dom = new DOMDocument();
$dom->loadXML($xmlString);
foreach($dom->documentElement->childNodes as $childNode) {
    echo trim($childNode->textContent); // prints "theBank for International Settlements" and "[or BIZ]"
}

How to parse PCDATA and child element separately with PHP DOM?

DOMNode->nodeValue (which in PHP's DOMElement is the same as DOMNode->textContent) contains the complete text content of the node itself and all its descendant nodes. Or, to put it more simply: it contains the complete content of the node, but with all tags removed.

What you probably want to try is something like the following (untested):

if($level1->tagName == "p") {
    echo "<p>";
    // loop through all childNodes, not just noteref elements
    foreach($level1->childNodes as $childNode) {
        // you could also use if() statements here, of course
        switch($childNode->nodeType) {
            // if it's just text
            case XML_TEXT_NODE:
                echo $childNode->nodeValue;
                break;
            // if it's an element
            case XML_ELEMENT_NODE:
                echo "<span><b>".$childNode->nodeValue."</b></span>";
                break;
        }
    }
    echo "</p><br>";
}

Be aware though that this is still rather flimsy. For instance: if any other elements, besides <noteref> elements, show up in the <p> elements, they will also be wrapped in <span><b> elements.
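
One way to make it less flimsy is to check the element name as well. A minimal sketch, with a sample paragraph and an em element that I made up for illustration. It wraps only noteref elements and passes the text of any other element through unwrapped:

```php
<?php
// Hypothetical sample: a <p> containing a <noteref> and some other element
$dom = new DOMDocument;
$dom->loadXML('<p>Some text<noteref>1</noteref> and <em>more</em> text</p>');

$html = '<p>';
foreach ($dom->documentElement->childNodes as $child) {
    if ($child->nodeType === XML_TEXT_NODE) {
        // plain text is copied as-is (escaped for HTML output)
        $html .= htmlspecialchars($child->nodeValue);
    } elseif ($child->nodeType === XML_ELEMENT_NODE && $child->nodeName === 'noteref') {
        // only noteref elements get the highlight markup
        $html .= '<span><b>' . htmlspecialchars($child->textContent) . '</b></span>';
    } elseif ($child->nodeType === XML_ELEMENT_NODE) {
        // any other element: keep its text, drop the tag
        $html .= htmlspecialchars($child->textContent);
    }
}
$html .= '</p>';

echo $html, "\n";
```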

Hopefully I've at least given you a clue as to why your result <p> elements showed the contents of the child elements as well.


As a side note: if what you want to achieve is transform the contents of an XML document into HTML or perhaps some other XML structure, it might pay off to look into XSLT. Be aware though that the learning curve could be steep.

PHP XML DOM getElementById

A common workaround is to use XPath to get the element.

$xpath = new DOMXPath($dom); // $dom is your loaded DOMDocument
$item = $xpath->query('//item[@id="Flow_0"]')->item(0);

