What's the Difference Between PHP's DOM and SimpleXML Extensions

What's the difference between PHP's DOM and SimpleXML extensions?

In a nutshell:

SimpleXml

  • is for simple XML and/or simple use cases
  • has a limited API for working with nodes (e.g. you cannot program to an interface much)
  • all nodes are of the same kind (element node is the same as attribute node)
  • nodes are magically accessible, e.g. $root->foo->bar['attribute']

DOM

  • is for any XML use case you might have
  • is an implementation of the W3C DOM API (found implemented in many languages)
  • differentiates between various Node Types (more control)
  • much more verbose due to explicit API (can code to an interface)
  • can parse broken HTML
  • allows you to use PHP functions in XPath queries

Both of these are based on libxml and can be influenced to some extent by the libxml functions.
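To make the contrast concrete, here is a minimal sketch that reads the same attribute through both extensions and then uses the DOM-only ability to call a PHP function from XPath. The sample document and variable names are made up for illustration; strtoupper is just an arbitrary function to register.

```php
<?php
// Hypothetical sample document, shaped to match $root->foo->bar['attribute'].
$xml = '<root><foo><bar attribute="value">text</bar></foo></root>';

// SimpleXML: property and array syntax follow the document structure directly.
$root = simplexml_load_string($xml);
$viaSimpleXml = (string) $root->foo->bar['attribute'];

// DOM: the same lookup through explicit node objects and methods.
$dom = new DOMDocument();
$dom->loadXML($xml);
$bar = $dom->getElementsByTagName('bar')->item(0);
$viaDom = $bar->getAttribute('attribute');

// DOM only: expose a PHP function to XPath through the php: namespace.
$xpath = new DOMXPath($dom);
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPhpFunctions('strtoupper');
$upper = $xpath->evaluate("php:functionString('strtoupper', /root/foo/bar)");

echo $viaSimpleXml, "\n", $viaDom, "\n", $upper, "\n";
```

Both lookups return the same attribute value; the difference is only in how much of the document structure leaks into the calling code.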


Personally, I don't like SimpleXml too much. That's because I don't like the implicit access to the nodes, e.g. $foo->bar[1]->baz['attribute']. It ties the actual XML structure to the programming interface. The one-node-type-for-everything is also somewhat unintuitive because the behavior of the SimpleXmlElement magically changes depending on its contents.

For instance, when you have <foo bar="1"/> the object dump of /foo/@bar will be identical to that of /foo but doing an echo of them will print different results. Moreover, because both of them are SimpleXml elements, you can call the same methods on them, but they will only get applied when the SimpleXmlElement supports it, e.g. trying to do $el->addAttribute('foo', 'bar') on the first SimpleXmlElement will do nothing. Now of course it is correct that you cannot add an attribute to an Attribute Node, but the point is, an attribute node would not expose that method in the first place.
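A small sketch of the echo part of that point, using the same one-element document. The variable names are only illustrative, and the sketch deliberately leaves addAttribute out.

```php
<?php
$foo = simplexml_load_string('<foo bar="1"/>');

// Both XPath results come back as the same class ...
$element   = $foo->xpath('/foo')[0];
$attribute = $foo->xpath('/foo/@bar')[0];
var_dump($element instanceof SimpleXMLElement);
var_dump($attribute instanceof SimpleXMLElement);

// ... but casting to string (which is what echo does) gives different text.
$elementText   = (string) $element;   // text content of the foo element
$attributeText = (string) $attribute; // value of the bar attribute

var_dump($elementText, $attributeText);
```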

But that's just my 2c. Make up your own mind :)


On a side note, there are not just two parsers in PHP, but a couple more. SimpleXml and DOM are just the two that parse a document into a tree structure. The others are either pull-based or event-based parsers/readers/writers.
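As a rough illustration of the pull-based style, this sketch walks a small, made-up document with XMLReader and collects element names as the cursor moves forward. No tree is built; you only ever see the current node.

```php
<?php
// Hypothetical sample document.
$xml = '<catalog><book/><movie/><book/></catalog>';

$reader = new XMLReader();
$reader->XML($xml);

$names = [];
// read() advances the cursor to the next node and returns false at the end.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
        $names[] = $reader->name;
    }
}
$reader->close();

echo implode(', ', $names), "\n";
```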

Also see my answer to

  • Best XML Parser for PHP

What's the difference between the different XML parsing libraries in PHP5?

Just to clear up the confusion here: PHP has a number of XML libraries because PHP 4 didn't have very good options in that direction. From PHP 5 on, you have the choice between SimpleXml, DOM and the SAX-based Expat parser. The latter also existed in PHP 4. PHP 4 also had a DOM extension, which is not the same as PHP 5's.

DOM and SimpleXml are alternatives for the same problem domain: they load the document into memory and let you access it as a tree structure. DOM is a rather bulky API, but it's also very consistent and it's implemented in many languages, meaning that you can re-use your knowledge across languages (in JavaScript, for example). SimpleXml may be easier initially.

The SAX parser is a different beast. It treats an XML document as a stream of tags. This is useful if you are dealing with very large documents, since you don't need to hold the whole document in memory.
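A minimal sketch of that event-based style with PHP's xml_* functions: the parser calls your handlers for each start and end tag instead of handing you a tree. The document and the $seen array are made up for illustration.

```php
<?php
// Hypothetical sample document.
$xml = '<catalog><book><title>A</title></book><movie><title>B</title></movie></catalog>';

$seen = [];
$parser = xml_parser_create();

// Keep element names as written; case folding would uppercase them by default.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Called for every opening tag.
    function ($parser, $name, array $attributes) use (&$seen) {
        $seen[] = $name;
    },
    // Called for every closing tag; nothing to do in this sketch.
    function ($parser, $name) {
    }
);

// The third argument marks this chunk as the final one.
$ok = xml_parse($parser, $xml, true);

echo implode(', ', $seen), "\n";
```

For a large file you would feed xml_parse chunk by chunk instead of passing one string.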

For your usage, I would probably use the DOM api.

How to tell apart SimpleXML objects representing element and attribute?

There are no built-in properties in SimpleXMLElement which would allow you to tell these apart.

As others have suggested, dom_import_simplexml can be appropriate. However, that function sometimes changes the node on the fly: for example, if you pass in a list of child nodes or named child nodes, it reduces the list to its first element.

If it's an empty list, for example no attributes returned from attributes() or non-existing named childnodes, it will give a warning telling you an invalid nodetype has been given:

Warning: dom_import_simplexml(): Invalid Nodetype to import
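For the plain single-node case, the dom_import_simplexml route could look roughly like this. The helper name sxml_node_type is hypothetical, and the sketch assumes you pass exactly one element or one attribute, so the list caveats described here do not apply.

```php
<?php
$foo = simplexml_load_string('<foo bar="1"/>');

// Hypothetical helper, not part of SimpleXML: returns the DOM node type
// constant of the node behind a SimpleXMLElement.
function sxml_node_type(SimpleXMLElement $node): int
{
    return dom_import_simplexml($node)->nodeType;
}

$rootIsElement  = sxml_node_type($foo) === XML_ELEMENT_NODE;
$barIsAttribute = sxml_node_type($foo['bar']) === XML_ATTRIBUTE_NODE;

var_dump($rootIsElement, $barIsAttribute);
```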

So if you need a precise answer as a plain boolean true/false, here is how it works with SimpleXML alone:

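$isElement   = $element->xpath('.') == array($element);

// && rather than "and": "and" binds more loosely than "=", so only the first test would be assigned
$isAttribute = $element[0] == $element
    && $element->xpath('.') != array($element);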

It works similarly for attribute lists and element lists. You need some specific knowledge about what to evaluate for which case, so I blogged about it this morning and created a cheatsheet:

+------------------+---------------------------------------------+
| TYPE             | TEST                                        |
+------------------+---------------------------------------------+
| Element          | $element->xpath('.') == array($element)     |
+------------------+---------------------------------------------+
| Attribute        | $element[0] == $element                     |
|                  | and $element->xpath('.') != array($element) |
+------------------+---------------------------------------------+
| Attributes       | $element->attributes() === NULL             |
+------------------+---------------------------------------------+
| Elements         | $element[0] != $element                     |
|                  | and $element->attributes() !== NULL         |
+------------------+---------------------------------------------+
| Single           | $element[0] == $element                     |
+------------------+---------------------------------------------+
| Empty List       | $element[0] == NULL                         |
+------------------+---------------------------------------------+
| Document Element | $element->xpath('/*') == array($element)    |
+------------------+---------------------------------------------+
  • SimpleXML Type Cheatsheet (12 Feb 2013; by hakre)

PHP's SimpleXML doesn't keep order between different element types

"If I ... used either SimpleXMLIterator or SimpleXMLElement to parse it, I would end up with an array" - no you wouldn't, you would end up with an object, which happens to behave like an array in certain ways.

The output of a recursive dump of that object is not the same as the result of iterating over it.

In particular, running foreach( $some_node->children() as $child_node ) will give you all the children of a node in the order they appear in the document, regardless of name, as the code below shows.

Code:

$xml = <<<EOF
<catalog>
<book>
<title>Harry Potter and the Chamber of Secrets</title>
<author>J.K. Rowling</author>
</book>
<movie>
<title>The Dark Knight</title>
<director>Christopher Nolan</director>
</movie>
<book>
<title>Great Expectations</title>
<author>Charles Dickens</author>
</book>
<movie>
<title>Avatar</title>
<director>James Cameron</director>
</movie>
</catalog>
EOF;

$sx = simplexml_load_string($xml);
foreach ( $sx->children() as $node )
{
    echo $node->getName(), '<br />';
}

Output:

book
movie
book
movie

Forcing UTF8 Format with PHP's XMLReader, DOM and SimpleXML

Use the HTML Tidy library first to clean your string.

I'd also rather use DOMDocument than XMLReader.

Something like this:

$tidy = new Tidy;

// Clean-up options; output-xhtml asks Tidy for XHTML output
$config = array(
    'drop-font-tags'              => true,
    'drop-proprietary-attributes' => true,
    'hide-comments'               => true,
    'indent'                      => true,
    'logical-emphasis'            => true,
    'numeric-entities'            => true,
    'output-xhtml'                => true,
    'wrap'                        => 0
);

// Parse the input as UTF-8 and repair it
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();

$xml = $tidy->value; // the cleaned markup as a string

$dom = new DOMDocument;
$dom->loadXML($xml);

...

Output BR tag, using simpleXML

Firstly, PHP's SimpleXML extension works only with XML, not HTML. You're rightly mentioning XHTML in your setup code, but that means you need to use XML self-closing elements like <br /> not HTML unclosed tags like <br>.

Secondly, the addChild method takes text content as its second parameter, not raw document content; so as you've seen, it will automatically escape < and > for you.

SimpleXML is really designed around the kind of XML that's a strict tree of elements, rather than a markup language with elements interleaved with text content like XHTML, so this is probably a case where you're better off sticking to the DOM.

Even then, there's no equivalent of the JS "innerHTML" property, I'm afraid, so I believe you'll have to add the text and br element as separate nodes, e.g.

$body = $html->appendChild( $dom->createElement('body') );

$body->appendChild( $dom->createTextNode('hello') );
$body->appendChild( $dom->createElement('br') );
$body->appendChild( $dom->createTextNode('world') );
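For completeness, here is a self-contained version of that snippet you can run as-is. It assumes $dom is a fresh DOMDocument and $html is its root element, which the snippet above takes for granted; the $markup variable is only there to inspect the result.

```php
<?php
$dom  = new DOMDocument('1.0', 'UTF-8');
$html = $dom->appendChild($dom->createElement('html'));
$body = $html->appendChild($dom->createElement('body'));

// Text and the br element are added as three separate sibling nodes.
$body->appendChild($dom->createTextNode('hello'));
$body->appendChild($dom->createElement('br'));
$body->appendChild($dom->createTextNode('world'));

// Serialize only the body element to see the mixed content.
$markup = $dom->saveXML($body);
echo $markup, "\n";
```

The br element is serialized in its self-closing XML form, which is what XHTML needs.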

What is the DOM Core Level / Version Supported by PHP DOM?

The PHP DOM extension reports the Document Object Model (Core) Level 1 feature. You can test which features are implemented with DOMImplementation::hasFeature(), checking each feature against a list of versions. Here is a summary for five features:

  • One Core versions found: '1.0'.
  • Four XML versions found: '2.0'; '1.0'; ''; NULL.
  • Zero HTML versions found.
  • Zero XHTML versions found.
  • Zero XPath versions found.

This result, combined with the specs, is puzzling if not esoteric. The Core feature in Level 1.0 is required to return TRUE for a non-specified version as well (here: '' and NULL), but as the results show, it does not. So even though DOM Core Level 1 is announced as a feature, it is also broken.

Also, the XML feature cannot be Level 2.0 if the Core feature of Level 2.0 is not supported, and that is the case here: Core Level 2.0 is not a supported feature.

[Figure: Features in DOM (source)]

Example output of my script:

Core Feature is in PHP DOMDocument implementation:

1.) Core '3.0': FALSE
2.) Core '2.0': FALSE
3.) Core '1.0': TRUE
4.) Core '' : FALSE
5.) Core NULL : FALSE

One Core versions found: '1.0'.

XML Feature is in PHP DOMDocument implementation:

1.) XML '3.0': FALSE
2.) XML '2.0': TRUE
3.) XML '1.0': TRUE
4.) XML '' : TRUE
5.) XML NULL : TRUE

Four XML versions found: '2.0'; '1.0'; ''; NULL.

HTML Feature is in PHP DOMDocument implementation:

1.) HTML '3.0': FALSE
2.) HTML '2.0': FALSE
3.) HTML '1.0': FALSE
4.) HTML '' : FALSE
5.) HTML NULL : FALSE

Zero HTML versions found.

XHTML Feature is in PHP DOMDocument implementation:

1.) XHTML '3.0': FALSE
2.) XHTML '2.0': FALSE
3.) XHTML '1.0': FALSE
4.) XHTML '' : FALSE
5.) XHTML NULL : FALSE

Zero XHTML versions found.

XPath Feature is in PHP DOMDocument implementation:

1.) XPath '3.0': FALSE
2.) XPath '2.0': FALSE
3.) XPath '1.0': FALSE
4.) XPath '' : FALSE
5.) XPath NULL : FALSE

Zero XPath versions found.

Example script:

<?php
/**
 * What is the DOM Core Version Supported by PHP DOM?
 * @link http://stackoverflow.com/a/17340953/367456
 */

$dom = new DOMDocument();
$dom->loadXML('<root/>');

$versionsArray = ['3.0', '2.0', '1.0', '', NULL];
$features = [
    # Document Object Model (DOM) <http://www.w3.org/DOM/DOMTR>
    'Core' => $versionsArray,

    # Document Object Model (DOM) <http://www.w3.org/DOM/DOMTR>
    'XML' => $versionsArray,

    # Document Object Model (DOM) Level 2 HTML Specification <http://www.w3.org/TR/DOM-Level-2-HTML/>
    'HTML' => $versionsArray,
    'XHTML' => $versionsArray,

    # Document Object Model XPath <http://www.w3.org/TR/DOM-Level-3-XPath/xpath.html>
    "XPath" => $versionsArray,
];

// Bit flags selecting which parts of the report to print
const DISPLAY_TITLE = 1;
const DISPLAY_DETAILS = 2;
const DISPLAY_SUMMARY = 4;
const DISPLAY_ALL = 7;

dom_list_features($dom, $features);

function dom_list_features(DOMDocument $dom, array $features, $display = DISPLAY_ALL) {

    foreach ($features as $feature => $versions) {
        dom_list_feature($dom, $feature, $versions, $display);
    }
}

function dom_list_feature(DOMDocument $dom, $feature, array $versions, $display) {

    if ($display & DISPLAY_TITLE) {
        echo "$feature Feature is in PHP DOMDocument implementation:\n\n";
    }

    $found = [];

    // Ask the implementation about every version and remember the supported ones
    foreach ($versions as $i => $version) {
        $result = $dom->implementation->hasFeature($feature, $version);
        if ($result) {
            $found[] = $version;
        }

        if ($display & DISPLAY_DETAILS) {
            printf(" %d.) $feature %' -5s: %s\n", $i + 1, var_export($version, true), $result ? 'TRUE' : 'FALSE');
        }
    }

    if ($display & DISPLAY_DETAILS) {
        echo "\n";
    }

    // Spell out the count ("One", "Four", ...); requires the intl extension
    $formatter = new NumberFormatter('en_GB', NumberFormatter::SPELLOUT);
    $count = ucfirst($formatter->format(count($found)));
    $found = array_map(function ($v) {
        return var_export($v, TRUE);
    }, $found);

    if ($display & DISPLAY_SUMMARY) {
        printf("%s %s versions found%s.\n\n", $count, $feature, $found ? ': ' . implode('; ', $found) : '');
    }
}

Accessing processing instructions with PHP's SimpleXML

The problem is that <?php ... ?> is parsed as an XML processing instruction, not as text or an element, so SimpleXML does not give you its contents. You'd need to wrap it in CDATA first:

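$xml = file_get_contents('myxmlfile.xml');
// Wrap each PHP block in CDATA. A plain str_replace on the closing marker
// would also alter the XML declaration at the top of the file.
$xml = preg_replace('/<\?php.*?\?>/s', '<![CDATA[$0]]>', $xml);
$xml = simplexml_load_string($xml, "SimpleXMLElement", LIBXML_NOCDATA);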

I'm not entirely sure this would work, but I think it will. Test it out...
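If rewriting the source text feels too fragile, a different route (not the one above) is to leave the file alone and read the processing instructions through DOM, which exposes them as their own node type. This is only a sketch with a made-up inline document; in practice you would load your file instead.

```php
<?php
// Hypothetical sample document containing one PHP processing instruction.
$xml = '<root><item><?php echo "hi"; ?></item></root>';

$dom = new DOMDocument();
$dom->loadXML($xml);

$xpath = new DOMXPath($dom);
$pis = [];
// Select every processing instruction whose target is "php".
foreach ($xpath->query('//processing-instruction("php")') as $pi) {
    // target is the name after the opening marker, data is the rest.
    $pis[] = ['target' => $pi->target, 'data' => trim($pi->data)];
}

print_r($pis);
```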

DOM extension on PHP 7.2 Amazon Linux 2

After two days of digging, it turned out that restarting php-fpm resolves the issue; I had been restarting only httpd all along.

sudo systemctl restart php-fpm

