Why Doesn't var_dump Work with DOMDocument Objects, While print($dom->saveHTML()) Does

Why doesn't var_dump work with DOMDocument objects, while print($dom->saveHTML()) does?

Update: As of PHP 5.4.1 you can finally var_dump DOM objects. See https://gist.github.com/2499678


It's a bug:

  • https://bugs.php.net/bug.php?id=48527

Can't print_r domDocument

When you create a DOMDocument instance, you have a PHP object. The DOM classes do not implement a useful __toString method, so printing the object directly will not give you its markup.

To get HTML from a DOMDocument instance, you'll need to use saveHTML:

print_r($dom->saveHTML());

Note that your question suggests you are actually looking at a collection of elements (a DOMNodeList) rather than an actual DOMDocument instance. Depending on your code, you'll need to extract the markup for each of these individually:

foreach ($elements as $el) {
    print_r($dom->saveHTML($el)); // use saveXML if you are using a version before 5.3.6
}
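
A minimal self-contained sketch of the loop above, assuming a small inline HTML string (the markup and the $fragments variable are made up for illustration, they are not from the question):

```php
<?php
// Hypothetical sample markup; any HTML string would do.
$dom = new DOMDocument();
$dom->loadHTML('<html><body><p>one</p><p>two</p></body></html>');

// getElementsByTagName() returns a DOMNodeList, not a string.
$elements = $dom->getElementsByTagName('p');

$fragments = [];
foreach ($elements as $el) {
    // Serialize each node on its own (the node argument needs PHP 5.3.6+).
    $fragments[] = $dom->saveHTML($el);
}

print_r($fragments);
```

Each entry in $fragments is the markup of one matched element, which is usually what you wanted to see when print_r on the list showed nothing useful.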

Why does PHP treat this DOM object like an array?

childNodes is a property of DOMNodeList type. var_dump shows nothing about it simply because var_dump shows only those class properties that the class's developers declared by calling C functions such as

ZEND_API int zend_declare_property(...)
ZEND_API int zend_declare_property_null(...)
ZEND_API int zend_declare_property_bool(...)
ZEND_API int zend_declare_property_long(...)
ZEND_API int zend_declare_property_double(...)
ZEND_API int zend_declare_property_string(...)
ZEND_API int zend_declare_property_stringl(...)

Source: answer by akond: Why doesn't var_dump work with DomDocument objects, while print($dom->saveHTML()) does?

That is, the developers of the DOM extension chose not to expose the structure of the DOMNodeList class.

The reason you can iterate through a DOMNodeList is that it implements the Traversable interface, which signals that the class can be iterated with foreach.
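
To see this for yourself, here is a small sketch (the XML string is invented for the example) that checks the interface and then iterates the list:

```php
<?php
$dom = new DOMDocument();
// Made-up document with three child elements and no whitespace nodes.
$dom->loadXML('<root><a/><b/><c/></root>');

$children = $dom->documentElement->childNodes;

// DOMNodeList is an object implementing Traversable, not a PHP array.
$isTraversable = $children instanceof Traversable;
$isArray = is_array($children);

$names = [];
foreach ($children as $child) {
    $names[] = $child->nodeName;
}

var_dump($isTraversable, $isArray, $children->length);
echo implode(',', $names), PHP_EOL;
```

So foreach works because of the interface, not because the object is secretly an array.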

Problem with iterating through the DOM using DOMDocument

I think the main problem is that getElementsByTagName() returns a list of nodes (actually a DOMNodeList), not a single element. So when you want to access (for example) the first item for that tag, you need to pick the first item out of that list, much as you would with an array.

If you extended your initial code to get the nested tag elements, you could end up with the following code, which always uses [0] on the result of getElementsByTagName() to pick out the first item.

$title = $doc->getElementById('info')->childNodes->item(1)->nodeValue;
$volume_list = $doc->getElementById('list')->getElementsByTagName('dl');

$a = $volume_list[0]->getElementsByTagName('dd')[0]->getElementsByTagName('a');

echo $a[0]->getAttribute('href');
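
If array-style access is not available in your PHP version, or you want to guard against missing elements, the same lookup can be sketched with item(), which returns null for an index that does not exist. The markup below is hypothetical, shaped roughly like the page in the question:

```php
<?php
// Hypothetical markup; the real page will differ.
$html = '<div id="list"><dl><dd><a href="/vol/1">Volume 1</a></dd></dl></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// item(0) returns the first node, or null when the list is empty.
$dl = $doc->getElementsByTagName('dl')->item(0);
$dd = $dl ? $dl->getElementsByTagName('dd')->item(0) : null;
$a  = $dd ? $dd->getElementsByTagName('a')->item(0) : null;

$href = $a ? $a->getAttribute('href') : null;

// A tag that is not in the document yields null instead of an error.
$missing = $doc->getElementsByTagName('table')->item(0);

var_dump($href, $missing);
```

Checking each step for null avoids a fatal "call to a member function on null" when the page structure is not what you expected.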

DOMDocument->saveHTML isn't working

My friend, this is not how it works.

You should have your edited HTML in the result of saveHTML() so:

$editedHtml = $dom->saveHTML();
var_dump($editedHtml);

Now you should see your changed HTML.

The explanation is that $page is a completely different object that has nothing to do with the $dom object.
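
A quick sketch of that point, assuming $page is the original HTML string (the markup here is invented):

```php
<?php
// Hypothetical input; $page stands in for the original HTML.
$page = '<html><body><p id="msg">old</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($page); // $dom now holds its own parsed copy

// Edit the parsed document, not the string.
$dom->getElementsByTagName('p')->item(0)->nodeValue = 'new';

$editedHtml = $dom->saveHTML();

var_dump(strpos($page, 'new') !== false);       // is the edit in $page?
var_dump(strpos($editedHtml, 'new') !== false); // is the edit in the saveHTML() result?
```

The edit only shows up in what saveHTML() returns; $page keeps its original contents.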

Cheers!

Why am I not getting back any images here?

It appears $html's contents stop at the <body> tag for this page. Any idea why?

Yes, you must provide this page with a valid user agent.

$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
curl_exec($ch);

This outputs everything up to the closing </html>, including your requested <img border="0" width="336" height="69" src="/images/w3schoolslogo.gif" alt="W3Schools.com" style="margin-top:5px;" />

A simple wget or curl without the user agent, by contrast, returns only up to the <body> tag.

$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);

$doc = new DOMDocument();
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
$images = $xml->xpath('//img');

var_dump($images);
die();

EDIT: My first post stated that there was still an issue with xpath... I was just not doing my due diligence, and the updated code above works great. I forgot to force curl to return the output as a string rather than print it to the screen (as it does by default).
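
If you want to try the parsing half without hitting the network, here is a sketch that assumes $html already holds the fetched page. The inline string and the $sources variable are stand-ins I made up:

```php
<?php
// Stand-in for the string curl_exec() returns with CURLOPT_RETURNTRANSFER set.
$html = '<html><body><img src="/images/logo.gif" alt="Logo"><p>text</p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Hand the parsed document to SimpleXML and query it with XPath.
$xml = simplexml_import_dom($doc);
$images = $xml->xpath('//img');

$sources = [];
foreach ($images as $img) {
    // Attributes are read with array syntax and cast to string.
    $sources[] = (string) $img['src'];
}

print_r($sources);
```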

HTML DOMNodelist?

Some DOMDocument debugging hints.

If applicable, upgrade to the latest PHP 5.4, because it gives you more information from var_dump for DOMDocument and friends.

I'll take your small example and add some hints on how to debug the code:

$dom = new DOMDocument;
$html = 'http://localhost/foo/index.php';
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node) {
    echo $dom->saveHtml($node), PHP_EOL;
}

Did the loading work? That is this line:

$dom->loadHTML($html);

You can take a look inside the document by outputting its content. If you display that in the browser, you need to look at the page source, or you can just escape the output with htmlspecialchars:

var_dump(htmlspecialchars($dom->saveHTML()));

This will show you the document as loaded, verbatim as HTML, inside your browser.

The next part you might want to debug is the result of getElementsByTagName:

foreach ($dom->getElementsByTagName('a') as $node) {

First assign it to a variable, then check that it is not NULL or FALSE and look at its length property:

$aTags = $dom->getElementsByTagName('a');
var_dump($aTags, $aTags->length);

The length will tell you how many elements were matched.

Example/Demo:

<?php

$dom = new DOMDocument;
$html = 'http://localhost/foo/index.php';
$dom->loadHTML($html);
echo 'Document HTML loaded: ', var_dump($dom->saveHTML()), "\n";
$aTags = $dom->getElementsByTagName('a');
echo 'A Elements found: ', var_dump($aTags->length), "\n";
foreach ($aTags as $node) {
    echo $dom->saveHtml($node), "\n";
}

Output:

Document HTML loaded: string(171) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>http://localhost/foo/index.php</p></body></html>
"

A Elements found: int(0)
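
The output shows that the URL string itself was parsed as text, because loadHTML() expects markup, not an address. For contrast, here is a sketch that passes actual markup (the sample string is made up). To load from a URL or a path you would typically use loadHTMLFile() or fetch the content first.

```php
<?php
$dom = new DOMDocument();
// loadHTML() takes the markup itself; this sample string is invented.
$dom->loadHTML('<html><body><a href="/one">1</a> <a href="/two">2</a></body></html>');

$aTags = $dom->getElementsByTagName('a');
$count = $aTags->length; // length is a property, not a method

echo 'A Elements found: ', $count, "\n";
foreach ($aTags as $node) {
    echo $dom->saveHTML($node), "\n";
}
```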

Hope this is helpful.

PHP Dom XPath - Why isn't it working?

The reason you are not getting any results is that there are no <a> elements that satisfy both conditions.

These are the links containing "3499047" in @href:

<a href="showthread.php?s=9bc55ab5990282a5353fb20d505d577e&t=3499047" id="thread_title_3499047">Tesco misprices and discussion (Thread 12)</a>
<a href="showthread.php?s=9bc55ab5990282a5353fb20d505d577e&t=3499047">1</a>
<a href="showthread.php?s=9bc55ab5990282a5353fb20d505d577e&t=3499047&page=2">2</a>
<a href="showthread.php?s=9bc55ab5990282a5353fb20d505d577e&t=3499047&page=3">3</a>
<a href="showthread.php?s=9bc55ab5990282a5353fb20d505d577e&t=3499047&page=110">Last Page</a>
<a href="member.php?s=9bc55ab5990282a5353fb20d505d577e&find=lastposter&t=3499047" rel="nofollow">ExiledCockney</a>
<a href="misc.php?do=whoposted&t=3499047" onclick="who(3499047); return false;">2,184</a>
<a rel="shadowbox;width=732;height=527;player=iframe;" href="wow.php?t=3499047" target="_blank" style="display: block; width: 100%; height: 100%; cursor: pointer;">
<div style="width: 100%; height: 100%; background-image: url('http://images2.moneysavingexpert.com/images/forum_style_2/misc//wow_big_faint_grey.gif');">
<div style="padding: 12px 0px 0px 0px;">
<strong>3</strong>
</div>
</div>
</a>

As you can see, none of them contains "font-weight:bold" in a style attribute.

In case the markup on the page has elements with your desired combination when you view it in a browser, they might have been added via JavaScript. DOM will not run any JavaScript, so you have to check the markup as fetched with DOM.
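
For reference, a sketch of a query that combines both conditions, run against made-up markup in which exactly one link qualifies. The query is my guess at what the question intended, not a copy of it:

```php
<?php
// Invented markup: only the second link meets both conditions.
$html = '<html><body>'
      . '<a href="showthread.php?t=3499047">plain</a>'
      . '<a href="showthread.php?t=3499047" style="font-weight:bold">bold</a>'
      . '<a href="showthread.php?t=1" style="font-weight:bold">other</a>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Both predicates must hold for an <a> element to match.
$query = '//a[contains(@href, "3499047") and contains(@style, "font-weight:bold")]';
$matches = $xpath->query($query);

$texts = [];
foreach ($matches as $a) {
    $texts[] = $a->textContent;
}

var_dump($matches->length, $texts);
```

If the same query returns nothing on the real page, dump the fetched markup and check whether the style attribute is really there.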

PHP using DOM to get anchors and modify them

I did a similar thing not long ago. You can iterate over the DOMNodeList, read the href attribute of each anchor, and set a new value.

$dom = new DOMDocument;
$dom->loadHTML($content);
foreach ($dom->getElementsByTagName('a') as $node) {
    $original_url = $node->getAttribute('href');
    // Do something here
    $node->setAttribute('href', $var);
}
$html = $dom->saveHtml();
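
A runnable variant of the snippet above, assuming a small inline $content and a hypothetical rewrite rule (prefixing every link with http://example.com); swap in whatever your "do something" step is:

```php
<?php
// Made-up input standing in for $content.
$content = '<html><body><a href="/a">A</a><a href="/b">B</a></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($content);

foreach ($dom->getElementsByTagName('a') as $node) {
    $original_url = $node->getAttribute('href');
    // Hypothetical rule: turn site-relative links into absolute ones.
    $node->setAttribute('href', 'http://example.com' . $original_url);
}

// The modified markup only exists in what saveHTML() returns.
$html = $dom->saveHTML();
echo $html;
```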

