PHP DOMDocument - Get HTML Source of Body

PHP DOMDocument - get html source of BODY

In your case, you are not working with an HTML document but with an HTML fragment (a portion of HTML code), which means DOMDocument is not quite what you need.

Instead, I would use something like HTML Purifier. Quoting its site:

HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.

If you try your portion of code:

<div><p>Hello World

Using the demo page of HTML Purifier, you get this clean HTML as output:

<div><p>Hello World</p></div>

Much better, isn't it ? ;-)

(Note that HTML Purifier supports a wide range of options, so taking a look at its documentation might not hurt.)

Get entire BODY content using PHP DOM DOCUMENT

You can pass the body DOMElement to DOMDocument::saveHTML() (the node argument is available since PHP 5.3.6; DOMDocument::saveHTMLFile() only takes a filename), e.g.

<?php
$doc = new DOMDocument;
$doc->loadHTMLFile('http://stackoverflow.com');

$body = $doc->getElementsByTagName('body');
if ($body && 0 < $body->length) {
    $body = $body->item(0);
    echo $doc->saveHTML($body);
}

prints

Warning: DOMDocument::loadHTMLFile(): Unexpected end tag : p in http://stackoverflow.com, line: 2843 [...]
<body class="home-page">
<noscript><div id="noscript-padding"></div></noscript>
<div id="notify-container"></div>
<div id="overlay-header"></div>
<div id="custom-header"></div>
<div class="container">
<div id="header">
<div id="portalLink">
[...]
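
The warning at the top of that output comes from libxml complaining about invalid markup. Here is a minimal sketch of the same technique on an inline string, using libxml_use_internal_errors() to collect parser errors instead of printing them. The $html sample and the $bodyHtml variable are made up for illustration:

```php
<?php
// Hypothetical sample markup; the stray </p> is deliberately invalid.
$html = '<html><body class="home-page"><div id="header">Hi</p></div></body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadHTML($html);

// Discard the collected errors and restore the previous error handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Serialize only the <body> element (outer HTML, including the body tag itself).
$body = $doc->getElementsByTagName('body')->item(0);
$bodyHtml = $body !== null ? $doc->saveHTML($body) : '';

echo $bodyHtml, "\n";
```

If you need to know what the parser complained about, call libxml_get_errors() before libxml_clear_errors().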

Get contents of BODY without DOCTYPE, HTML, HEAD and BODY tags

Since the substr() method seemed to be too much for some to swallow, here is a DOM parser method:

$d = new DOMDocument;
$mock = new DOMDocument;
$d->loadHTML(file_get_contents('/path/to/my.html'));
$body = $d->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child) {
    $mock->appendChild($mock->importNode($child, true));
}

echo $mock->saveHTML();

http://codepad.org/MQVQ3XQP

Anybody wishing to see that "other one" can check the revisions.

PHP DOM function adding an extra p html and a body tags

The content you output is generated via DOMDocument's saveHTML method:

$content = $dom->saveHTML($root);

You reference the root node here, which is the documentElement, that is, the <html> element you do not want in the output. So choose the correct element to output instead, e.g. the body of that document.

$body = $doc->getElementsByTagName('body')->item(0);

$content = implode(
    "",
    array_map([$doc, 'saveHTML'], iterator_to_array($body->childNodes))
);

echo $content;

In your case, I think you want the first <p> element instead of the <body> element.

For some related cases, a different approach might be necessary. There is additional Q&A material on this site for that topic:

  • How to get innerHTML of DOMNode?
  • How to saveHTML of DOMDocument without HTML wrapper?
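
A minimal sketch of the per-child approach wrapped in a helper, applied to the first <p> element rather than the <body>. The function name innerHTML() and the sample markup are made up for illustration; they are not part of DOMDocument:

```php
<?php
// Hypothetical helper: serialize each child of $node and join the results,
// which yields the node's content without its own opening and closing tag.
function innerHTML(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument;
$doc->loadHTML('<html><body><p>Hello <b>World</b></p></body></html>');

// Pick the first <p> element instead of the <body>.
$p = $doc->getElementsByTagName('p')->item(0);

echo innerHTML($p), "\n"; // Hello <b>World</b>
```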

PHP DOMDocument: Get inner HTML of node

You need to have a root node to have a valid DOM document.

I suggest wrapping the fragment in a root <div>, so that an existing root element, if there is one, is not destroyed.

Finally, read the result from the root node's nodeValue, or strip the wrapper from the saved HTML with substr().

$body = "Some HTML with a <a href=\"http://stackoverflow.com\">http://stackoverflow.com</a>";
$body = '<div>'.$body.'</div>';

$dom = new DOMDocument;
$dom->loadHTML($body, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Iterate over a copy: replacing nodes while looping over the live node list can skip elements
foreach (iterator_to_array($dom->getElementsByTagName('a')) as $node) {
    $link_href = $node->getAttribute("href");
    $link_node = $dom->createTextNode($link_href);

    $node->parentNode->replaceChild($link_node, $node);
}

$html = $dom->saveHTML();
$html = substr($html, 5, -7); // strip the leading <div> and the trailing </div> plus newline
var_dump($html); // "Some HTML with a http://stackoverflow.com"

This also works if the input string is:

<p>Some HTML with a <a href="http://stackoverflow.com">http://stackoverflow.com</a></p>

The output is:

<p>Some HTML with a http://stackoverflow.com</p>
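
The substr() offsets depend on the exact wrapper tag and on the trailing newline that saveHTML() appends. As an alternative sketch (the variable names $fragment and $result are made up for illustration), you can serialize the children of the wrapper instead of trimming the string:

```php
<?php
// Sample input taken from the answer above.
$fragment = 'Some HTML with a <a href="http://stackoverflow.com">http://stackoverflow.com</a>';

$dom = new DOMDocument;
// Wrap the fragment in a <div> so the document has a single root node.
$dom->loadHTML('<div>' . $fragment . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Replace every link with a text node holding its href.
foreach (iterator_to_array($dom->getElementsByTagName('a')) as $a) {
    $a->parentNode->replaceChild($dom->createTextNode($a->getAttribute('href')), $a);
}

// Serialize the wrapper's children one by one, so the <div> itself never appears.
$result = '';
foreach ($dom->documentElement->childNodes as $child) {
    $result .= $dom->saveHTML($child);
}

echo $result, "\n"; // Some HTML with a http://stackoverflow.com
```

This keeps working if you later change the wrapper element, since nothing depends on its length.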

PHP DOMDocument without the DTD, head, and body tags?

Since PHP 5.3.6, you can pass a node to saveHTML(), as in echo $DOMDocument->saveHTML($the_node_you_want_to_show). Before that, I abused ->saveXML() with minor fixes. You must, however, either include one surrounding node in the output (e.g. the output is <div>...some content and nodes...</div>), or loop through the node's children if you don't want a surrounding tag:

$html = '';
foreach ($rootnode->childNodes as $node) {
    // Property names are case-sensitive: it is ownerDocument, not ownerdocument
    $html .= $rootnode->ownerDocument->saveHTML($node);
}

How to get html code of DOMElement node?

Use the optional argument to DOMDocument::saveHTML: this says "output this element only".

return $node->ownerDocument->saveHTML($node);

Note that the argument is only available from PHP 5.3.6. Before that, you need to use DOMDocument::saveXML instead. The results may be slightly different. Also, if you already have a reference to the document, you can just do this:

$doc->saveHTML($node);
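
To see how the results may differ, here is a small sketch that serializes the same node both ways. The sample markup and the variable names $asHtml and $asXml are made up for illustration:

```php
<?php
$doc = new DOMDocument;
$doc->loadHTML('<html><body><p>line one<br>line two</p></body></html>');

$node = $doc->getElementsByTagName('p')->item(0);

$asHtml = $doc->saveHTML($node); // HTML serialization of this node only
$asXml  = $doc->saveXML($node);  // XML serialization of the same node

echo $asHtml, "\n";
echo $asXml, "\n";
```

The HTML form writes the void element as <br>, while the XML form writes it self-closed as <br/>, which is the kind of difference you would have to fix up by hand on older PHP versions.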

