PHP DOMDocument - get html source of BODY
In your case, you do not want to work with an HTML document but with an HTML fragment (a portion of HTML code), which means DOMDocument is not quite what you need.
Instead, I would rather use something like HTMLPurifier (quoting) :
HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.
And, if you try your portion of code :
<div><p>Hello World
Using the demo page of HTMLPurifier, you get this clean HTML as an output :
<div><p>Hello World</p></div>
Much better, isn't it ? ;-)
(Note that HTMLPurifier supports a wide range of options, and that taking a look at its documentation might not hurt.)
Get entire BODY content using PHP DOM DOCUMENT
You can pass the body DOMElement to DOMDocument::saveHTML() (DOMDocument::saveHTMLFile() only accepts a filename and always writes the whole document), e.g.
<?php
$doc = new DOMDocument;
$doc->loadHTMLFile('http://stackoverflow.com');
$body = $doc->getElementsByTagName('body');
if ( $body && 0<$body->length ) {
    $body = $body->item(0);
    echo $doc->saveHTML($body);
}
prints
Warning: DOMDocument::loadHTMLFile(): Unexpected end tag : p in http://stackoverflow.com, line: 2843 [...]
<body class="home-page">
<noscript><div id="noscript-padding"></div></noscript>
<div id="notify-container"></div>
<div id="overlay-header"></div>
<div id="custom-header"></div>
<div class="container">
<div id="header">
<div id="portalLink">
[...]
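The warning in the output above comes from libxml complaining about malformed markup. A minimal sketch of one way to keep such messages out of the output is to collect them with libxml_use_internal_errors() before loading. The inline $html string below is a made-up stand-in for the remote page, not the real markup.

```php
<?php
// Hypothetical input: malformed markup with a stray closing </p> tag.
$html = '<html><head><title>t</title></head>'
      . '<body class="home-page"><div id="header">Hi</p></div></body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadHTML($html);

// Keep the collected errors for inspection, then reset libxml's error state.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

$body = $doc->getElementsByTagName('body')->item(0);
if ($body !== null) {
    echo $doc->saveHTML($body), "\n";
}
printf("%d parse problem(s) were collected silently\n", count($errors));
```

Whether a given piece of bad markup is reported at all depends on the libxml version, so the count printed at the end may vary.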
Get contents of BODY without DOCTYPE, HTML, HEAD and BODY tags
Since the substr() method seemed to be too much for some to swallow, here is a DOM parser method:
$d = new DOMDocument;
$mock = new DOMDocument;
$d->loadHTML(file_get_contents('/path/to/my.html'));
$body = $d->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child){
    $mock->appendChild($mock->importNode($child, true));
}
echo $mock->saveHTML();
http://codepad.org/MQVQ3XQP
Anybody who wishes to see that "other one" (the substr() version) can look at the revisions.
PHP DOM function adding an extra p html and a body tags
The content you output is generated via DOMDocument's saveHTML method:
$content = $dom->saveHTML($root);
You reference the root node here, which is the documentElement, that is, the <html> element you do not want to output. So choose the correct element to output, e.g. the body of that document.
$body = $doc->getElementsByTagName('body')->item(0);
$content = implode(
    "",
    array_map([$doc, 'saveHTML'], iterator_to_array($body->childNodes))
);
echo $content;
In your case, I think you would take the first <p> element instead of the <body> element.
For some related cases a different approach might be necessary. There is additional Q&A material on this site for that topic:
- How to get innerHTML of DOMNode?
- How to saveHTML of DOMDocument without HTML wrapper?
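The "inner HTML" idea from the snippets above can be wrapped in a small helper. This is only a sketch: innerHtml() is a hypothetical name, not a built-in DOM function, and it assumes the node belongs to a document (for a DOMDocument itself, ownerDocument is null).

```php
<?php
// Hypothetical helper: serialize the children of a node, not the node itself.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument needs PHP 5.3.6 or later.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument;
$doc->loadHTML('<p>Hello <b>World</b></p><p>Second</p>');
$body = $doc->getElementsByTagName('body')->item(0);

$out = innerHtml($body);
echo $out, "\n";
```

The result contains both paragraphs without the surrounding <body> tags. libxml may insert a line break between sibling block elements, so it is safer not to rely on the exact whitespace.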
PHP DOMDocument: Get inner HTML of node
A valid DOM document needs a single root node.
I suggest wrapping the fragment in a root <div> so that a possibly existing root element is not destroyed.
Finally, read the nodeValue of the root node, or strip the wrapper from the serialized output with substr().
$body = "Some HTML with a <a href=\"http://stackoverflow.com\">http://stackoverflow.com</a>";
$body = '<div>'.$body.'</div>';
$dom = new DOMDocument;
$dom->loadHTML($body, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
foreach ($dom->getElementsByTagName('a') as $node) {
    // HTML of the link's first child node (not used further below)
    $link_text = $node->ownerDocument->saveHTML($node->childNodes[0]);
    $link_href = $node->getAttribute("href");
    $link_node = $dom->createTextNode($link_href);
    $node->parentNode->replaceChild($link_node, $node);
}
// or, probably better, strip the wrapper from the serialized output:
$html = $dom->saveHTML();
$html = substr($html, 5, -7); // drop the leading "<div>" and the trailing "</div>" plus newline
var_dump($html); // "Some HTML with a http://stackoverflow.com"
This also works if the input string is:
<p>Some HTML with a <a href=\"http://stackoverflow.com\">http://stackoverflow.com</a></p>
outputs :
<p>Some HTML with a http://stackoverflow.com</p>
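As an alternative to cutting the wrapper off with substr(), the children of the wrapper can be serialized one by one. The following is a minimal sketch under a few assumptions: the id "fragment-root" and the variable names are made up for illustration, and the LIBXML_HTML_NOIMPLIED flag must be supported by the installed libxml.

```php
<?php
$fragment = '<p>Some HTML with a <a href="http://stackoverflow.com">link</a></p>';

$dom = new DOMDocument;
// The wrapper <div> (hypothetical id) acts as the single root node.
$dom->loadHTML(
    '<div id="fragment-root">' . $fragment . '</div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

// Copy the live node list to an array first: replacing nodes while
// iterating the live list directly can skip entries.
foreach (iterator_to_array($dom->getElementsByTagName('a')) as $a) {
    $text = $dom->createTextNode($a->getAttribute('href'));
    $a->parentNode->replaceChild($text, $a);
}

// Serialize the wrapper's children instead of trimming the string.
$html = '';
foreach ($dom->documentElement->childNodes as $child) {
    $html .= $dom->saveHTML($child);
}
echo $html, "\n";
```

This avoids hard-coding the wrapper's length, so the wrapper can carry attributes or be a different element without breaking the offsets.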
PHP DOMDocument without the DTD, head, and body tags?
Since PHP 5.3.6, you can pass a node to saveHTML(), as in echo $DOMDocument->saveHTML($the_node_you_want_to_show). Before that, I abused ->saveXML() with minor fixes. You must, however, have one surrounding node included in the output (e.g. the output is <div>...some content and nodes...</div>), or loop through the node's children if you don't want a surrounding tag:
$html = '';
foreach($rootnode->childNodes as $node){
    // property names are case-sensitive in PHP: ownerDocument, not ownerdocument
    $html .= $rootnode->ownerDocument->saveHTML($node);
}
How to get html code of DOMElement node?
Use the optional argument to DOMDocument::saveHTML: this says "output this element only".
return $node->ownerDocument->saveHTML($node);
Note that the argument is only available from PHP 5.3.6. Before that, you need to use DOMDocument::saveXML instead. The results may be slightly different. Also, if you already have a reference to the document, you can just do this:
$doc->saveHTML($node);
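To illustrate how the two serializers can differ, here is a small sketch that dumps the same node with saveHTML() and saveXML(). The markup and variable names are invented for the example.

```php
<?php
$doc = new DOMDocument;
$doc->loadHTML('<div id="box">line one<br>line two</div>');
$node = $doc->getElementsByTagName('div')->item(0);

$asHtml = $doc->saveHTML($node); // HTML serialization of this node only
$asXml  = $doc->saveXML($node);  // XML serialization of the same node

echo $asHtml, "\n";
echo $asXml, "\n";
```

The visible difference here is the void element: the HTML serializer writes <br>, while the XML serializer writes a self-closing <br/>.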