PHP DOMDocument errors/warnings on html5-tags
No, there is no way to specify a particular doctype for the parser to use, or to modify the requirements of the existing one.
Your best workable solution is to suppress the libxml warnings with libxml_use_internal_errors:
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML('...');
libxml_clear_errors();
PHP: DOMDocument loadHTML returns an error when using HTML5 tags
I've run into this issue with PHP's DOMDocument and XSL functions. You basically have to load the document as XML. That's the only way I got the <video> tag to work.
Update:
You can also try adding elements and entities to the <!DOCTYPE html5 > declaration, as long as $doc->resolveExternals = true.
DOMDocument::loadHTML error
header, nav and section are HTML5 elements. Because the HTML5 developers felt it was too difficult to remember Public and System Identifiers, the DocType declaration is just:
<!DOCTYPE html>
In other words, there is no DTD to check, which makes DOM fall back to the HTML4 Transitional DTD. That DTD doesn't contain those elements, hence the warnings.
To suppress the warnings, put
libxml_use_internal_errors(true);
before the call to loadHTML and
libxml_use_internal_errors(false);
after it.
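As a minimal sketch of the above (the sample markup and variable names are made up for illustration), this loads a small HTML5 page with internal error handling switched on, prints whatever libxml reports, and checks that the HTML5 elements still end up in the tree. Which messages appear, if any, depends on the libxml version PHP was built against.

```php
<?php
// Hypothetical sample markup that uses HTML5 elements.
$html = '<!DOCTYPE html><html><body>'
      . '<header><nav>Menu</nav></header><section>Text</section>'
      . '</body></html>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Fetch what was collected, then empty the buffer and restore the old setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

// The HTML5 elements are still part of the parsed document.
$navCount = $dom->getElementsByTagName('nav')->length;
echo 'nav elements found: ', $navCount, "\n";
```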
An alternative would be to use https://github.com/html5lib/html5lib-php.
Disable warnings when loading non-well-formed HTML by DomDocument (PHP)
You can install a temporary error handler with set_error_handler:
class ErrorTrap {
    protected $callback;
    protected $errors = array();

    function __construct($callback) {
        $this->callback = $callback;
    }

    // Runs the callback with a temporary error handler installed.
    function call() {
        set_error_handler(array($this, 'onError'));
        try {
            return call_user_func_array($this->callback, func_get_args());
        } finally {
            // Always restore the previous handler, even if the callback throws.
            restore_error_handler();
        }
    }

    // Records the error instead of letting PHP print it.
    function onError($errno, $errstr, $errfile, $errline) {
        $this->errors[] = array($errno, $errstr, $errfile, $errline);
        return true;
    }

    function ok() {
        return count($this->errors) === 0;
    }

    function errors() {
        return $this->errors;
    }
}
Usage:
// create a DOM document and load the HTML data
$xmlDoc = new DomDocument();
$caller = new ErrorTrap(array($xmlDoc, 'loadHTML'));
// this doesn't dump out any warnings
$caller->call($fetchResult);
if (!$caller->ok()) {
var_dump($caller->errors());
}
How to get HTML from file_get_content PHP then unminify it
This is the correct code:
$html = file_get_contents("https://www.emitennews.com/search/");
$dom = new \DOMDocument();
libxml_use_internal_errors(true);
$dom->preserveWhiteSpace = false;
$dom->loadHTML('<?xml encoding="UTF-8">' . $html,LIBXML_HTML_NOIMPLIED);
$dom->formatOutput = true;
print $dom->saveXML($dom->documentElement);
The problem is that the site uses HTML5, which DOMDocument's HTML parser complains about. So we need to add this line before loading the document:
libxml_use_internal_errors(true);
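For trying this out without a network request, here is a rough sketch that applies the same steps to a minified string. The markup is a made-up stand-in for the downloaded page, and the exact indentation of the output may vary with the libxml version.

```php
<?php
// Stand-in for fetched markup; the answer above downloads a live page instead.
$html = '<div><header><h1>News</h1></header><ul><li>One</li><li>Two</li></ul></div>';

// Keep HTML5-related parser messages out of the output.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
// The XML declaration prefix hints at UTF-8 input; NOIMPLIED avoids adding <html><body>.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html, LIBXML_HTML_NOIMPLIED);

libxml_clear_errors();
libxml_use_internal_errors($previous);

// Ask the serializer to indent, then dump only the root element.
$dom->formatOutput = true;
$pretty = $dom->saveXML($dom->documentElement);

echo $pretty, "\n";
```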
DOMDocument loadHTML doesn't work properly on a server
To disable the warning, you can use
libxml_use_internal_errors(true);
This works for me. See the PHP manual for details, and read on:
Background: You are loading invalid HTML. Invalid HTML is quite common; DOMDocument::loadHTML corrects most of the problems, but emits warnings by default.
With libxml_use_internal_errors you can control that behavior. Set it before loading the document:
$previously = libxml_use_internal_errors(true);
$doc->loadHTML($amazon);
Then after loading you can deal with the errors (if you want/need to):
/* @var LibXMLError[] $xmlErrors */
$xmlErrors = libxml_get_errors();
And finally clear them (as they will add up) and restore the previous setting if applicable:
unset($xmlErrors);
libxml_clear_errors();
libxml_use_internal_errors($previously);
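If you do this in more than one place, the three steps can be bundled into one function. The following is only a sketch: loadHtmlCollectingErrors is a hypothetical name, not a PHP built-in, and the sample markup is invented.

```php
<?php
/**
 * Hypothetical helper: loads HTML without emitting warnings and returns
 * the document together with the LibXMLError objects that were collected.
 */
function loadHtmlCollectingErrors(string $html): array
{
    // Remember the caller's setting so it can be restored afterwards.
    $previous = libxml_use_internal_errors(true);
    // Start from an empty buffer so earlier errors are not attributed to this load.
    libxml_clear_errors();

    try {
        $doc = new DOMDocument();
        $doc->loadHTML($html);
        $errors = libxml_get_errors();
    } finally {
        // Errors accumulate, so clear them and restore the previous setting.
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }

    return [$doc, $errors];
}

[$doc, $errors] = loadHtmlCollectingErrors(
    '<html><body><p>Hello</p><section>World</section></body></html>'
);

foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
echo $doc->getElementsByTagName('p')->length, " paragraph(s) parsed\n";
```

Restoring the previous setting inside finally means code that relies on the default warning behaviour elsewhere is not affected, even if loading throws.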
References
libxml_use_internal_errors: Disable libxml errors and allow user to fetch error information as needed
libxml_clear_errors: Clear libxml error buffer
libxml_get_errors: Retrieve array of errors
LibXMLError: The LibXMLError class
Stack Overflow answer to "DOMDocument PHP Memory Leak" (by Tak; Dec 2011)
getElementsByTagName not detecting SVG -- PHP
The question is easier to tease apart now that the offending HTML string is included, and easier still once the input text is spread over multiple lines. Doing so gives an input string of 17 lines. We can then use the warning messages and their line numbers to quickly identify the parts of the input that are hurting the parser.
'<div class="stage" id="shape_1">
<svg height="100" version="1.1" width="350" xmlns="http://www.w3.org/2000/svg" style="overflow: hidden; position: relative; left: -0.316681px; top: -0.650024px;">
<desc>Created with Raphaël 2.1.2</desc>
<defs/>
<rect x="75" y="25" width="200" height="50" r="0" rx="0" ry="0" fill="#90ee90" stroke="#000" style="fill-opacity: 0.5;" fill-opacity="0.5" stroke-width="0"/>
<path style="stroke-opacity: 0.8;" fill="none" stroke="#666666" d="M75,25L275,25" stroke-width="2" stroke-opacity="0.8"/>
<path style="stroke-opacity: 0.8;" fill="none" stroke="#666666" d="M275,25L275,75" stroke-width="2" stroke-opacity="0.8"/>
<path style="stroke-opacity: 0.8;" fill="none" stroke="#666666" d="M275,75L75,75" stroke-width="2" stroke-opacity="0.8"/>
<path style="stroke-opacity: 0.8;" fill="none" stroke="#666666" d="M75,75L75,25" stroke-width="2" stroke-opacity="0.8"/>
<text style="text-anchor: middle; font: 15px Arial;" x="175" y="85" text-anchor="middle" font="10px &quot;Arial&quot;" stroke="none" fill="#000000" transform="matrix(1,0,0,1,0,6.5)" font-family="Arial" font-size="15px" font-style="normal" font-weight="normal">
<tspan dy="5">x + 10 ft.</tspan>
</text>
<text style="text-anchor: end; font: 15px Arial;" x="65" y="50" text-anchor="middle" font="10px &quot;Arial&quot;" stroke="none" fill="#000000" font-family="Arial" font-size="15px" font-style="normal" font-weight="normal">
<tspan dy="5">x ft.</tspan>
</text>
</svg>
</div>';
Now, if we stop and think about it for a second, SVG isn't actually HTML; it's a dialect of XML. Same parent, but still not one and the same: when we use an SVG the way you have, we're embedding XML inside HTML, for lack of better terminology. With that in mind, it is hardly surprising that the 17 input lines produce the following warning messages. The elements mentioned are indeed not among the HTML elements the parser knows. (The tag name follows the word "Tag" in each message: svg, desc, defs, and so on.)
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 2 in C:\xampp2\htdocs\*redacted*\svg.php on line 23
Warning: DOMDocument::loadHTML(): Tag desc invalid in Entity, line: 3 in C:\xampp2\htdocs\*redacted*\svg.php on line 23
Warning: DOMDocument::loadHTML(): Tag defs invalid in Entity, line: 4 in C:\xampp2\htdocs\*redacted*\svg.php on line 23
Warning: DOMDocument::loadHTML(): Tag rect invalid in Entity, line: 5 in C:\xampp2\htdocs\*redacted*\svg.php on line 23
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 6 in C:\xampp2\htdocs\*redacted*\svg.php on line 23
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 7 in C:\xampp2\htdocs\*redacted*\svg.php on line 23
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 8 in C:\xampp2\htdocs\*redacted*\svg.php on line 23
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 9 in C:\xampp2\htdocs\*redacted*\svg.php on line 23
Warning: DOMDocument::loadHTML(): Tag text invalid in Entity, line: 10 in C:\xampp2\htdocs\*redacted*\svg.php on line 23
Warning: DOMDocument::loadHTML(): Tag tspan invalid in Entity, line: 11 in C:\xampp2\htdocs\*redacted*\svg.php on line 23
Warning: DOMDocument::loadHTML(): Tag text invalid in Entity, line: 13 in C:\xampp2\htdocs\*redacted*\svg.php on line 23
Warning: DOMDocument::loadHTML(): Tag tspan invalid in Entity, line: 14 in C:\xampp2\htdocs\*redacted*\svg.php on line 23
So, what to do? Simple. Rather than trying to load XML with the loadHTML method, simply use the loadXML method instead. Once done, the output becomes this:
Array ( [0] => DOMElement Object ( [tagName] => svg [schemaTypeInfo] => [nodeName] => svg [nodeValue] => Created with Raphaël 2.1.2 x + 10 ft. x ft. [nodeType] => 1 [parentNode] => (object value omitted) [childNodes] => (object value omitted) [firstChild] => (object value omitted) [lastChild] => (object value omitted) [previousSibling] => (object value omitted) [nextSibling] => (object value omitted) [attributes] => (object value omitted) [ownerDocument] => (object value omitted) [namespaceURI] => http://www.w3.org/2000/svg [prefix] => [localName] => svg [baseURI] => file:/C:/xampp2/htdocs/*redacted*/ [textContent] => Created with Raphaël 2.1.2 x + 10 ft. x ft. ) )
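A minimal sketch of the loadXML approach, using a shortened stand-in for the SVG above (the markup is trimmed for illustration and is not the original input). It also shows getElementsByTagNameNS, which may be the safer lookup when namespaces matter.

```php
<?php
// Shortened, well-formed stand-in for the markup in the question.
$xml = '<div class="stage" id="shape_1">'
     . '<svg xmlns="http://www.w3.org/2000/svg" width="350" height="100">'
     . '<desc>Demo</desc>'
     . '<rect x="75" y="25" width="200" height="50"/>'
     . '</svg>'
     . '</div>';

$dom = new DOMDocument();
// loadXML uses the XML parser, which has no fixed list of allowed tag names.
$dom->loadXML($xml);

// Lookup by tag name, as in the question.
$svgs = $dom->getElementsByTagName('svg');

// Namespace-aware lookup for elements in the SVG namespace.
$rects = $dom->getElementsByTagNameNS('http://www.w3.org/2000/svg', 'rect');

echo 'svg elements: ', $svgs->length, "\n";
echo 'rect elements: ', $rects->length, "\n";
echo 'namespace: ', $svgs->item(0)->namespaceURI, "\n";
```

Keep in mind that loadXML requires well-formed input, so this only works when the surrounding markup is valid XML as well.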
set tags in html using domdocument and preg_replace_callback
I don't think building a recursive function to walk the DOM is useful when you can use an XPath query. Also, I'm not sure that preg_replace_callback is a suitable function for this case. I prefer to use preg_split. Here is an example:
$html = 'this is just some example text.';
$terms = array(
'example'=>'explanation about example'
);
// sort by reverse order of key size
// (to be sure that the longest string always wins instead of the first in the pattern)
uksort($terms, function ($a, $b) {
$diff = mb_strlen($b) - mb_strlen($a);
return ($diff) ? $diff : strcmp($a, $b);
});
// build the pattern inside a capture group (to have delimiters in the results with the PREG_SPLIT_DELIM_CAPTURE option)
$pattern = '~\b(' . implode('|', array_map(function($i) { return preg_quote($i, '~'); }, array_keys($terms))) . ')\b~i';
// prevent any HTML parse errors from being displayed
$libxmlInternalErrors = libxml_use_internal_errors(true);
// determine whether the HTML string already has a root html element; if not, add a fake root.
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$fakeRootElement = false;
if ( $dom->documentElement->nodeName !== 'html' ) {
$dom->loadHTML("<div>$html</div>", LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$fakeRootElement = true;
}
libxml_use_internal_errors($libxmlInternalErrors);
// find all text nodes (not already included in a link or between other unwanted tags)
$xp = new DOMXPath($dom);
$textNodes = $xp->query('//text()[not(ancestor::a)][not(ancestor::style)][not(ancestor::script)]');
// replacement
foreach ($textNodes as $textNode) {
$parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
$fragment = $dom->createDocumentFragment();
foreach ($parts as $k=>$part) {
if ($k&1) {
$anchor = $dom->createElement('a', $part);
$anchor->setAttribute('class', 'text-info');
$anchor->setAttribute('data-toggle', 'tooltip');
$anchor->setAttribute('data-original-title', $terms[strtolower($part)]);
$fragment->appendChild($anchor);
} else {
$fragment->appendChild($dom->createTextNode($part));
}
}
$textNode->parentNode->replaceChild($fragment, $textNode);
}
// building of the result string
$result = '';
if ( $fakeRootElement ) {
foreach ($dom->documentElement->childNodes as $childNode) {
$result .= $dom->saveHTML($childNode);
}
} else {
$result = $dom->saveHTML();
}
echo $result;
Feel free to put that into one or more functions/methods, but keep in mind that this kind of editing has a non-negligible cost and should run each time the HTML is edited (and not each time the HTML is displayed).
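The `$k&1` test in the replacement loop relies on how preg_split behaves with PREG_SPLIT_DELIM_CAPTURE: since the whole pattern sits in one capture group, the pieces alternate between plain text (even indexes) and matched terms (odd indexes). A small standalone illustration, with a made-up pattern and input:

```php
<?php
// Two example terms wrapped in a single capture group, case-insensitive.
$pattern = '~\b(example|text)\b~i';

$parts = preg_split(
    $pattern,
    'this is just some Example text.',
    -1,
    PREG_SPLIT_DELIM_CAPTURE
);

foreach ($parts as $k => $part) {
    // Odd indexes hold the captured delimiters, even indexes the text between them.
    $kind = ($k & 1) ? 'term' : 'text';
    echo $k, ' [', $kind, '] "', $part, "\"\n";
}
```

Note that the matched term keeps its original casing ("Example"), which is why the code above lowercases it before looking it up in $terms.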