PHP Domdocument Errors/Warnings on Html5-Tags

PHP DOMDocument errors/warnings on html5-tags

No, there is no way of specifying a particular doctype to use, or to modify the requirements of the existing one.

Your best workable solution is to suppress libxml's warnings with libxml_use_internal_errors, which collects the errors internally instead of emitting them:

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML('...');
libxml_clear_errors();

PHP: DOMDocument loadHTML returns an error when using HTML5 tags

I've run into this issue with PHP's DOMDocument and XSL functions. You basically have to load the document as XML. That's the only way I got the <video> tag to work.

Update:
You can also try adding elements & entities to the <!DOCTYPE html5 > as long as $doc->resolveExternals = true.
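
As a rough sketch of the load-as-XML approach: the snippet below assumes the markup is well-formed XML (every tag closed, every attribute quoted), which real-world HTML often is not. The markup string and variable names are made up for illustration.

```php
<?php
// Hypothetical, well-formed markup containing an HTML5 element.
$markup = '<html><body><video src="movie.mp4" controls="controls"></video></body></html>';

$doc = new DOMDocument();
// loadXML() does not check element names against an HTML element list,
// so <video> is accepted like any other element.
$loaded = $doc->loadXML($markup);

$videos = $doc->getElementsByTagName('video');
$videoCount = $videos->length;
$videoSrc = $videos->item(0)->getAttribute('src');

echo $videoCount, ' video element(s), src=', $videoSrc, PHP_EOL;
```

If the input is not well-formed, loadXML() will fail outright rather than repair it, so this only fits markup you control.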

DOMDocument::loadHTML error

Header, Nav and Section are elements from HTML5. Because the HTML5 developers felt it was too difficult to remember Public and System Identifiers, the DocType declaration is just:

<!DOCTYPE html>

In other words, there is no DTD to check against, so DOM falls back to the HTML4 Transitional DTD, which doesn't contain those elements; hence the warnings.

To suppress the warnings, put

libxml_use_internal_errors(true);

before the call to loadHTML and

libxml_use_internal_errors(false);

after it.
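
A minimal sketch of the calls placed around loadHTML, assuming a made-up HTML5 input string. It also suggests that the HTML5 element still ends up in the tree; the warnings are about validation, not about the element being dropped.

```php
<?php
// Hypothetical HTML5 input; <nav> is not part of the HTML4 element set.
$html = '<!DOCTYPE html><html><body><nav id="main">Menu</nav></body></html>';

libxml_use_internal_errors(true);   // collect parser errors instead of emitting warnings
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();              // discard whatever was collected during parsing
libxml_use_internal_errors(false);  // back to the default behaviour

// The element is available in the tree, warning or not.
$nav = $dom->getElementsByTagName('nav')->item(0);
$navText = $nav->textContent;
$navId = $nav->getAttribute('id');

echo $navId, ': ', $navText, PHP_EOL;
```

Whether a "Tag nav invalid" message is produced at all may depend on the libxml2 version PHP was built against, so the sketch does not rely on it.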

An alternative would be to use https://github.com/html5lib/html5lib-php.

Disable warnings when loading non-well-formed HTML by DomDocument (PHP)

You can install a temporary error handler with set_error_handler:

class ErrorTrap {
  protected $callback;
  protected $errors = array();
  function __construct($callback) {
    $this->callback = $callback;
  }
  function call() {
    $result = null;
    set_error_handler(array($this, 'onError'));
    try {
      $result = call_user_func_array($this->callback, func_get_args());
    } finally {
      // always restore the previous handler, even if the callback throws
      restore_error_handler();
    }
    return $result;
  }
  function onError($errno, $errstr, $errfile, $errline) {
    $this->errors[] = array($errno, $errstr, $errfile, $errline);
  }
  function ok() {
    return count($this->errors) === 0;
  }
  function errors() {
    return $this->errors;
  }
}

Usage:

// create a DOM document and load the HTML data
$xmlDoc = new DomDocument();
$caller = new ErrorTrap(array($xmlDoc, 'loadHTML'));
// this doesn't dump out any warnings
$caller->call($fetchResult);
if (!$caller->ok()) {
  var_dump($caller->errors());
}

How to get HTML from file_get_content PHP then unminify it

This is the correct code:

$html = file_get_contents("https://www.emitennews.com/search/");
$dom = new \DOMDocument();
libxml_use_internal_errors(true);
$dom->preserveWhiteSpace = false;
$dom->loadHTML('<?xml encoding="UTF-8">' . $html, LIBXML_HTML_NOIMPLIED);
$dom->formatOutput = true;
print $dom->saveXML($dom->documentElement);

The problem is that the site uses HTML5, so we need to call:

libxml_use_internal_errors(true);

DOMDocument loadHTML doesn't work properly on a server

To disable the warning, you can use

libxml_use_internal_errors(true);

This works for me (see the PHP manual). Read on for details:

Background: You are loading invalid HTML. Invalid HTML is quite common; DOMDocument::loadHTML corrects most of the problems, but emits warnings by default.

With libxml_use_internal_errors you can control that behavior. Set it before loading the document:

$previously = libxml_use_internal_errors(true);
$doc->loadHTML($amazon);

Then after loading you can deal with the errors (if you want/need to):

/* @var LibXMLError[] $xmlErrors */
$xmlErrors = libxml_get_errors();

And finally clear them (as they will add up) and restore the previous setting if applicable:

unset($xmlErrors);
libxml_clear_errors();
libxml_use_internal_errors($previously);
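
The three steps above could be bundled into one helper, sketched below. The function name loadHtmlQuietly and the sample input are assumptions made for illustration, not part of any library.

```php
<?php
/**
 * Hypothetical helper: load HTML without emitting warnings and return
 * the document together with the libxml errors collected while parsing.
 *
 * @return array [DOMDocument, LibXMLError[]]
 */
function loadHtmlQuietly(string $html): array
{
    $previously = libxml_use_internal_errors(true);
    try {
        $doc = new DOMDocument();
        $doc->loadHTML($html);
        $errors = libxml_get_errors();
    } finally {
        // clear the buffer and restore the caller's setting, even on failure
        libxml_clear_errors();
        libxml_use_internal_errors($previously);
    }
    return [$doc, $errors];
}

// Sample input using an HTML5 element.
list($doc, $errors) = loadHtmlQuietly('<html><body><section><p>Hello</p></section></body></html>');

$firstParagraph = $doc->getElementsByTagName('p')->item(0)->textContent;
echo count($errors), ' parser message(s) collected', PHP_EOL;
echo $firstParagraph, PHP_EOL;
```

Restoring the previous setting rather than hard-coding false keeps the helper safe to call from code that has already enabled internal errors.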

References

libxml_use_internal_errors: Disable libxml errors and allow user to fetch error information as needed
libxml_clear_errors: Clear libxml error buffer
libxml_get_errors: Retrieve array of errors
LibXMLError: The LibXMLError class
Stack Overflow answer to "DOMDocument PHP Memory Leak" (by Tak; Dec 2011)

getElementsByTagName not detecting SVG -- PHP

The question is easier to pick apart now that the offending HTML string is included, and easier still once the input is spread over multiple lines. Doing so gives an input string of 17 lines, and the line numbers in the warning messages then point directly at the parts of the input that upset the parser.

'<div class="stage" id="shape_1">
    <svg height="100" version="1.1" width="350" xmlns="http://www.w3.org/2000/svg" style="overflow: hidden; position: relative; left: -0.316681px; top: -0.650024px;">
    <desc>Created with Raphaël 2.1.2</desc>
    <defs/>
    <rect x="75" y="25" width="200" height="50" r="0" rx="0" ry="0" fill="#90ee90" stroke="#000" style="fill-opacity: 0.5;" fill-opacity="0.5" stroke-width="0"/>
    <path style="stroke-opacity: 0.8;" fill="none" stroke="#666666" d="M75,25L275,25" stroke-width="2" stroke-opacity="0.8"/>
    <path style="stroke-opacity: 0.8;" fill="none" stroke="#666666" d="M275,25L275,75" stroke-width="2" stroke-opacity="0.8"/>
    <path style="stroke-opacity: 0.8;" fill="none" stroke="#666666" d="M275,75L75,75" stroke-width="2" stroke-opacity="0.8"/>
    <path style="stroke-opacity: 0.8;" fill="none" stroke="#666666" d="M75,75L75,25" stroke-width="2" stroke-opacity="0.8"/>
    <text style="text-anchor: middle; font: 15px Arial;" x="175" y="85" text-anchor="middle" font="10px &quot;Arial&quot;" stroke="none" fill="#000000" transform="matrix(1,0,0,1,0,6.5)" font-family="Arial" font-size="15px" font-style="normal" font-weight="normal">
        <tspan dy="5">x + 10 ft.</tspan>
    </text>
    <text style="text-anchor: end; font: 15px Arial;" x="65" y="50" text-anchor="middle" font="10px &quot;Arial&quot;" stroke="none" fill="#000000" font-family="Arial" font-size="15px" font-style="normal" font-weight="normal">
    <tspan dy="5">x ft.</tspan>
    </text>
    </svg>
</div>';

Strictly speaking, SVG isn't HTML: it is a dialect of XML. The two share a common ancestry but are not the same thing, so using an SVG this way means embedding XML inside HTML. With that in mind, it is hardly surprising that the 17 input lines produce the following warning messages. The tags mentioned are indeed not standard HTML elements (in each message, the tag name follows the word "Tag": svg, desc, defs, and so on).

Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 2 in C:\xampp2\htdocs\*redacted*\svg.php on line 23

Warning: DOMDocument::loadHTML(): Tag desc invalid in Entity, line: 3 in C:\xampp2\htdocs\*redacted*\svg.php on line 23

Warning: DOMDocument::loadHTML(): Tag defs invalid in Entity, line: 4 in C:\xampp2\htdocs\*redacted*\svg.php on line 23

Warning: DOMDocument::loadHTML(): Tag rect invalid in Entity, line: 5 in C:\xampp2\htdocs\*redacted*\svg.php on line 23

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 6 in C:\xampp2\htdocs\*redacted*\svg.php on line 23

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 7 in C:\xampp2\htdocs\*redacted*\svg.php on line 23

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 8 in C:\xampp2\htdocs\*redacted*\svg.php on line 23

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 9 in C:\xampp2\htdocs\*redacted*\svg.php on line 23

Warning: DOMDocument::loadHTML(): Tag text invalid in Entity, line: 10 in C:\xampp2\htdocs\*redacted*\svg.php on line 23

Warning: DOMDocument::loadHTML(): Tag tspan invalid in Entity, line: 11 in C:\xampp2\htdocs\*redacted*\svg.php on line 23

Warning: DOMDocument::loadHTML(): Tag text invalid in Entity, line: 13 in C:\xampp2\htdocs\*redacted*\svg.php on line 23

Warning: DOMDocument::loadHTML(): Tag tspan invalid in Entity, line: 14 in C:\xampp2\htdocs\*redacted*\svg.php on line 23

So, what to do? Simple. Rather than trying to load XML with the loadHTML method, simply use the loadXML method instead. Once done, the output becomes this:

Array ( [0] => DOMElement Object ( [tagName] => svg [schemaTypeInfo] => [nodeName] => svg [nodeValue] => Created with Raphaël 2.1.2 x + 10 ft. x ft. [nodeType] => 1 [parentNode] => (object value omitted) [childNodes] => (object value omitted) [firstChild] => (object value omitted) [lastChild] => (object value omitted) [previousSibling] => (object value omitted) [nextSibling] => (object value omitted) [attributes] => (object value omitted) [ownerDocument] => (object value omitted) [namespaceURI] => http://www.w3.org/2000/svg [prefix] => [localName] => svg [baseURI] => file:/C:/xampp2/htdocs/*redacted*/ [textContent] => Created with Raphaël 2.1.2 x + 10 ft. x ft. ) )
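
A minimal sketch of the loadXML approach, using a shortened, made-up version of the input above (the real string is longer). The variable names are assumptions for illustration.

```php
<?php
// Shortened, hypothetical version of the input: an SVG embedded in a <div>.
$input = '<div class="stage" id="shape_1">'
       . '<svg xmlns="http://www.w3.org/2000/svg" width="350" height="100">'
       . '<desc>Created with Raphael</desc>'
       . '<rect x="75" y="25" width="200" height="50"/>'
       . '</svg></div>';

$doc = new DOMDocument();
// XML parser: element names are not checked against an HTML element list.
$doc->loadXML($input);

$svgs = $doc->getElementsByTagName('svg');
$svgCount = $svgs->length;
$svgNamespace = $svgs->item(0)->namespaceURI;

// Namespace-aware lookup, matching only elements in the SVG namespace.
$rects = $doc->getElementsByTagNameNS('http://www.w3.org/2000/svg', 'rect');
$rectCount = $rects->length;

echo $svgCount, ' svg element(s) in namespace ', $svgNamespace, PHP_EOL;
echo $rectCount, ' rect element(s)', PHP_EOL;
```

This only works when the fragment has a single root element and is well-formed XML; a full HTML page with an inline SVG would still need loadHTML with the warnings suppressed.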

set tags in html using domdocument and preg_replace_callback

I don't think building a recursive function to walk the DOM is useful when you can use an XPath query. I'm also not sure that preg_replace_callback is the right function for this case; I prefer preg_split. Here is an example:

$html = 'this is just some example text.';

$terms = array(
   'example'=>'explanation about example'
);

// sort by reverse order of key size
// (to be sure that the longest string always wins instead of the first in the pattern)

uksort($terms, function ($a, $b) {
    $diff = mb_strlen($b) - mb_strlen($a);

    return ($diff) ? $diff : strcmp($a, $b);
});

// build the pattern inside a capture group (to have delimiters in the results with the PREG_SPLIT_DELIM_CAPTURE option)
$pattern = '~\b(' . implode('|', array_map(function($i) { return preg_quote($i, '~'); }, array_keys($terms))) . ')\b~i';

// prevent any HTML errors from being displayed
$libxmlInternalErrors = libxml_use_internal_errors(true);

// determine whether the HTML string already has a root html element; if not, add a fake root.
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$fakeRootElement = false;

if ( $dom->documentElement->nodeName !== 'html' ) {
    $dom->loadHTML("<div>$html</div>", LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
    $fakeRootElement = true;
}

libxml_use_internal_errors($libxmlInternalErrors);

// find all text nodes (not already included in a link or between other unwanted tags)
$xp = new DOMXPath($dom);
$textNodes = $xp->query('//text()[not(ancestor::a)][not(ancestor::style)][not(ancestor::script)]');

// replacement
foreach ($textNodes as $textNode) {
    $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
    $fragment = $dom->createDocumentFragment();
    foreach ($parts as $k=>$part) {
        if ($k&1) {
            $anchor = $dom->createElement('a', $part);
            $anchor->setAttribute('class', 'text-info');
            $anchor->setAttribute('data-toggle', 'tooltip');
            $anchor->setAttribute('data-original-title', $terms[strtolower($part)]);
            $fragment->appendChild($anchor);
        } else {
            $fragment->appendChild($dom->createTextNode($part));
        }
    }
    $textNode->parentNode->replaceChild($fragment, $textNode);
}


// building of the result string
$result = '';

if ( $fakeRootElement ) {
    foreach ($dom->documentElement->childNodes as $childNode) {
        $result .= $dom->saveHTML($childNode);
    }
} else {
    $result = $dom->saveHTML();
}

echo $result;


Feel free to put that into one or more functions/methods, but keep in mind that this kind of editing has a non-negligible cost, so it should be run each time the HTML is edited (and not each time the HTML is displayed).
