loadHTML with LIBXML_HTML_NOIMPLIED on an HTML Fragment Generates Incorrect Tags

PHP DOMDocument: element ending up within another

A DOMDocument has to have a single root element, so when a fragment contains several top-level elements, the parser moves all following siblings inside the first one.

The easiest way to address this is to wrap your content in a container tag, e.g.:

$content = '<div><figure class="image image-style-align-left">
<img src="https://placekitten.com/g/200/300"></figure>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</p></div>';
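As a rough illustration of the wrapper approach, the sketch below parses the wrapped fragment and checks where the paragraph ended up. The LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD flags and the error buffering are assumptions borrowed from the other answers on this page, not something this answer spells out.

```php
<?php
// Wrapped fragment from the answer above: the <div> is the single root.
$content = '<div><figure class="image image-style-align-left">
<img src="https://placekitten.com/g/200/300"></figure>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</p></div>';

$dom = new DOMDocument();
// Assumption: buffer parser warnings (e.g. about <figure>) instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
libxml_use_internal_errors(false);

$root = $dom->documentElement;
$paragraph = $dom->getElementsByTagName('p')->item(0);

// With the wrapper in place, the paragraph should be a direct child of the
// wrapping <div>, next to <figure> rather than inside it.
echo $root->nodeName, PHP_EOL;
echo $paragraph->parentNode->nodeName, PHP_EOL;
```

If the second line printed is the name of the wrapper rather than figure, the paragraph has stayed a sibling of the figure.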

PHP DOMDocument saveHTML not encoding cyrillic correctly

The problem is with $dom->saveHTML(): you need to pass the root node as a parameter, like this:

return $dom->saveHTML((new \DOMXPath($dom))->query('/')->item(0));

Then it suddenly renders the page differently, with substitution. If it does not, double-check the values of $dom->encoding and $dom->substituteEntities; they should read UTF-8 and TRUE.
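A minimal sketch of that call in context, assuming the input is UTF-8 and declares its charset in a meta tag (the sample markup and variable names here are illustrative, not from the original question):

```php
<?php
// Assumption: the input is UTF-8 and declares its charset, so the parser
// does not have to guess the encoding.
$html = '<html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    . '</head><body><p>Привет, мир</p></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

// Root node selected the same way as in the answer.
$root = (new DOMXPath($dom))->query('/')->item(0);
$fromRoot = $dom->saveHTML($root);

// The two properties the answer suggests checking.
var_dump($dom->encoding, $dom->substituteEntities);
echo $fromRoot;
```

Comparing this output with a plain $dom->saveHTML() call on your own input is a quick way to see whether passing the node changes how the Cyrillic text is serialized.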

PHP DOMDocument works incorrectly when the node is wrapped in a figure?

I was unable to reproduce your problem. My guess would be a misplaced element somewhere in your source HTML. But your code can be simplified quite a bit.

There's no need to put your image nodes into an array; you can work directly with the results of DOMDocument::getElementsByTagName().

As mentioned in the comments, you can set up DOMDocument::loadHTML() not to add the doctype and implied elements, instead of removing them later with potentially tricky string manipulation.

A simple DOMDocument::createElement() can be used for the element you want to append, instead of creating a new object.

Finally, the error control operator @ should generally be avoided. Instead, libxml_use_internal_errors() can be used to set the error behaviour. This allows you to examine error messages with libxml_get_errors() if desired.

$content = <<< HTML
<div class="content">
<a href="..."><img src=""></a>
<figure>
<a href="..."><img src=""></a>
<figcaption>Caption</figcaption>
</figure>
</div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_use_internal_errors(false);

// Append a <span> after each image, inside the image's parent element.
foreach ($dom->getElementsByTagName('img') as $node) {
    $node->parentNode->appendChild($dom->createElement("span", "11"));
}

$newHtml = $dom->saveHTML();
echo $newHtml;

Output:

<div class="content">
<a href="..."><img src=""><span>11</span></a>
<figure>
<a href="..."><img src=""><span>11</span></a>
<figcaption>Caption</figcaption>
</figure>
</div>
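The answer mentions libxml_get_errors() without showing it in use. The following is a minimal sketch of how the buffered errors could be inspected; parseFragment() is a hypothetical helper name, and whether any errors are reported for a given input depends on the libxml version installed.

```php
<?php
/**
 * Hypothetical helper: parse an HTML fragment and return the document
 * together with any libxml errors raised while parsing.
 */
function parseFragment(string $html): array
{
    // Buffer errors instead of emitting PHP warnings; remember the old setting.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    $errors = libxml_get_errors();          // array of LibXMLError objects
    libxml_clear_errors();                  // empty the buffer for the next parse
    libxml_use_internal_errors($previous);  // restore the earlier setting

    return [$dom, $errors];
}

[$dom, $errors] = parseFragment('<div class="content"><figure><img src=""></figure></div>');

foreach ($errors as $error) {
    // line, code and message are public properties of LibXMLError.
    printf("line %d, code %d: %s\n", $error->line, $error->code, trim($error->message));
}
```

Restoring the previous setting, rather than always passing false, keeps the helper from changing error handling for the rest of the script.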

How does one strip tags (and their content) from an HTML string using PHP's DOMDocument?

Based on Niet the Dark Absol's comment, my solution was to simply wrap my code snippet in a div and then use substr to remove the wrapper again. This seems like an acceptable workaround for working with valid inline HTML snippets (and not the entire DOM) via DOMDocument.

$html = '<a href="#">LINK1</a> - and <i>also</i> <a href="#">LINK2</a>';

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->validateOnParse = false;
$dom->resolveExternals = false;
$dom->substituteEntities = false;
$dom->loadHTML( '<div>'.$html.'</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );

$list = $dom->getElementsByTagName('a');
// The node list is live, so keep removing the first item until none are left.
while ($list->length > 0) {
    $p = $list->item(0);
    $p->parentNode->removeChild($p);
}

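// saveHTML() typically ends its output with a newline; trim it so the offsets
// match the wrapper exactly: '<div>' is 5 characters and '</div>' is 6.
$result = substr(trim($dom->saveHTML()), 5, -6);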
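An alternative to counting characters with substr is to serialize the wrapper's children one by one. The sketch below assumes the same wrapper-div setup; innerHtml() is a hypothetical helper, not a built-in DOMDocument method.

```php
<?php
/**
 * Hypothetical helper: concatenate the serialized children of a node,
 * i.e. its markup without the node's own opening and closing tags.
 */
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$html = '<a href="#">LINK1</a> - and <i>also</i> <a href="#">LINK2</a>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
libxml_use_internal_errors(false);

// Remove every <a> element together with its content.
$links = $dom->getElementsByTagName('a');
while ($links->length > 0) {
    $link = $links->item(0);
    $link->parentNode->removeChild($link);
}

// The wrapper <div> is the document element; output only what is inside it.
$result = innerHtml($dom->documentElement);
echo $result;
```

This avoids hard-coded offsets, so it keeps working if the wrapper tag is changed or given attributes.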

