PHP Encoding with DOMDocument

PHP DOMDocument loadHTML not encoding UTF-8 correctly

DOMDocument::loadHTML will treat your string as being in ISO-8859-1 (the HTTP/1.1 default character set) unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly.

If your string doesn't contain an XML encoding declaration, you can prepend one to cause the string to be treated as UTF-8:

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();
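One side effect of the prefix is that the declaration is only a hint for the parser and can stay behind in the document tree as a processing instruction node. The following is a minimal sketch, assuming the DOM extension is available and using a shortened sample string, that loads the markup with the hint, removes any top-level processing instruction, and checks that the text survives the round trip:

```php
<?php
$profile = '<p>イリノイ州シカゴにて</p>';

$dom = new DOMDocument();
// The prefix only tells the parser how to decode the input bytes.
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);

// Copy the child list first so nodes can be removed safely while looping.
foreach (iterator_to_array($dom->childNodes) as $node) {
    if ($node->nodeType === XML_PI_NODE) {
        $dom->removeChild($node); // drop the leftover encoding hint
    }
}

// Text read through the DOM API is always UTF-8.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
var_dump($text === 'イリノイ州シカゴにて');
```

If no processing instruction ends up in the tree on a given libxml build, the loop simply does nothing.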

If you cannot know if the string will contain such a declaration already, there's a workaround in SmartDOMDocument which should help you:

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
echo $dom->saveHTML();

This is not a great workaround, but since not all characters can be represented in ISO-8859-1 (like the Japanese characters above), it's the safest alternative.
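Note that as of PHP 8.2, passing 'HTML-ENTITIES' to mb_convert_encoding() raises a deprecation notice. One possible replacement, sketched here under the assumption that the mbstring extension is loaded, is mb_encode_numericentity() with a map covering every code point above ASCII. The sample string is shortened and the variable names are only illustrative:

```php
<?php
$profile = '<p>イリノイ州シカゴにて</p>';

// Convert every code point from U+0080 to U+10FFFF into a numeric reference.
// Map layout: [first code point, last code point, offset, mask].
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$ascii = mb_encode_numericentity($profile, $map, 'UTF-8');

$dom = new DOMDocument();
$dom->loadHTML($ascii); // input is now pure ASCII, so the assumed charset no longer matters

// The parser decodes the references; the DOM API returns UTF-8.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
var_dump($text === 'イリノイ州シカゴにて');
```

Because the markup characters themselves are ASCII, they pass through untouched and only the text content is rewritten.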

PHP encoding with DOMDocument

Try:

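$string = file_get_contents('your-xml-file.xml');
// convert to UTF-8; mb_detect_encoding() returns false when detection fails, so fall back to UTF-8
$string = mb_convert_encoding($string, 'utf-8', mb_detect_encoding($string) ?: 'utf-8');
// if you have not escaped entities use the next line; note that it can emit named
// HTML entities (e.g. &eacute;), which loadXML() rejects unless a DTD defines them
$string = mb_convert_encoding($string, 'html-entities', 'utf-8');
$doc = new DOMDocument();
$doc->loadXML($string);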
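A variant of the same idea, as a sketch only: detection is run in strict mode against an explicit candidate list, which tends to be more predictable than the default guess. The inline sample string and the two candidate encodings are assumptions made for illustration.

```php
<?php
// Sample input: Latin-1 bytes without an XML declaration (\xE9 is "é" in ISO-8859-1).
$string = "<root>caf\xE9</root>";

// Strict detection against known candidates; fall back to UTF-8 if nothing matches.
$from = mb_detect_encoding($string, ['UTF-8', 'ISO-8859-1'], true) ?: 'UTF-8';
$string = mb_convert_encoding($string, 'UTF-8', $from);

$doc = new DOMDocument();
$doc->loadXML($string); // XML without a declaration is parsed as UTF-8

$text = $doc->documentElement->textContent;
var_dump($from, $text);
```

The candidate order matters: UTF-8 is tried first, and the byte \xE9 followed by "<" is not valid UTF-8, so ISO-8859-1 is chosen here.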

UTF-8 with PHP DOMDocument loadHTML?

The DOM extension was built on libxml2, whose HTML parser was made for HTML 4, the default encoding for which is ISO-8859-1. Unless it encounters an appropriate meta tag or XML declaration stating otherwise, loadHTML() will assume the content is ISO-8859-1.

Specifying the encoding when creating the DOMDocument as you have does not influence what the parser does - loading HTML (or XML) replaces both the xml version and encoding that you gave its constructor.


Workarounds:

First use mb_convert_encoding() to translate anything above the ASCII range into its html entity equivalent.

$domdoc->loadHTML(mb_convert_encoding($mystr, 'HTML-ENTITIES', 'UTF-8'));

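Or hack in a meta tag or XML declaration specifying UTF-8 (use one or the other, not both).

// option 1: HTML 4 style meta tag
$domdoc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />' . $mystr);
// option 2: XML declaration
$domdoc->loadHTML('<?xml encoding="UTF-8"?>' . $mystr);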

PHP DOMDocument loadHTML UTF-8 encoding correctly with HTML5 doctype

I found why.

The DOM extension was built on libxml2, whose HTML parser was made for HTML 4. If the document uses an HTML5 doctype and a meta element like <meta charset="utf-8">, the HTML code will get interpreted as ISO-8859-something and non-ASCII chars will get converted into HTML entities.

However, the HTML4-like version will work: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Reference: UTF-8 with PHP DOMDocument loadHTML?
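To illustrate the HTML4-like variant, here is a small sketch that keeps the HTML5 doctype but declares the charset with the http-equiv form. The sample markup is made up for the example, and libxml errors are collected internally only to keep the output quiet:

```php
<?php
$html = '<!DOCTYPE html><html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '<title>Привет</title></head>'
      . '<body><p>Привет, мир</p></body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Both values come back as UTF-8 from the DOM API.
$title = $dom->getElementsByTagName('title')->item(0)->textContent;
$para  = $dom->getElementsByTagName('p')->item(0)->textContent;
var_dump($title === 'Привет', $para === 'Привет, мир');
```

The meta element has to appear before any non-ASCII content, since the parser switches encoding only when it reaches it.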

Encoding special chars, DOMDocument XML and PHP

The second argument of DOMDocument::createElement() is broken - it only escapes partly and it is not part of the W3C DOM standard. In DOM the text content is a node. You can just create it and append it to the element node. This works with other node types like CDATA sections or comments as well. DOMNode::appendChild() returns the appended node, so you can nest and chain the calls.

Additionally you can set the DOMElement::$textContent property. This will replace all descendant nodes with a single text node. Do not use DOMElement::$nodeValue - it has the same problems as the argument.

$document = new DOMDocument();
$document->formatOutput = true;
$root = $document->appendChild($document->createElement('foo'));
$root
    ->appendChild($document->createElement('one'))
    ->appendChild($document->createTextNode('"foo" & <bar>'));
$root
    ->appendChild($document->createElement('one'))
    ->textContent = '"foo" & <bar>';
$root
    ->appendChild($document->createElement('two'))
    ->appendChild($document->createCDATASection('"foo" & <bar>'));
$root
    ->appendChild($document->createElement('three'))
    ->appendChild($document->createComment('"foo" & <bar>'));

echo $document->saveXML();

Output:

<?xml version="1.0"?>
<foo>
  <one>"foo" &amp; &lt;bar&gt;</one>
  <one>"foo" &amp; &lt;bar&gt;</one>
  <two><![CDATA["foo" & <bar>]]></two>
  <three>
    <!--"foo" & <bar>-->
  </three>
</foo>

This will escape special characters (like & and <) as needed. Quotes do not need to be escaped in text content, so they won't be. Other special characters depend on the encoding.

$document = new DOMDocument("1.0", "UTF-8");
$document
    ->appendChild($document->createElement('foo'))
    ->appendChild($document->createTextNode('äöü'));
echo $document->saveXML();

$document = new DOMDocument("1.0", "ASCII");
$document
    ->appendChild($document->createElement('foo'))
    ->appendChild($document->createTextNode('äöü'));
echo $document->saveXML();

Output:

<?xml version="1.0" encoding="UTF-8"?>
<foo>äöü</foo>
<?xml version="1.0" encoding="ASCII"?>
<foo>&#228;&#246;&#252;</foo>

PHP How to read XML encoding from DOMDocument

I was searching for something complex, when it was actually pretty simple: $encoding = $dom->encoding.
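A short sketch of that lookup, assuming an inline sample document rather than a real file. The property holds the encoding named in the XML declaration and is null when none was declared, so a fallback is added here for illustration:

```php
<?php
// Sample document declared as ISO-8859-1 (\xE9 is "é" in that encoding).
$xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root>caf\xE9</root>";

$dom = new DOMDocument();
$dom->loadXML($xml);

// Encoding taken from the declaration; fall back to UTF-8 when it is missing.
$encoding = $dom->encoding ?? 'UTF-8';

// Whatever the source encoding was, node values are exposed as UTF-8.
$text = $dom->documentElement->textContent;

echo $encoding, "\n"; // ISO-8859-1
```

The declared encoding only affects parsing and serialization; strings read from or written to nodes are UTF-8 throughout.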

wrong characters encoding DOMDocument php

$html = '<html>سلام</html>';
$doc = new DOMDocument();

Convert the non-ASCII characters of the string $html to HTML entities and then load it into the DOM, using two predefined libxml constants (LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD).

The first one sets the HTML_PARSE_NOIMPLIED flag, which turns off the automatic adding of implied html/body elements (only available as of PHP 5.4.0 with libxml 2.7.7 or later).

The second one sets the HTML_PARSE_NODEFDTD flag, which prevents a default doctype from being added when one is not found. Using these constants helps you manage your parsing in a more flexible manner.

$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

Then you define the encoding of the DOM itself (the previous definition was for the input):

$doc->encoding = 'UTF-8';

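If LIBXML_HTML_NOIMPLIED is not available (it needs libxml 2.7.7 or later and PHP >= 5.4.0), the parser still adds the implied <html> and <body> wrappers. Either way, normalize the document and print only the node you need by passing it to saveHTML():

$doc->normalizeDocument(); // put the document into its normal form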
print $doc->saveHTML($doc->documentElement);

Have fun!

PHP DOMDocument saveHTML not encoding cyrillic correctly

The problem is with $dom->saveHTML(); you need to pass the root node as a parameter, like this:

return $dom->saveHTML((new \DOMXPath($dom))->query('/')->item(0));

Then suddenly it renders the page differently, with substitution. If it does not, double check the values of $dom->encoding and $dom->substituteEntities; they should read UTF-8 and TRUE.
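The difference can be checked with a small sketch like this one. The sample string is made up, the input is turned into numeric references first so that parsing does not depend on the assumed charset, and documentElement is used as the node argument instead of the XPath lookup:

```php
<?php
$html = '<p>Привет, мир</p>';

// Load as pure ASCII so the parser cannot misread the bytes.
$ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
$dom = new DOMDocument();
$dom->loadHTML($ascii);

// Whole-document serialization: non-ASCII text may come out as character references.
$whole = $dom->saveHTML();

// Serialization starting at a given node: text is written out as UTF-8.
$fromRoot = $dom->saveHTML($dom->documentElement);

var_dump(strpos($fromRoot, 'Привет, мир') !== false);
```

What $whole looks like depends on the document's encoding property and the libxml build, so only the node-based result is checked here.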


