PHP DOMDocument loadHTML Not Encoding UTF-8 Correctly

PHP DOMDocument loadHTML not encoding UTF-8 correctly

DOMDocument::loadHTML will treat your string as being in ISO-8859-1 (the HTTP/1.1 default character set) unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly.

If your string doesn't contain an XML encoding declaration, you can prepend one to cause the string to be treated as UTF-8:

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();

If you cannot know if the string will contain such a declaration already, there's a workaround in SmartDOMDocument which should help you:

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
echo $dom->saveHTML();

This is not a great workaround, but since not all characters can be represented in ISO-8859-1 (like these katakana), it's the safest alternative.
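The first approach above (prepending an encoding declaration) can be wrapped in a small helper. This is only a minimal sketch, not part of the original answer: the function name loadUtf8Html is hypothetical, it assumes the input is valid UTF-8 and does not already start with a declaration, and it removes the helper processing instruction again (where the parser exposes it as one) so that it does not end up in the saved output.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the original answer).
function loadUtf8Html(string $html): DOMDocument
{
    $dom = new DOMDocument();
    // Prepend the encoding hint so the string is parsed as UTF-8.
    $dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);

    // Collect the helper processing instruction(s) first, then remove them,
    // so the child node list is not modified while it is being iterated.
    $hints = [];
    foreach ($dom->childNodes as $node) {
        if ($node->nodeType === XML_PI_NODE) {
            $hints[] = $node;
        }
    }
    foreach ($hints as $node) {
        $dom->removeChild($node);
    }

    return $dom;
}

$dom = loadUtf8Html('<p>イリノイ州シカゴにて</p>');
echo $dom->getElementsByTagName('p')->item(0)->textContent, "\n";
```

Reading textContent rather than the saveHTML() output keeps the check independent of how the serializer chooses to escape characters.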

PHP DOMDocument loadHTML UTF-8 encoding correctly with HTML5 doctype

I found why.

The DOM extension was built on libxml2, whose HTML parser was made for HTML 4. If the document uses an HTML5 doctype and a meta element like <meta charset="utf-8">, the HTML will be interpreted as ISO-8859-something and non-ASCII characters will be converted into HTML entities.

However, the HTML4-like version will work: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Reference: UTF-8 with PHP DOMDocument loadHTML?

UTF-8 with PHP DOMDocument loadHTML?

The DOM extension was built on libxml2 whose HTML parser was made for HTML 4 - the default encoding for which is ISO-8859-1. Unless it encounters an appropriate meta tag or XML declaration stating otherwise loadHTML() will assume the content is ISO-8859-1.

Specifying the encoding when creating the DOMDocument as you have does not influence what the parser does - loading HTML (or XML) replaces both the xml version and encoding that you gave its constructor.


Workarounds:

First, use mb_convert_encoding() to translate anything above the ASCII range into its HTML entity equivalent.

$domdoc->loadHTML(mb_convert_encoding($mystr, 'HTML-ENTITIES', 'UTF-8'));

Or hack in a meta tag or xml declaration specifying UTF-8.

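// Either prepend a meta tag...
$domdoc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />' . $mystr);
// ...or prepend an XML declaration (use one of the two, not both).
$domdoc->loadHTML('<?xml encoding="UTF-8" ?>' . $mystr);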
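Note that PHP 8.2 deprecated the 'HTML-ENTITIES' pseudo-encoding of mb_convert_encoding(). A possible replacement for the first workaround is mb_encode_numericentity(), sketched here under the assumption that numeric entities are acceptable (it produces no named entities). The helper name toNumericEntities and the sample string are illustrative only.

```php
<?php
// Illustrative helper: encode every code point above ASCII as a decimal entity.
function toNumericEntities(string $utf8): string
{
    // Map format: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

$mystr = '<p>☆ Привет</p>';

$domdoc = new DOMDocument();
// The string handed to loadHTML() is now pure ASCII, so the parser's
// default encoding no longer matters.
$domdoc->loadHTML(toNumericEntities($mystr));

echo $domdoc->getElementsByTagName('p')->item(0)->textContent, "\n";
```

Because only code points from 0x80 upwards are mapped, the markup itself (angle brackets, quotes, ampersands) is left untouched.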

PHP DOMDocument saveHTML not encoding cyrillic correctly

The problem is with $dom->saveHTML(). You need to pass the root node as a parameter, like this:

return $dom->saveHTML((new \DOMXPath($dom))->query('/')->item(0));

Then it suddenly renders the page differently, with substitution. If it does not, double-check the values of $dom->encoding and $dom->substituteEntities; they should read UTF-8 and TRUE.
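To compare the two calls side by side, something like the following can be used. It is only a sketch: the sample string and variable names are made up, and the exact text each call produces can differ between PHP and libxml2 versions, so no expected output is shown.

```php
<?php
$dom = new DOMDocument();
// Encoding hint so the Cyrillic text is parsed as UTF-8.
$dom->loadHTML('<?xml encoding="UTF-8" ?><p>Привет, мир</p>');

// Serialize the whole document without an argument.
$whole = $dom->saveHTML();

// Serialize with the root node passed explicitly, as in the answer above.
$root = (new DOMXPath($dom))->query('/')->item(0);
$fromRoot = $dom->saveHTML($root);

echo $whole, "\n", $fromRoot, "\n";
```

If the first variant shows numeric entities where the second shows readable Cyrillic, the text was parsed correctly and only the serialization differs.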

PHP DOMDocument failing to handle utf-8 characters (☆)

DOMDocument::loadHTML() expects an HTML string.

HTML uses the ISO-8859-1 encoding (ISO Latin Alphabet No. 1) as its default per its specs. That has been the case for a long time; see 6.1. The HTML Document Character Set. In practice it is closer to the default support for Windows-1252 in common web browsers.

I go back that far because PHP's DOMDocument is based on libxml and that brings the HTMLparser which is designed for HTML 4.0.

I'd say it's safe to assume then that you can load an ISO-8859-1 encoded string.

Your string is UTF-8 encoded. Turn all characters higher than 127 (0x7F) into HTML entities and you're fine. If you don't want to do that on your own, that is what mb_convert_encoding with the HTML-ENTITIES target encoding does:

  • Those characters that have named entities will get the named entity, e.g. € -> &euro;
  • The others get their numeric (decimal) entity, e.g. ☆ -> &#9734;

The following is a code example that makes the progress a bit more visible by using a callback function:

$html = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function($match) {
    list($utf8) = $match;
    $entity = mb_convert_encoding($utf8, 'HTML-ENTITIES', 'UTF-8');
    printf("%s -> %s\n", $utf8, $entity);
    return $entity;
}, $html);

For your string, this outputs:

☆ -> &#9734;
☆ -> &#9734;
☆ -> &#9734;

Anyway, that's just for looking deeper into your string. What you want is to have it converted into an encoding loadHTML can deal with. That can be done by converting everything outside of US-ASCII into HTML entities:

$us_ascii = mb_convert_encoding($utf_8, 'HTML-ENTITIES', 'UTF-8');

Take care that your input is actually UTF-8 encoded. If you have mixed encodings (that can happen with some inputs), note that mb_convert_encoding can only handle one encoding per string. I already outlined above how to do string replacements more specifically with the help of regular expressions, so I'll leave further details for now.
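One way to guard against input that is not valid UTF-8 is to check it before converting. The following is a rough sketch, assuming that anything failing the check is Windows-1252. That fallback is a guess made for illustration and will be wrong for other legacy encodings. The name ensureUtf8 is hypothetical.

```php
<?php
// Hypothetical helper: return the input unchanged if it is valid UTF-8,
// otherwise convert it from an ASSUMED legacy encoding.
function ensureUtf8(string $input, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    return mb_convert_encoding($input, 'UTF-8', $fallback);
}

// "\xE9" on its own is not a valid UTF-8 sequence; in Windows-1252 it is "é".
$legacy = "caf\xE9";
echo ensureUtf8($legacy), "\n";
```

This handles a string that is wholly in one encoding. It does not repair strings that mix encodings, which still need the more targeted replacement approach described above.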

The other alternative is to hint the encoding. This can be done in your case by modifying the document and adding a

<meta http-equiv="content-type" content="text/html; charset=utf-8">

which is a Content-Type specifying a charset. That is also best practice for HTML strings that are not served by a webserver (e.g. saved on disk or held in a string like in your example). The webserver normally sets that as the response header.

If you don't care about the misplaced-element warnings, you can just add it in front of the string:

$dom = new DomDocument();
$dom->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">'.$html);

Per the HTML 2.0 specs, elements that can only appear in the <head> section of a document, will be automatically placed there. This is what happens here, too. The output (pretty-print):

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<meta charset="utf-8">
<title>Test!</title>
</head>
<body>
<h1>☆ Hello ☆ World ☆</h1>
</body>
</html>
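If the parser warnings are a concern, they can be collected instead of being emitted as PHP warnings. A minimal sketch using libxml_use_internal_errors(), with a made-up sample string; which messages are reported, if any, depends on the input and the libxml2 version:

```php
<?php
$html = '<title>Test!</title><h1>☆ Hello ☆ World ☆</h1>';

// Collect parser messages internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $html);

// Each entry is a LibXMLError object; print its line and message.
foreach (libxml_get_errors() as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Empty the error buffer and restore the previous error handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

echo $dom->getElementsByTagName('h1')->item(0)->textContent, "\n";
```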

PHP encoding with DOMDocument

Try:

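$string = file_get_contents('your-xml-file.xml');
// normalize to UTF-8 (mb_detect_encoding() only guesses the source encoding)
$string = mb_convert_encoding($string, 'utf-8', mb_detect_encoding($string));
// if you have not escaped entities use
// (note: XML itself predefines only &lt; &gt; &amp; &apos; &quot; as named entities)
$string = mb_convert_encoding($string, 'html-entities', 'utf-8');
$doc = new DOMDocument();
$doc->loadXML($string);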

