PHP DOMDocument loadHTML not encoding UTF-8 correctly
DOMDocument::loadHTML will treat your string as being in ISO-8859-1 (the HTTP/1.1 default character set) unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly.
If your string doesn't contain an XML encoding declaration, you can prepend one to cause the string to be treated as UTF-8:
$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();
If you cannot know if the string will contain such a declaration already, there's a workaround in SmartDOMDocument which should help you:
$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
echo $dom->saveHTML();
This is not a great workaround, but since not all characters can be represented in ISO-8859-1 (like the Japanese characters above), it's the safest alternative.
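On PHP 8.2 and later, passing 'HTML-ENTITIES' to mb_convert_encoding() raises a deprecation notice. Below is a minimal sketch of the same idea using mb_encode_numericentity() instead, which replaces every code point above ASCII with a decimal numeric entity. The helper name utf8ToNumericEntities and the sample string are hypothetical, not part of the original answer.

```php
<?php
// Hypothetical helper: turn every non-ASCII character of a UTF-8 string
// into a decimal numeric entity so the markup handed to loadHTML() is pure ASCII.
function utf8ToNumericEntities(string $html): string
{
    // Map format: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($html, $map, 'UTF-8');
}

$profile = '<p>シカゴ ☆</p>'; // sample input, assumed to be valid UTF-8

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML(utf8ToNumericEntities($profile));

// Read the text back through the DOM, which stores strings as UTF-8.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

Unlike the 'HTML-ENTITIES' conversion, this variant never emits named entities, only numeric ones.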
PHP DOMDocument loadHTML UTF-8 encoding correctly with HTML5 doctype
I found why.
The DOM extension was built on libxml2, whose HTML parser was made for HTML 4. If the document uses an HTML5 doctype and a meta element like <meta charset="utf-8">, the HTML code will get interpreted as ISO-8859-something and non-ASCII chars will get converted into HTML entities.
However, the HTML4-like version will work: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Reference: UTF-8 with PHP DOMDocument loadHTML?
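A small sketch of the HTML4-style declaration inside a complete document. The sample page is made up for illustration; the text is read back through the DOM to check that it survived parsing.

```php
<?php
// Made-up sample page that declares its charset the HTML4 way.
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Привет ☆</title>
</head>
<body><p>イリノイ州</p></body>
</html>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);

// Read both strings back; the DOM hands them out as UTF-8.
$title = $dom->getElementsByTagName('title')->item(0)->textContent;
$para  = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $title, "\n", $para, "\n";
```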
UTF-8 with PHP DOMDocument loadHTML?
The DOM extension was built on libxml2 whose HTML parser was made for HTML 4 - the default encoding for which is ISO-8859-1. Unless it encounters an appropriate meta tag or XML declaration stating otherwise, loadHTML() will assume the content is ISO-8859-1.
Specifying the encoding when creating the DOMDocument as you have does not influence what the parser does - loading HTML (or XML) replaces both the xml version and encoding that you gave its constructor.
Workarounds:
First, use mb_convert_encoding() to translate anything above the ASCII range into its HTML entity equivalent.
$domdoc->loadHTML(mb_convert_encoding($mystr, 'HTML-ENTITIES', 'UTF-8'));
Or hack in a meta tag or xml declaration specifying UTF-8.
$domdoc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />' . $mystr);
$domdoc->loadHTML('<?xml encoding="UTF-8">' . $mystr);
PHP DOMDocument saveHTML not encoding cyrillic correctly
The problem is with $dom->saveHTML(); you need to add the root node as a parameter, like this:
return $dom->saveHTML((new \DOMXPath($dom))->query('/')->item(0));
Then it suddenly renders the page differently, with substitution. If it does not, double check the values of $dom->encoding and $dom->substituteEntities; they should read UTF-8 and TRUE.
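A minimal sketch of passing a node to saveHTML(). It uses $dom->documentElement as the node, which is a variation on the XPath query shown above rather than the exact call from the answer, and the sample markup is hypothetical. Whether the characters come out raw or as entities can depend on the PHP and libxml versions, so no particular output is claimed here.

```php
<?php
// Hypothetical sample: a fragment with an HTML4-style charset hint in front.
$html = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '<p>Привет, мир</p>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);

// Serialise starting from the root element instead of the whole document.
$out = $dom->saveHTML($dom->documentElement);
echo $out, "\n";
```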
PHP DOMDocument failing to handle utf-8 characters (☆)
DOMDocument::loadHTML() expects an HTML string.
HTML uses the ISO-8859-1 encoding (ISO Latin Alphabet No. 1) as the default per its specs. That has been the case for a long time, see 6.1. The HTML Document Character Set. In reality, the default that common web browsers support is closer to Windows-1252.
I go back that far because PHP's DOMDocument is based on libxml, which brings the HTMLparser that is designed for HTML 4.0.
I'd say it's safe to assume then that you can load an ISO-8859-1 encoded string.
Your string is UTF-8 encoded. Turn all characters higher than 127 (0x7F) into HTML entities and you're fine. If you don't want to do that on your own, that is what mb_convert_encoding with the HTML-ENTITIES target encoding does:
- Those characters that have named entities will get the named entity, e.g. € -> &euro;
- The others get their numeric (decimal) entity, e.g. ☆ -> &#9734;
The following is a code example that makes the process a bit more visible by using a callback function:
$html = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function($match) {
list($utf8) = $match;
$entity = mb_convert_encoding($utf8, 'HTML-ENTITIES', 'UTF-8');
printf("%s -> %s\n", $utf8, $entity);
return $entity;
}, $html);
For your string, this outputs, for example:
☆ -> &#9734;
☆ -> &#9734;
☆ -> &#9734;
Anyway, that's just for looking deeper into your string. You want to have it converted into an encoding that loadHTML can deal with. One option is to convert everything outside of US-ASCII into HTML entities:
$us_ascii = mb_convert_encoding($utf_8, 'HTML-ENTITIES', 'UTF-8');
Take care that your input is actually UTF-8 encoded. mb_convert_encoding can only handle one encoding per string, so mixed encodings (which can happen with some inputs) will not convert cleanly. I already outlined above how to do more specific string replacements with the help of regular expressions, so I leave further details for now.
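One way to act on that advice is to validate the bytes before converting. The sketch below assumes that input which is not valid UTF-8 is Windows-1252; that fallback and the helper name ensureUtf8 are assumptions for illustration, and the helper does not attempt to repair strings that mix encodings.

```php
<?php
// Hypothetical helper: return $input as UTF-8, converting from $fallback
// only when the bytes are not already valid UTF-8.
function ensureUtf8(string $input, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input; // already valid, leave untouched
    }
    // Assumption: anything that is not valid UTF-8 is in the fallback encoding.
    return mb_convert_encoding($input, 'UTF-8', $fallback);
}

$singleByte = "caf\xE9";     // "café" as a single-byte string, not valid UTF-8
$utf8       = "caf\xC3\xA9"; // the same word, already UTF-8

var_dump(ensureUtf8($singleByte) === $utf8);
var_dump(ensureUtf8($utf8) === $utf8);
```

The result of ensureUtf8() can then be passed to either of the approaches described here.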
The other alternative is to hint the encoding. This can be done in your case by modifying the document and adding a
<meta http-equiv="content-type" content="text/html; charset=utf-8">
which is a Content-Type specifying a charset. That is also best practice for HTML strings that are not available via a webserver (e.g. saved on disk or inside a string like in your example). The webserver normally sets that as the response header.
If you don't care about the misplaced warnings, you can just add it in front of the string:
$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">'.$html);
Per the HTML 2.0 specs, elements that can only appear in the <head> section of a document will be automatically placed there. This is what happens here, too. The output (pretty-print):
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<meta charset="utf-8">
<title>Test!</title>
</head>
<body>
<h1>☆ Hello ☆ World ☆</h1>
</body>
</html>
PHP encoding with DOMDocument
Try:
$string = file_get_contents('your-xml-file.xml');
$string = mb_convert_encoding($string, 'utf-8', mb_detect_encoding($string));
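// if you have not escaped entities use the line below; note that named HTML
// entities (e.g. &euro;) are not defined in XML and will make loadXML() fail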
$string = mb_convert_encoding($string, 'html-entities', 'utf-8');
$doc = new DOMDocument();
$doc->loadXML($string);