PHP DOMDocument failing to handle utf-8 characters (☆)
DOMDocument::loadHTML() expects an HTML string.
HTML uses the ISO-8859-1 encoding (ISO Latin Alphabet No. 1) as its default per the specs, and has done so for a long time; see 6.1. The HTML Document Character Set. In practice, common web browsers treat that default as Windows-1252.
I go back that far because PHP's DOMDocument is based on libxml, which brings the HTMLparser that is designed for HTML 4.0.
So I'd say it's safe to assume that you can load an ISO-8859-1 encoded string.
Your string is UTF-8 encoded. Turn all characters higher than 127 (0x7F) into HTML entities and you're fine. If you don't want to do that on your own, that is what mb_convert_encoding with the HTML-ENTITIES target encoding does:
- Characters that have a named entity get the named entity, e.g.
€ -> &euro;
- The others get their numeric (decimal) entity, e.g.
☆ -> &#9734;
The following code example makes the process a bit more visible by using a callback function:
$html = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
    list($utf8) = $match;
    $entity = mb_convert_encoding($utf8, 'HTML-ENTITIES', 'UTF-8');
    printf("%s -> %s\n", $utf8, $entity);
    return $entity;
}, $html);
For your string, this outputs:
☆ -> &#9734;
☆ -> &#9734;
☆ -> &#9734;
Anyway, that's just for looking deeper into your string. You want to either have it converted into an encoding loadHTML can deal with, or tell loadHTML which encoding it is. The former can be done by converting everything outside of US-ASCII into HTML entities:
$us_ascii = mb_convert_encoding($utf_8, 'HTML-ENTITIES', 'UTF-8');
Take care that your input is actually UTF-8 encoded. If your input has mixed encodings (which can happen with some inputs), note that mb_convert_encoding can only handle one encoding per string. I already outlined above how to do more specific string replacements with the help of regular expressions, so I'll leave further details out for now.
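Note that PHP 8.2 and later report the HTML-ENTITIES target of mb_convert_encoding as deprecated. As a minimal sketch of the same idea, the non-ASCII characters can be turned into numeric entities with mb_encode_numericentity instead. The function name to_ascii_entities is made up for this example, and unlike HTML-ENTITIES this variant never produces named entities such as &euro;, only decimal ones.

```php
<?php
// Hypothetical helper (not from the answer above): encode every code point
// above US-ASCII as a decimal numeric entity. Input is assumed to be UTF-8.
function to_ascii_entities(string $utf8): string
{
    // Map format: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}
```

Markup characters such as < and & are below 0x80 and pass through untouched, so the HTML structure of the string is kept.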
The other alternative is to hint the encoding. In your case this can be done by modifying the document and adding a
<meta http-equiv="content-type" content="text/html; charset=utf-8">
which is a Content-Type specifying a charset. That is also best practice for HTML strings that are not served by a webserver (e.g. saved on disk or held in a string like in your example). The webserver normally sets that as the response header.
If you don't mind the warnings about the misplaced element, you can just add it in front of the string:
$dom = new DomDocument();
$dom->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">'.$html);
Per the HTML 2.0 specs, elements that can only appear in the <head> section of a document will be automatically placed there. This is what happens here, too. The output (pretty-printed):
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<meta charset="utf-8">
<title>Test!</title>
</head>
<body>
<h1>☆ Hello ☆ World ☆</h1>
</body>
</html>
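As a sketch of the hinting approach, the following wraps it in a function and silences the parser warnings with libxml_use_internal_errors. The function name load_utf8_html is an assumption for this example, and so is the decision to discard the collected warnings rather than inspect them.

```php
<?php
// Hypothetical helper: parse a UTF-8 HTML string by prepending a charset hint.
function load_utf8_html(string $html): DOMDocument
{
    $hint = '<meta http-equiv="content-type" content="text/html; charset=utf-8">';

    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($hint . $html);

    // Drop the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$dom = load_utf8_html('<h1>☆ Hello ☆ World ☆</h1>');
$heading = $dom->getElementsByTagName('h1')->item(0)->textContent;
```

Text read back from the document, here through textContent, is UTF-8 again, so $heading should hold the original stars.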
php text encoding when GETting a webpage and then POSTing contents
The problem is in DomDocument: it doesn't properly handle UTF-8. Converting to HTML entities is the safest option, and it works like magic when outputting these characters back with echo (even using the CLI) or when URL-encoding them. Basically DomDocument doesn't accept UTF-8 but it outputs UTF-8, or so it seems. So it's a weird conversion that has to be made so that DomDocument undoes it and everything is back to normal again.
To do this, with $dom being a DomDocument, it's enough to do this on every call to $dom->loadHTML($p):
$dom->loadHTML(mb_convert_encoding($p, 'html-entities', mb_detect_encoding($p)));
This is explained better in this other question: PHP DomDocument failing to handle utf-8 characters (☆)
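The round trip described above can be checked with a short sketch. It assumes the input is already UTF-8 (so it skips mb_detect_encoding, which can return false when nothing matches) and swaps in mb_encode_numericentity for the 'html-entities' conversion, because PHP 8.2 and later report that target as deprecated. The helper name entities_for_dom is made up for this example.

```php
<?php
// Hypothetical helper: turn non-ASCII UTF-8 characters into numeric entities.
function entities_for_dom(string $utf8): string
{
    return mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

$p = '<p>☆ café</p>';

$dom = new DOMDocument();
$dom->loadHTML(entities_for_dom($p));

// DOMDocument decodes the entities and hands the text back as UTF-8.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
```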
UTF8 with file_get_contents()
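file_get_contents is often blamed for destroying UTF-8 encoding. It returns the bytes as they are, in whatever encoding the source used, so content that is not UTF-8 has to be converted.
Try something like this: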
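<?php
// Read a file or URL and return its contents converted to UTF-8.
function file_get_contents_utf8($fn) {
    $content = file_get_contents($fn);
    // Detect strictly among the listed candidates, then convert to UTF-8.
    return mb_convert_encoding($content, 'UTF-8',
        mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}
?>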
If this does not work, could you please give an example URL where it fails? (I checked the source of the FORCEUTF8 library; it does not look very efficient, and I guess this small function could do the same using only native PHP functions.)
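To illustrate what the detection step in the function above does, here is a small sketch on in-memory byte strings instead of a file. The helper name to_utf8 and the fallback to ISO-8859-1 when detection fails are assumptions made for this example, not part of the answer.

```php
<?php
// Hypothetical helper: convert bytes that are either UTF-8 or ISO-8859-1 to UTF-8.
function to_utf8(string $bytes): string
{
    // Strict detection among the candidates; returns false if none fits.
    $from = mb_detect_encoding($bytes, ['UTF-8', 'ISO-8859-1'], true);
    if ($from === false) {
        $from = 'ISO-8859-1'; // assumed fallback, adjust to your data
    }
    return mb_convert_encoding($bytes, 'UTF-8', $from);
}

$latin1 = "caf\xE9";      // "café" encoded as ISO-8859-1
$utf8   = "caf\xC3\xA9";  // "café" encoded as UTF-8
```

The ISO-8859-1 bytes are not valid UTF-8, so strict detection rules UTF-8 out and the string gets converted, while input that is already valid UTF-8 is expected to come back unchanged.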