PHP DOMDocument loadHTML not encoding UTF-8 correctly
DOMDocument::loadHTML() will treat your string as being in ISO-8859-1 (the HTTP/1.1 default character set) unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly.
If your string doesn't contain an XML encoding declaration, you can prepend one to cause the string to be treated as UTF-8:
$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();
If you cannot know whether the string already contains such a declaration, there's a workaround in SmartDOMDocument which should help you:
$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
echo $dom->saveHTML();
This is not a great workaround, but since not all characters can be represented in ISO-8859-1 (like the Japanese characters above), it's the safest alternative.
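Note that passing 'HTML-ENTITIES' to mb_convert_encoding() is deprecated as of PHP 8.2. A minimal sketch of one alternative, assuming the mbstring extension is available, is to turn every character above ASCII into a numeric reference with mb_encode_numericentity() before loading. The helper name loadUtf8Html is made up for this example and is not part of PHP or SmartDOMDocument:

```php
<?php
// Hypothetical helper: load a UTF-8 HTML string without relying on
// the deprecated 'HTML-ENTITIES' pseudo-encoding.
function loadUtf8Html(string $html): DOMDocument
{
    // Map every code point from U+0080 to U+10FFFF to a numeric reference.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $encoded = mb_encode_numericentity($html, $map, 'UTF-8');

    $dom = new DOMDocument();
    $dom->loadHTML($encoded);
    return $dom;
}

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = loadUtf8Html($profile);

// DOM strings are always returned as UTF-8.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

Because the input that reaches the parser is pure ASCII, the result should not depend on which default encoding the parser assumes.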
PHP encoding with DOMDocument
Try:
$string = file_get_contents('your-xml-file.xml');
$string = mb_convert_encoding($string, 'utf-8', mb_detect_encoding($string));
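// Optional: if non-ASCII characters still cause problems, convert them to entities.
// Caveat: 'html-entities' can emit named entities (e.g. &eacute;) that XML does not define.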
$string = mb_convert_encoding($string, 'html-entities', 'utf-8');
$doc = new DOMDocument();
$doc->loadXML($string);
UTF-8 with PHP DOMDocument loadHTML?
The DOM extension was built on libxml2, whose HTML parser was made for HTML 4, the default encoding for which is ISO-8859-1. Unless it encounters an appropriate meta tag or XML declaration stating otherwise, loadHTML() will assume the content is ISO-8859-1.
Specifying the encoding when creating the DOMDocument as you have does not influence what the parser does - loading HTML (or XML) replaces both the xml version and encoding that you gave its constructor.
Workarounds:
First, use mb_convert_encoding() to translate anything above the ASCII range into its HTML entity equivalent.
$domdoc->loadHTML(mb_convert_encoding($mystr, 'HTML-ENTITIES', 'UTF-8'));
Or hack in a meta tag or XML declaration specifying UTF-8 (use one or the other).
$domdoc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />' . $mystr);
$domdoc->loadHTML('<?xml encoding="UTF-8" ?>' . $mystr);
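The difference can be checked with a small sketch. The sample string and the variable names ($misread, $correct) are made up for illustration, and the exact garbled result may vary with the libxml2 version:

```php
<?php
$mystr = '<p>café</p>'; // UTF-8 input

// The encoding passed to the constructor does not affect parsing.
$plain = new DOMDocument('1.0', 'UTF-8');
$plain->loadHTML($mystr);
$misread = $plain->getElementsByTagName('p')->item(0)->textContent;

// With the meta tag hint, the parser reads the bytes as UTF-8.
$hinted = new DOMDocument();
$hinted->loadHTML(
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $mystr
);
$correct = $hinted->getElementsByTagName('p')->item(0)->textContent;

var_dump($misread === 'café'); // expected to be false: bytes were not read as UTF-8
var_dump($correct === 'café'); // expected to be true
```

Keep in mind that the prepended meta element becomes part of the parsed document, so you may want to remove it again before saving.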
PHP DOMDocument loadHTML UTF-8 encoding correctly with HTML5 doctype
I found why.
The DOM extension was built on libxml2, whose HTML parser was made for HTML 4. If the document uses an HTML5 doctype and a meta element like <meta charset="utf-8">, the HTML code will get interpreted as ISO-8859-something and non-ASCII chars will get converted into HTML entities.
However, the HTML 4 style version will work: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Reference: UTF-8 with PHP DOMDocument loadHTML?
Encoding special chars, DOMDocument XML and PHP
The second argument of DOMDocument::createElement() is broken: it only escapes partly and it is not part of the W3C DOM standard. In DOM the text content is a node. You can just create it and append it to the element node. This works with other node types like CDATA sections or comments as well. DOMNode::appendChild() returns the appended node, so you can nest and chain the calls.
Additionally you can set the DOMElement::$textContent property. This will replace all descendant nodes with a single text node. Do not use DOMElement::$nodeValue; it has the same problems as the argument.
$document = new DOMDocument();
$document->formatOutput = true;
$root = $document->appendChild($document->createElement('foo'));
$root
->appendChild($document->createElement('one'))
->appendChild($document->createTextNode('"foo" & <bar>'));
$root
->appendChild($document->createElement('one'))
->textContent = '"foo" & <bar>';
$root
->appendChild($document->createElement('two'))
->appendChild($document->createCDATASection('"foo" & <bar>'));
$root
->appendChild($document->createElement('three'))
->appendChild($document->createComment('"foo" & <bar>'));
echo $document->saveXML();
Output:
<?xml version="1.0"?>
<foo>
  <one>"foo" &amp; &lt;bar&gt;</one>
  <one>"foo" &amp; &lt;bar&gt;</one>
  <two><![CDATA["foo" & <bar>]]></two>
  <three>
    <!--"foo" & <bar>-->
  </three>
</foo>
This will escape special characters (like & and <) as needed. Quotes do not need to be escaped in text nodes, so they won't be. Other special characters depend on the encoding.
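A quick way to convince yourself that the escaping is lossless is a round trip: serialize the element, parse the result again, and compare the text. This is only a sketch; the variable names are made up for the example:

```php
<?php
$value = '"foo" & <bar>';

// Build <one> with the value as a proper text node.
$document = new DOMDocument();
$element = $document->appendChild($document->createElement('one'));
$element->appendChild($document->createTextNode($value));

// Serialize only the element (no XML declaration).
$xml = $document->saveXML($element);
echo $xml, "\n"; // <one>"foo" &amp; &lt;bar&gt;</one>

// Parse the serialized form again and read the text back.
$reloaded = new DOMDocument();
$reloaded->loadXML($xml);
$roundTripped = $reloaded->documentElement->textContent;

var_dump($roundTripped === $value); // bool(true)
```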
$document = new DOMDocument("1.0", "UTF-8");
$document
->appendChild($document->createElement('foo'))
->appendChild($document->createTextNode('äöü'));
echo $document->saveXML();
$document = new DOMDocument("1.0", "ASCII");
$document
->appendChild($document->createElement('foo'))
->appendChild($document->createTextNode('äöü'));
echo $document->saveXML();
Output:
<?xml version="1.0" encoding="UTF-8"?>
<foo>äöü</foo>
<?xml version="1.0" encoding="ASCII"?>
<foo>&#228;&#246;&#252;</foo>
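The numeric references in the ASCII output carry the same information as the raw characters. A small sketch (variable names are illustrative) that serializes with an ASCII encoding and parses the result back:

```php
<?php
$document = new DOMDocument('1.0', 'ASCII');
$document
    ->appendChild($document->createElement('foo'))
    ->appendChild($document->createTextNode('äöü'));

// Characters outside ASCII are written as numeric character references.
$xml = $document->saveXML();

// Reading the document back yields the original UTF-8 text.
$reloaded = new DOMDocument();
$reloaded->loadXML($xml);
$text = $reloaded->documentElement->textContent;

var_dump($text === 'äöü'); // bool(true)
```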
PHP How to read XML encoding from DOMDocument
I was searching for something complex, when it was actually pretty simple: $encoding = $dom->encoding.
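As a small sketch, assuming an XML string that declares its encoding (the sample document is made up), the property reports the declared encoding while the DOM itself still hands back UTF-8 strings:

```php
<?php
// "\xE9" is the single ISO-8859-1 byte for "é".
$xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n<name>Andr\xE9</name>";

$dom = new DOMDocument();
$dom->loadXML($xml);

$encoding = $dom->encoding;                 // encoding declared in the document
$name = $dom->documentElement->textContent; // always returned as UTF-8

echo $encoding, "\n"; // ISO-8859-1
echo $name, "\n";     // André
```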
wrong characters encoding DOMDocument php
$html = '<html>سلام</html>';
$doc = new DOMDocument();
Convert the non-ASCII characters of the string $html to HTML entities and then load it into the DOM, using two predefined libxml constants (LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD).
The first one sets the HTML_PARSE_NOIMPLIED flag, which turns off the automatic adding of implied html/body elements (it is only available as of PHP 5.4.0).
The second one sets the HTML_PARSE_NODEFDTD flag, which prevents a default doctype from being added when one is not found. Using these constants helps you manage your parsing in a more flexible manner.
$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
Then you define the encoding of the DOM itself (the previous definition was for the input):
$doc->encoding = 'UTF-8';
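Normalize the document before saving. Note that normalizeDocument() does not itself strip the <html> and <body> wrappers; that is what LIBXML_HTML_NOIMPLIED does, and it requires libxml 2.7.7 or later (as of PHP >= 5.4.0):
$doc->normalizeDocument(); // puts the tree into its normal form, as if it had been saved and reloaded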
print $doc->saveHTML($doc->documentElement);
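To see what the two flags change, here is a minimal sketch that parses a fragment and inspects the resulting tree. The sample fragment and variable names are made up, and mb_encode_numericentity() is swapped in for the 'HTML-ENTITIES' conversion, which is deprecated as of PHP 8.2:

```php
<?php
$html = '<div><p>سلام</p></div>';

$doc = new DOMDocument();
$doc->loadHTML(
    // Turn everything above ASCII into numeric references first.
    mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8'),
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

// With LIBXML_HTML_NOIMPLIED the fragment's own element is the root,
// not an implied <html> wrapper.
$rootName = $doc->documentElement->nodeName;

// With LIBXML_HTML_NODEFDTD no default doctype is attached.
$hasDoctype = $doc->doctype !== null;

$text = $doc->getElementsByTagName('p')->item(0)->textContent;

var_dump($rootName, $hasDoctype, $text);
```

A single wrapping element is used on purpose here, so the fragment has one clear root.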
Have fun!
PHP DOMDocument saveHTML not encoding cyrillic correctly
The problem is with $dom->saveHTML(); you need to add the root node as a parameter, like this:
return $dom->saveHTML((new \DOMXPath($dom))->query('/')->item(0));
Then suddenly it renders the page differently, without substitution. If it does not, double-check the values of $dom->encoding and $dom->substituteEntities; they should read UTF-8 and TRUE.
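A sketch of the comparison, using a made-up sample string and $dom->documentElement as the node instead of the XPath query. The input is loaded via numeric references so that no encoding is recorded on the document:

```php
<?php
$html = '<html><body><p>Привет</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML(mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8'));

// Whole-document serialization: may replace Cyrillic with entities
// when the document has no encoding set.
$whole = $dom->saveHTML();

// Node-based serialization: expected to keep the characters as UTF-8.
$fromRoot = $dom->saveHTML($dom->documentElement);

echo $whole, "\n";
echo $fromRoot, "\n";
```

Either form decodes to the same text in a browser; the node-based call mainly matters when you need readable UTF-8 in the serialized markup itself.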