PHP: Using DOMDocument, Whenever I Try to Write UTF-8 It Writes the Hexadecimal Notation of It

PHP: using DOMDocument whenever I try to write UTF-8 it writes the hexadecimal notation of it

Ok, here you go:

$dom = new DOMDocument('1.0', 'utf-8');
$dom->appendChild($dom->createElement('root'));
$dom->documentElement->appendChild(new DOMText('ירושלים'));
echo $dom->saveXml();

will work fine, because in this case, the document you constructed will retain the encoding specified as the second argument:

<?xml version="1.0" encoding="utf-8"?>
<root>ירושלים</root>

However, once you load XML that does not specify an encoding into the document, you lose anything you declared in the constructor, which means:

$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadXml('<root/>'); // missing prolog
$dom->documentElement->appendChild(new DOMText('ירושלים'));
echo $dom->saveXml();

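will not have an encoding of utf-8, so the non-ASCII characters are written as hexadecimal character references:

<?xml version="1.0"?>
<root>&#x5D9;&#x5E8;&#x5D5;&#x5E9;&#x5DC;&#x5D9;&#x5DD;</root>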

So if you loadXML something, make sure it starts with an XML prolog that declares the encoding:

$dom = new DOMDocument();
$dom->loadXml('<?xml version="1.0" encoding="utf-8"?><root/>');
$dom->documentElement->appendChild(new DOMText('ירושלים'));
echo $dom->saveXml();

and it will work as expected.

As an alternative, you can also specify the encoding after loading the document.
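
To compare the three cases above side by side, here is a minimal sketch. The helper name serializeWithText is hypothetical (it is not part of the DOM extension), and the sketch assumes the script file itself is saved as UTF-8.

```php
<?php
// Hypothetical helper: loads $xml, optionally sets the encoding after
// loading, appends some Hebrew text and returns the serialized document.
function serializeWithText(string $xml, ?string $encodingAfterLoad = null): string
{
    $dom = new DOMDocument();
    $dom->loadXML($xml);
    if ($encodingAfterLoad !== null) {
        $dom->encoding = $encodingAfterLoad;
    }
    $dom->documentElement->appendChild(new DOMText('ירושלים'));
    return $dom->saveXML();
}

$noProlog   = serializeWithText('<root/>');
$withProlog = serializeWithText('<?xml version="1.0" encoding="utf-8"?><root/>');
$setAfter   = serializeWithText('<root/>', 'utf-8');

// Without a declared encoding the text is escaped as character references;
// with one (from the prolog or set after loading) it is written as UTF-8.
echo $noProlog, $withProlog, $setAfter;
```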

DOMDocument with XPath Encoding problems. [?] A lot of tests

The problem is that you need to tell DOMDocument what the encoding is as the HTML is parsed. You can't do this by setting the encoding option. (I believe that affects how the document is output with saveHTML.)

The slightly hackish way to do this is to insert a statement of the encoding into the document. You can do this simply by inserting '<?xml encoding="UTF-8">' before the HTML you are parsing.

<?php

$msg = "<body><a>áéíóú☻♥♦♣</a></body>";
$temp_dom = new DOMDocument();

$temp_dom->loadHTML('<?xml encoding="UTF-8">' . $msg);
$temp_dom->encoding = 'UTF-8';
$dom_xpath = new DOMXPath($temp_dom);
$ele = $dom_xpath->query('//a')->item(0);

echo "<pre>";
echo "Original: $msg\n";
echo $ele->nodeValue;
echo "</pre>";

Output:

<pre>Original: <body><a>áéíóú☻♥♦♣</a></body>
áéíóú☻♥♦♣</pre>

Note, however, that this does insert an extra node as a child of the document object (a DOMProcessingInstruction to be precise), so be aware of this if you are doing anything with $temp_dom->childNodes or suchlike.
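
If the extra node gets in the way, one option is to remove it once parsing is done. The following is only a sketch of that idea: it assumes the prepended declaration ends up as a processing-instruction child of the document, as described above, and how the parser treats it may vary between libxml2 versions.

```php
<?php
$html = '<body><a>áéíóú</a></body>';

$dom = new DOMDocument();
// The prepended pseudo-declaration tells the HTML parser the input is UTF-8.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);

// Copy the live node list first so that removals do not disturb the loop.
foreach (iterator_to_array($dom->childNodes) as $node) {
    if ($node->nodeType === XML_PI_NODE) {
        $dom->removeChild($node);
    }
}
$dom->encoding = 'UTF-8';

// Count any processing instructions still attached to the document.
$piCount = 0;
foreach ($dom->childNodes as $node) {
    if ($node instanceof DOMProcessingInstruction) {
        $piCount++;
    }
}
echo "Processing instructions left: $piCount\n";
```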

wrong characters encoding DOMDocument php

$html = '<html>سلام</html>';
$doc = new DOMDocument();

Convert the string $html from UTF-8 to HTML entities, then load it into the DOM using two libxml predefined constants (LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD).

The first one sets the HTML_PARSE_NOIMPLIED flag, which turns off the automatic adding of implied html/body... elements (only available as of PHP 5.4.0 with libxml 2.7.7 or later).

The second one sets the HTML_PARSE_NODEFDTD flag, which prevents a default doctype being added when one is not found (available as of PHP 5.4.0 with libxml 2.7.8 or later). Using these constants helps you manage your parsing in a more flexible manner.

$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

Then you define the encoding of the DOM itself (the previous definition was for the input):

$doc->encoding = 'UTF-8';

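Finally, normalize the document and print it starting from the document element, which leaves out any doctype. Note that the two flags above require libxml 2.7.7 or later (PHP >= 5.4.0); without them the parser still adds the implied <html> and <body> wrappers:

$doc->normalizeDocument(); // Put the document into normal form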
print $doc->saveHTML($doc->documentElement);

Have fun!
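
One caveat for newer PHP versions: passing 'HTML-ENTITIES' to mb_convert_encoding is deprecated as of PHP 8.2. A possible substitute, sketched below, is mb_encode_numericentity, which turns non-ASCII characters into numeric character references before parsing. This is a swapped-in technique rather than the approach from the answer above, the variable names are only illustrative, and it assumes the mbstring extension is loaded.

```php
<?php
$html = '<html>سلام</html>';

// Map every code point from U+0080 to U+10FFFF to a decimal character reference.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$encoded = mb_encode_numericentity($html, $convmap, 'UTF-8');

$doc = new DOMDocument();
$doc->loadHTML($encoded, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$doc->encoding = 'UTF-8';

// The parser decodes the references back into the original characters.
$text = $doc->documentElement->textContent;
echo $encoded, "\n", $text, "\n";
```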

How do I make the ö character appear properly in an XML file created via PHP?

Without knowing what encoding you have in your database and what encoding you want in your XML output it's hard to be specific, but the iconv function could be useful to do the conversion.

Also, you should really consider using an XML DOM instead of outputting xml-as-plaintext with echo. Check out, for example, Reading and writing the XML DOM with PHP. If you don't, you will most likely end up with other strange problems with your XML output down the road.

Trust me, I've been there. :-)
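
As a rough sketch of both suggestions combined: the snippet below assumes the database returns ISO-8859-1 bytes (an assumption for illustration only, since the real encoding is unknown), converts them to UTF-8 with iconv, and builds the XML through DOMDocument rather than echo. The element names are made up for the example.

```php
<?php
// "Köln" as ISO-8859-1 bytes, standing in for a value read from the database.
$fromDatabase = "K\xF6ln";

// Convert to UTF-8, which is what DOMDocument expects for its string input.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $fromDatabase);

$dom  = new DOMDocument('1.0', 'UTF-8');
$root = $dom->appendChild($dom->createElement('cities'));
$city = $root->appendChild($dom->createElement('city'));
// Text nodes are escaped by the DOM, so no manual entity handling is needed.
$city->appendChild($dom->createTextNode($utf8));

$xml = $dom->saveXML();
echo $xml;
```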

php XML DOM translates special chars to &#xYY;

The easiest way to fix this is to set the encoding type after you have loaded the XML:

$dom = new DOMDocument();
$dom->loadXML($data);
$dom->encoding = 'utf-8';

echo $dom->saveXML();
exit();

You can also fix it by putting an XML declaration at the beginning of your data:

$data = '<?xml version="1.0" encoding="utf-8"?>' . $data;
$dom = new DOMDocument();
$dom->loadXML($data);

echo $dom->saveXML();
exit();

Clean hexadecimal entities in a XML doc via PHP

The numeric entities are added by SimpleXML because your XML document has no declared encoding:

// with declared encoding :
$xml = simplexml_load_string('<?xml version="1.0" encoding="utf-8"?><x></x>');
$xml->addChild('PROD_DESC', "à");
// result: <PROD_DESC>à</PROD_DESC>

// without declared encoding :
$xml = simplexml_load_string('<?xml version="1.0"?><x></x>');
$xml->addChild('PROD_DESC', "à");
// result: <PROD_DESC>&#xE0;</PROD_DESC>
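
If you cannot change the input, one possible workaround is to set the encoding on the underlying DOM document before serializing. This is a sketch that borrows the "set the encoding after loading" idea from the DOMDocument answers above and applies it to SimpleXML through dom_import_simplexml; it assumes the script file is saved as UTF-8.

```php
<?php
// Input without a declared encoding, as in the second case above.
$xml = simplexml_load_string('<?xml version="1.0"?><x></x>');
$xml->addChild('PROD_DESC', "à");

// Reach the DOMDocument behind the SimpleXML element and declare its encoding.
$dom = dom_import_simplexml($xml)->ownerDocument;
$dom->encoding = 'UTF-8';

$out = $dom->saveXML();
echo $out;
```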

