How to Keep the Chinese or Other Foreign Language as They Are Instead of Converting Them into Codes

DOMDocument seems to convert Chinese characters into codes [...]. How can I keep the Chinese or other foreign language as they are instead of converting them into codes?

$dom = new DOMDocument();
$dom->loadHTML($html);

If you're using the loadHTML() function to load an HTML chunk: by default, DOMDocument expects that string to be in HTML's default encoding (ISO-8859-1). However, the charset is most often meta-information provided next to the string you're using (for example, in an HTTP header), not inside it. To make this more complicated, that meta-information may even be inside the string.

Anyway, as you have not shared the HTML string data and have not specified the encoding, it's hard to tell specifically what is going on.

I assume the HTML is UTF-8 encoded, but this is not signalled within the HTML string. In that case, the following work-around can help:

$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

// dirty fix: drop the processing instruction again
foreach ($doc->childNodes as $item) {
    if ($item->nodeType == XML_PI_NODE) {
        $doc->removeChild($item); // remove hack
    }
}
$doc->encoding = 'UTF-8'; // insert proper encoding

It injects an encoding hint at the very beginning (and removes it again after the HTML has been loaded). From that point on, DOMDocument will return UTF-8 (as it always does).
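To verify, a minimal sketch (the sample markup is my own assumption for illustration):

$html = '<p>这是一句话。</p>'; // UTF-8 bytes, no charset declared inside
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
echo $doc->saveHTML($doc->getElementsByTagName('p')->item(0));
// prints <p>这是一句话。</p> rather than numeric entities such as &#36825;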

Prevent/workaround browser converting '\n' between lines into space (for Chinese characters)

Browsers treat newlines as spaces because the specifications say so, and have done ever since HTML 2.0. In fact, HTML 2.0 was milder than later specifications; it said: “An HTML user agent should treat end of line in any of its variations as a word space in all contexts except preformatted text.” (Conventional Representation of Newlines), whereas newer specifications state this more strongly (describing it as what happens in HTML).

The background is that HTML and the Web was developed with mainly Western European languages in mind; this is reflected in many features of the original specifications and early implementations. Only slowly have they been internationalized.

It is unlikely that the parsing rules will be changed. What might happen instead is rendering that is sensitive to language or character properties. This would mean that a line break still gets taken as a space (and the DOM string will contain an ASCII space character), but a string like 这是 一句话。 would be rendered as if the space were not there. This is what the HTML 4.01 specification seems to refer to (White space). The text is somewhat confused, but I think it tries to say that the behavior would depend on the content language, either inferred by the browser or as declared in markup.

But browsers don’t do such things yet. Declaring the language of content, e.g. <html lang=zh>, is a good principle but has little practical impact. In rendering, it may affect the browser’s choice of a default font (but how many authors let browsers use their default fonts?). It may even result in added spacing, if the space character happens to be wider in the browser’s default font for the language specified.

According to the CSS3 Text draft, you could use the text-spacing property. The value none “Turns off all text-spacing features. All fullwidth characters are set with full-width glyphs.” Unfortunately, no browser seems to support this yet.
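Until browsers support that, a server-side workaround is to strip the line break when it sits between two CJK characters, before the markup is ever sent. A rough PHP sketch (the character class covers only the main CJK Unified Ideographs block, which is an assumption you may need to widen):

// remove a newline, plus surrounding whitespace, only when it is flanked
// by CJK ideographs, so Western text keeps its word spaces
$html = preg_replace(
    '/(?<=[\x{4E00}-\x{9FFF}])\s*\n\s*(?=[\x{4E00}-\x{9FFF}])/u',
    '',
    $html
);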

Can I get a code page from a language preference?

FWIW, this is what I ended up doing:

  1. #define _CONVERSION_DONT_USE_THREAD_LOCALE // force CP_ACP, *not* CP_THREAD_ACP, for MFC CString auto-converters!!!
  2. In application startup, construct the desired locale: m_locale(FStringA(".%u", GetACP()).GetString(), LC_CTYPE)
  3. Force it to agree with GetACP(): // force C++ and C libraries based on setlocale() to use the system locale for narrow strings
    m_locale = ::std::locale::global(m_locale); // store the previous global so we can restore it before termination and avoid a leak

This gives me relatively ideal use of MFC's built-in narrow<->wide conversions in CString to automatically use the user's default language when converting to or from MBCS strings for the current locale.

Note: m_locale is type ::std::locale

How to encode and decode Broken Chinese/Unicode characters?

What is happening when you save the "bad" string in a text file with a meta tag declaring the correct encoding is this: your text editor saves the file with Windows-1252 encoding, but the browser reads the file and interprets it as UTF-8. Since the "bad" string is the result of incorrectly decoding UTF-8 bytes with the Windows-1252 encoding, you are reversing the process by encoding the file as Windows-1252 and decoding it as UTF-8.

Here's an example:

using System.Text;
using System.Windows.Forms;

namespace Demo
{
    class Program
    {
        static void Main(string[] args)
        {
            string s = "具有靜電產生裝置之影像輸入裝置"; // Unicode
            Encoding Windows1252 = Encoding.GetEncoding("Windows-1252");
            Encoding Utf8 = Encoding.UTF8;

            byte[] utf8Bytes = Utf8.GetBytes(s);                 // Unicode -> UTF-8
            string badDecode = Windows1252.GetString(utf8Bytes); // Mis-decode as Windows-1252
            MessageBox.Show(badDecode, "Mis-decoded");           // Shows your garbage string.

            string goodDecode = Utf8.GetString(utf8Bytes);       // Correctly decode as UTF-8
            MessageBox.Show(goodDecode, "Correctly decoded");

            // Recovering from the bad decode...
            byte[] originalBytes = Windows1252.GetBytes(badDecode);
            goodDecode = Utf8.GetString(originalBytes);
            MessageBox.Show(goodDecode, "Re-decoded");
        }
    }
}

Even with correct decoding, you'll still need a font that supports the characters being displayed. If your default font doesn't support Chinese, you still might not see the correct characters.

The correct thing to do is figure out why the string you have was decoded as Windows-1252 in the first place. Sometimes, though, data in a database is stored incorrectly to begin with and you have to resort to these games to fix the problem.
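For completeness, the same round-trip recovery can be sketched in PHP; here $bad is assumed to hold the garbled text, i.e. original UTF-8 bytes that were mis-decoded as Windows-1252:

// encoding the garbage back to Windows-1252 yields the original bytes,
// and those bytes are already valid UTF-8
$good = mb_convert_encoding($bad, 'Windows-1252', 'UTF-8');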

Coding in Other (Spoken) Languages

If I understood well, the question actually is: "does every single coder in the world know enough English to use the exact same reserved words as I do?"

Well... English is not the subject here; programming-language reserved words are. I mean, when I started about 10 years ago, I didn't have any clue of English, and still I was able to program simple things by learning the programming language, even when I did not know what the words meant (in English). As a matter of fact, this helped me to learn English.

For example, I knew that to do an "iteración" (an iteration, of course) I had to write:

 for( i = 0 ; i < 100 ; i++ ) {}

To me, "for", ";" and "++" were simply foreign words or symbols. Later I learned that "for" meant "para", "while" meant "mientras", etc. But in the meantime, I did not need to know English; what I needed to know was "C".

Of course when I needed to learn more things, I had to learn English, for the documentation is written in that language.

So the answer is: no, I don't see if, while, for, etc. in my native language. I see them in English, but they never meant anything to me other than what they mean in the programming language at hand.

It's like the switch statement in bash: case .. esac. What is "esac"? To me, it's simply the end of the switch statement in bash.

I guess that's what we call "abstraction".

PHP DOMDocument failing to handle utf-8 characters (☆)

DOMDocument::loadHTML() expects an HTML string.

HTML uses the ISO-8859-1 encoding (ISO Latin Alphabet No. 1) as its default per the specs. That has been the case since the early days; see 6.1. The HTML Document Character Set. In reality, what common web browsers support by default is closer to Windows-1252.

I go back that far because PHP's DOMDocument is based on libxml, which brings along an HTML parser designed for HTML 4.0.

I'd say it's safe to assume then that you can load an ISO-8859-1 encoded string.

Your string is UTF-8 encoded. Turn all characters above 127 / 0x7F into HTML entities and you're fine. If you don't want to do that on your own, that is what mb_convert_encoding with the HTML-ENTITIES target encoding does:

  • Those characters that have named entities will get the named entity, e.g. € -> &euro;
  • The others get their numeric (decimal) entity, e.g. ☆ -> &#9734;

The following code example makes the process a bit more visible by using a callback function:

$html = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
    list($utf8) = $match;
    $entity = mb_convert_encoding($utf8, 'HTML-ENTITIES', 'UTF-8');
    printf("%s -> %s\n", $utf8, $entity);
    return $entity;
}, $html);

For example, this outputs the following for your string:

☆ -> &#9734;
☆ -> &#9734;
☆ -> &#9734;

Anyway, that's just for looking deeper into your string. You want to either have it converted into an encoding loadHTML can deal with, which can be done by converting everything outside of US-ASCII into HTML entities:

$us_ascii = mb_convert_encoding($utf_8, 'HTML-ENTITIES', 'UTF-8');

Take care that your input actually is UTF-8 encoded. If you have mixed encodings (which can happen with some inputs), mb_convert_encoding can only handle one encoding per string. I already outlined above how to do more targeted string replacements with the help of regular expressions, so I'll leave further details aside for now.
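Put together, a minimal sketch of this first approach (assuming $html holds the UTF-8 markup from the question):

$us_ascii = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
$dom = new DOMDocument();
$dom->loadHTML($us_ascii); // pure US-ASCII now, so the default encoding does no harm
echo $dom->saveHTML();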

The other alternative is to hint the encoding. This can be done in your case by modifying the document and adding a

<meta http-equiv="content-type" content="text/html; charset=utf-8">

which is a Content-Type header value specifying a charset. That is also best practice for HTML strings that are not delivered via a webserver (e.g. saved on disk, or inside a string as in your example); the webserver normally sets this in the response header.

If you don't care about the warnings for the misplaced tag, you can just add it in front of the string:

$dom = new DomDocument();
$dom->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">'.$html);

Per the HTML 2.0 specs, elements that can only appear in the <head> section of a document will be automatically placed there. This is what happens here, too. The output (pretty-printed):

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<meta charset="utf-8">
<title>Test!</title>
</head>
<body>
<h1>☆ Hello ☆ World ☆</h1>
</body>
</html>

SimpleXML and Chinese

What you describe sounds like an encoding issue. Encoding is like a chain: if it gets broken at one part of the processing, the data can be damaged.

When you request the data from the RSS server, you will get the data in a specific character encoding. The first thing you should find out is the encoding of that data.

Data URL: http://tw.blog.search.yahoo.com/rss?ei=UTF-8&p=%E6%95%B8%E4%BD%8D%E6%99%82%E4%BB%A3%20%E9%9B%9C%E8%AA%8C&pvid=QAEnPXeg.ioIuO7iSzUg9wQIc1LBPk3uWh8ABnsa

According to the website headers, the encoding is UTF-8. This is the standard XML encoding.
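If you want to check those headers yourself, a small sketch (assuming the server sends a charset parameter in its Content-Type and no redirects are involved):

// $data_url is the feed URL quoted above
$headers = get_headers($data_url, 1);
echo $headers['Content-Type']; // e.g. "text/html; charset=UTF-8"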

However if the data is not UTF-8 encoded while the headers are saying so, you need to find out the correct encoding of the data and bring it into UTF-8 before you go on.
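A hedged sketch of that normalization step; mb_detect_encoding() is heuristic, and the candidate list here is an assumption you should adjust to your sources:

// $xml_data holds the raw bytes fetched from the feed
$charset = mb_detect_encoding($xml_data, ['UTF-8', 'BIG-5', 'ISO-8859-1'], true);
if ($charset !== false && $charset !== 'UTF-8') {
    $xml_data = mb_convert_encoding($xml_data, 'UTF-8', $charset);
}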

The next thing to check is whether simplexml_load_string() is able to deal with UTF-8 data.

I do not use simplexml; I use DomDocument, so I cannot say whether it does or not. However, I can suggest using DomDocument instead: it definitely supports UTF-8 for loading, and all data it returns is encoded in UTF-8 as well. That said, you can safely assume that simplexml handles UTF-8 properly, too.

The next part of the chain is your display. You write that your data is broken. How can you tell? How do you inspect the simplexml object?


Revisiting the Encoding Chain

As written, encoding is like a chain. If one element breaks, the overall result is damaged. To find out where it breaks, each element has to be checked on its own. The encoding you aim for here is UTF-8.

  • Input Data: all checks OK:
    • Check: Does the data seem to be UTF-8 encoded? Result: Yes. The input data acquired from the given data URL validates as UTF-8. This could be properly tested with the data provided.
    • Check: Does the raw XML data mark itself as being UTF-8 encoded? Result: Yes. This could be verified within the first bytes, which are: <?xml version="1.0" encoding="UTF-8" ?>.
  • SimpleXML Data:
    • Check: Does simplexml support the UTF-8 encoding? Result: Yes.
    • Check: Does simplexml return values in the UTF-8 encoding? Result: Yes and no. Generally, simplexml supports properties containing text that is UTF-8 encoded; however, a var_dump() of the simplexml object instance built from the XML data suggests that it does not pick up CDATA. CDATA is used in the data in question, and those elements get dropped.

At this point, this looks like the error you are facing. However, you can convert all CDATA elements into text. To do this, you need to specify an option when loading the XML data: a constant called LIBXML_NOCDATA, which merges CDATA into text nodes.

The following is the example code I used for the tests above; it demonstrates how to use the option:

$data_url = 'http://tw.blog.search.yahoo.com/rss?ei=UTF-8&p=%E6%95%B8%E4%BD%8D%E6%99%82%E4%BB%A3%20%E9%9B%9C%E8%AA%8C&pvid=QAEnPXeg.ioIuO7iSzUg9wQIc1LBPk3uWh8ABnsa';
$xml_data = file_get_contents($data_url);

$inspect = 256;
echo "First $inspect bytes out of ", strlen($xml_data), ":\n",
    wordwrap(substr($xml_data, 0, $inspect)), "\n";
echo "UTF-8 test: ", var_dump(can_be_valid_utf8_statemachine($xml_data)), "\n";

$simple_xml = simplexml_load_string($xml_data, null, LIBXML_NOCDATA);
var_dump($simple_xml);

/**
 * Bitwise check whether a string would validate as UTF-8.
 *
 * @param string $str
 * @return bool
 */
function can_be_valid_utf8_statemachine($str) {
    $length = strlen($str);
    for ($i = 0; $i < $length; $i++) {
        $c = ord($str[$i]);
        if ($c < 0x80) $n = 0;               # 0bbbbbbb (plain ASCII)
        elseif (($c & 0xE0) == 0xC0) $n = 1; # 110bbbbb
        elseif (($c & 0xF0) == 0xE0) $n = 2; # 1110bbbb
        elseif (($c & 0xF8) == 0xF0) $n = 3; # 11110bbb
        elseif (($c & 0xFC) == 0xF8) $n = 4; # 111110bb
        else return false;                   # does not match any lead byte
        for ($j = 0; $j < $n; $j++) {        # do n continuation bytes (10bbbbbb) follow?
            if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80)) {
                return false;
            }
        }
    }
    return true;
}

I assume that this will fix your issue. If not, DomDocument is able to handle CDATA elements. As the encoding chain has not been tested further, you might still get encoding issues in later processing of the data, so take care to keep the encoding intact all the way to the output.
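For the last link of that chain, the output should declare UTF-8 as well. A minimal sketch (the channel->title path is an assumption about the RSS structure):

header('Content-Type: text/html; charset=utf-8');
echo htmlspecialchars((string)$simple_xml->channel->title, ENT_QUOTES, 'UTF-8');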


