How to Convert \Uxxxx Unicode to Utf-8 Using Console Tools in *Nix

How to convert \uXXXX unicode to UTF-8 using console tools in *nix

I don't know which distribution you are using, but uni2ascii should be included.

$ sudo apt-get install uni2ascii

It only depend on libc6, so it's a lightweight solution (uni2ascii i386 4.18-2 is 55,0 kB on Ubuntu)!

Then to use it:

$ echo 'Character 1: \u0144, Character 2: \u00f3' | ascii2uni -a U -q
Character 1: ń, Character 2: ó

Convert unicode symbols to \uXXXX, not using json_encode

Since your current solution uses the u regex modifier, I'm assuming your input is encoded as UTF-8.

The following solution is definitely not simpler (apart from the regex) and I'm not even sure it's faster, but it's more low-level and shows the actual escaping procedure.

$input = preg_replace_callback('#[^\x00-\x7f]#u', function($m) {
$utf16 = mb_convert_encoding($m[0], 'UTF-16BE', 'UTF-8');
if (strlen($utf16) <= 2) {
$esc = '\u' . bin2hex($utf16);
}
else {
$esc = '\u' . bin2hex(substr($utf16, 0, 2)) .
'\u' . bin2hex(substr($utf16, 2, 2));
}
return $esc;
}, $input);

One fundamental problem is that PHP doesn't have an ord function that works with UTF-8. You either have to use mb_convert_encoding, or you have to roll your own UTF-8 decoder (see linked question) which would allow for additional optimizations. Two- and three-byte UTF-8 sequences map to a single UTF-16 code unit. Four-byte sequences require two code units (high and low surrogate).

If you're aiming for simplicity and readability, you probably can't beat the json_encode approach.

Convert unicode array to string in bash

The thing that's messing you up is that your array is full of the decimal numbers of the codepoints, but the \U notation takes hexidecimal numbers. So for example, the first element in the array is "72" -- in decimal, that's the code for "H", but in hex it's equivalent to decimal 114, which is the code for "r".

So to use \U notation, you first need to convert the numbers to hex, which you can do with printf %x:

for c in "${array[@]}"; do
temp+="\\U$(printf %x "$c")" # Convert dec->hex, add \U
done
printf %b "$temp" # Convert \U<codepoint> to actual characters

As dave_thompson_085 pointed out in a comment, you can simplify this even further by converting the entire array with a single printf:

printf %b "$(printf '\\U%x' "${array[@]}")"

encoding conversion tool

iconv

Curl replacing \u in response to \\u in c++

Your analysis is wrong. Libcurl is not escaping anything. Load the URL in a web browser of your choosing and look at the raw data that is actually being sent. For example, this is what I see in Firefox:

raw data

The server really is sending Mu\u1e25arram, not Muḥarram like you are expecting. And this is perfectly fine, because the server is sending back JSON data, and JSON is allowed to escape Unicode characters like this. Read the JSON spec, particularly Section 9 on how Unicode codepoints may be encoded using hexidecimal escape sequences (which is optional in JSON, but still allowed). \u1e25 is simply the JSON hex-escaped form of .

You are merely printing out the JSON content as-is, exactly as the server sent it. You are not actually parsing it at all. If you were to use an actual JSON parser, Mu\u1e25arram would be decoded to Muḥarram for you. For example, here is how Firefox parses the JSON:

parsed data

It is not libcurl's job to decode JSON data. Its job is merely to give you the data that the server sends. It is your job to interpret the data afterwards as needed.



Related Topics



Leave a reply



Submit