What's the Difference Between Hex Code (\x) and Unicode (\u) Chars?

What's the difference between hex code (\x) and Unicode (\u) chars?

The escape sequence \xNN inserts the raw byte NN into a string, whereas \uNN inserts the UTF-8 bytes for the Unicode code point NN into a UTF-8 string:

> charToRaw('\xA3')
[1] a3
> charToRaw('\uA3')
[1] c2 a3
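
Those c2 a3 bytes are a property of UTF-8 itself rather than of R. As a cross-language check, here is a minimal Java sketch (the class name is just illustrative) that encodes the code point U+00A3 and prints its UTF-8 bytes:

import java.nio.charset.StandardCharsets;

public class Utf8BytesDemo {
    public static void main(String[] args) {
        // U+00A3 (POUND SIGN) encodes to two bytes in UTF-8,
        // matching the c2 a3 bytes that charToRaw showed above.
        byte[] utf8 = "\u00A3".getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8) {
            System.out.printf("%02x ", b & 0xFF);  // prints: c2 a3
        }
        System.out.println();
    }
}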

These two types of escape sequence cannot be mixed in the same string:

> '\ua3\xa3'
Error: mixing Unicode and octal/hex escapes in a string is not allowed

This is because the escape sequences also define the encoding of the string. A \uNN sequence explicitly sets the encoding of the entire string to "UTF-8", whereas \xNN leaves it in the default "unknown" (i.e., native) encoding:

> Encoding('\xa3')
[1] "unknown"
> Encoding('\ua3')
[1] "UTF-8"

This becomes important when printing strings, as they need to be converted into the appropriate output encoding (e.g., that of your console). Strings with a defined encoding can be converted appropriately (see enc2native), but those with an "unknown" encoding are simply output as-is:

  • On Linux, your console is probably expecting UTF-8 text, and as 0xA3 is not a valid UTF-8 sequence, it gives you "�".
  • On Windows, your console is probably expecting Windows-1252 text, and as 0xA3 is the correct encoding for "£", that's what you see. (When the string is \uA3, a conversion from UTF-8 to Windows-1252 takes place.) Both cases can be reproduced outside R, as sketched below.
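
For a cross-check, here is a minimal Java sketch that decodes the single byte 0xA3 the two ways described above (the class name is illustrative, and it assumes the JDK ships the windows-1252 charset, as standard builds do):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ConsoleEncodingDemo {
    public static void main(String[] args) {
        byte[] raw = { (byte) 0xA3 };

        // Decoded as Windows-1252 (what a typical Windows console expects), 0xA3 is "£".
        System.out.println(new String(raw, Charset.forName("windows-1252")));

        // Decoded as UTF-8 (what a typical Linux console expects), a lone 0xA3 byte is
        // malformed, so it is replaced with U+FFFD and shows up as "�".
        System.out.println(new String(raw, StandardCharsets.UTF_8));
    }
}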

If the encoding is set explicitly, the appropriate conversion will take place on Linux:

> s <- '\xa3'
> Encoding(s) <- 'latin1'
> cat(s)
£

What is the difference between using \u and \x when representing a character literal?

I would strongly recommend only using \u, as it's much less error-prone.

\x consumes between 1 and 4 characters, so long as they're hex digits, whereas \u must always be followed by exactly 4 hex digits. From the C# 5 specification, section 2.4.4.4, the grammar for \x is:

hexadecimal-escape-sequence:
  \x hex-digit hex-digit(opt) hex-digit(opt) hex-digit(opt)

So for example:

string good = "Tab\x9Good compiler";
string bad = "Tab\x9Bad compiler";

... look similar but are very different strings, as the latter is effectively "Tab" followed by U+9BAD followed by " compiler".

Personally I wish the C# language had never included \x, but there we go.

Note that there's also \U, which is always followed by 8 hex digits, primarily used for non-BMP characters.
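
For a sense of what \U buys you, compare Java, which has no 8-digit escape: a character outside the Basic Multilingual Plane, such as U+1F600, has to be written as a UTF-16 surrogate pair or built from its code point, whereas C#'s \U0001F600 spells it directly. A minimal Java sketch (class name illustrative):

public class NonBmpDemo {
    public static void main(String[] args) {
        // U+1F600 (GRINNING FACE) does not fit in four hex digits, so Java needs
        // either a surrogate pair or Character.toChars to express it.
        String fromEscape = "\uD83D\uDE00";
        String fromCodePoint = new String(Character.toChars(0x1F600));
        System.out.println(fromEscape.equals(fromCodePoint));  // true
    }
}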

There's one other big difference between \u and \x: the latter is only used in character and string literals, whereas \u can also be used in identifiers:

string x = "just a normal string";
Console.WriteLine(\u0078); // Still refers to the identifier x

Get the Unicode from a Hexadecimal

You can simply use System.out.printf or String.format to do what you want.

Example:

int decimal = 122;

System.out.printf("Hexadecimal: %X\n", decimal);
System.out.printf("Unicode: u%04X\n", decimal);
System.out.printf("Latin small letter: %c\n", (char)decimal);

Output:

Hexadecimal: 7A
Unicode: \u007A
Latin small letter: z
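
The String.format route mentioned above works the same way, just returning the string instead of printing it. A small sketch (variable names are illustrative):

int decimal = 122;

String hex = String.format("Hexadecimal: %X", decimal);
String unicode = String.format("Unicode: \\u%04X", decimal);

System.out.println(hex);      // Hexadecimal: 7A
System.out.println(unicode);  // Unicode: \u007A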

