What's the difference between hex code (\x) and unicode (\u) chars?
The escape sequence \xNN inserts the raw byte NN into a string, whereas \uNN inserts the UTF-8 bytes for the Unicode code point NN into a UTF-8 string:
> charToRaw('\xA3')
[1] a3
> charToRaw('\uA3')
[1] c2 a3
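The byte difference is not specific to R. As a rough cross-check, here is a minimal Java sketch that encodes the same code point, U+00A3, in Latin-1 and in UTF-8. The class name PoundBytes and the hex helper are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class PoundBytes {
    // Format a byte array as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String pound = "\u00A3"; // the single code point U+00A3
        // Latin-1 stores the code point as one byte.
        System.out.println(hex(pound.getBytes(StandardCharsets.ISO_8859_1))); // a3
        // UTF-8 needs two bytes for the same code point.
        System.out.println(hex(pound.getBytes(StandardCharsets.UTF_8)));      // c2 a3
    }
}
```

The first line matches what charToRaw reports for the hex escape, the second what it reports for the Unicode escape.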
These two types of escape sequence cannot be mixed in the same string:
> '\ua3\xa3'
Error: mixing Unicode and octal/hex escapes in a string is not allowed
This is because the escape sequences also define the encoding of the string. A \uNN sequence explicitly sets the encoding of the entire string to "UTF-8", whereas \xNN leaves it in the default "unknown" (a.k.a. native) encoding:
> Encoding('\xa3')
[1] "unknown"
> Encoding('\ua3')
[1] "UTF-8"
This becomes important when printing strings, as they need to be converted into the appropriate output encoding (e.g., that of your console). Strings with a defined encoding can be converted appropriately (see enc2native), but those with an "unknown" encoding are simply output as-is:
- On Linux, your console is probably expecting UTF-8 text, and as 0xA3 is not a valid UTF-8 sequence, it gives you "�".
- On Windows, your console is probably expecting Windows-1252 text, and as 0xA3 is the correct encoding for "£", that's what you see. (When the string is \uA3, a conversion from UTF-8 to Windows-1252 takes place.)
If the encoding is set explicitly, the appropriate conversion will take place on Linux:
> s <- '\xa3'
> Encoding(s) <- 'latin1'
> cat(s)
£
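The same effect can be reproduced outside R by decoding the lone byte 0xA3 under two different charsets. This is only a sketch in Java: the class name DecodeLoneByte is hypothetical, and it relies on the documented behaviour of the String(byte[], Charset) constructor, which substitutes the charset's replacement character for malformed input.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeLoneByte {
    // Decode the single byte 0xA3 under the given charset.
    static String decode(Charset cs) {
        byte[] raw = { (byte) 0xA3 };
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        // As UTF-8, 0xA3 is a stray continuation byte, so the decoder
        // substitutes U+FFFD, the replacement character.
        System.out.printf("UTF-8:   U+%04X%n", (int) decode(StandardCharsets.UTF_8).charAt(0));
        // As Latin-1, 0xA3 maps directly to U+00A3, the pound sign.
        System.out.printf("Latin-1: U+%04X%n", (int) decode(StandardCharsets.ISO_8859_1).charAt(0));
    }
}
```

Telling the decoder which charset the byte came from plays the same role as setting Encoding(s) to 'latin1' in R.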
What is the difference between using \u and \x while representing character literal
I would strongly recommend only using \u, as it's much less error-prone.
\x consumes 1-4 characters, so long as they're hex digits, whereas \u must always be followed by 4 hex digits. From the C# 5 specification, section 2.4.4.4, the grammar for \x:
hexadecimal-escape-sequence:
    \x hex-digit hex-digit(opt) hex-digit(opt) hex-digit(opt)
So for example:
string good = "Tab\x9Good compiler";
string bad = "Tab\x9Bad compiler";
... look similar but are very different strings, as the latter is effectively "Tab" followed by U+9BAD followed by " compiler".
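To see why the two literals diverge, it can help to mimic the greedy rule by hand. The following Java sketch is not the C# compiler's implementation. It is a hypothetical expand helper that applies the "one to four hex digits" rule from the grammar above to text containing a literal backslash-x.

```java
public class HexEscapeSketch {
    // Return the value of an ASCII hex digit, or -1 if c is not one.
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Hypothetical helper: expand each backslash-x escape, greedily
    // consuming between one and four hex digits after the 'x'.
    static String expand(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            if (s.charAt(i) == '\\' && i + 1 < s.length() && s.charAt(i + 1) == 'x') {
                int j = i + 2;
                int value = 0;
                int digits = 0;
                while (j < s.length() && digits < 4 && hexValue(s.charAt(j)) >= 0) {
                    value = value * 16 + hexValue(s.charAt(j));
                    j++;
                    digits++;
                }
                if (digits > 0) {
                    out.append((char) value);
                    i = j;
                    continue;
                }
            }
            // Not an escape (or no digits followed): copy the character through.
            out.append(s.charAt(i));
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String good = expand("Tab\\x9Good compiler");
        String bad = expand("Tab\\x9Bad compiler");
        System.out.printf("good[3] = U+%04X%n", (int) good.charAt(3)); // U+0009
        System.out.printf("bad[3]  = U+%04X%n", (int) bad.charAt(3));  // U+9BAD
    }
}
```

In the first string the scan stops at "G", which is not a hex digit, so only "9" is consumed. In the second, "B", "a" and "d" are all hex digits, so the escape swallows them.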
Personally I wish the C# language had never included \x, but there we go.
Note that there's also \U, which is always followed by 8 hex digits, primarily used for non-BMP characters.
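The reason a longer escape is needed is that a non-BMP code point does not fit in a single UTF-16 code unit. A small Java sketch illustrates this. The code point U+1F600 is just an arbitrary example, and the class name NonBmpSketch is made up.

```java
public class NonBmpSketch {
    // Build a string from a single code point, which may lie outside the BMP.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x1F600);
        System.out.println(s.length());                      // 2 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        // The two code units form a surrogate pair.
        System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1)); // D83D DE00
    }
}
```

A four-digit escape can only name one of those two code units at a time, whereas the eight-digit form names the code point directly.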
There's one other big difference between \u and \x: the latter is only used in character and string literals, whereas \u can also be used in identifiers:
string x = "just a normal string";
Console.WriteLine(\u0078); // Still refers to the identifier x
Get the Unicode from a Hexadecimal
You can simply use System.out.printf or String.format to do what you want.
Example:
int decimal = 122;
System.out.printf("Hexadecimal: %X\n", decimal);
System.out.printf("Unicode: u%04X\n", decimal);
System.out.printf("Latin small letter: %c\n", (char)decimal);
Output:
Hexadecimal: 7A
Unicode: u007A
Latin small letter: z
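If you start from a hexadecimal string rather than an int, one possible approach is to parse it first and then convert the resulting code point. This is a minimal sketch; the class name HexToChar and the fromHex helper are hypothetical, and it assumes the input is a valid hex number naming a valid code point.

```java
public class HexToChar {
    // Parse a hexadecimal string such as "7A" and return the text
    // for that Unicode code point.
    static String fromHex(String hex) {
        int codePoint = Integer.parseInt(hex, 16);
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(fromHex("7A")); // z
    }
}
```

Integer.parseInt throws NumberFormatException for non-hex input, and Character.toChars throws IllegalArgumentException for values that are not valid code points, so real code may want to handle both.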