How to Get a Unicode Character's Code

Get the Unicode value of a character

You can do it for any Java char with this one-liner:

System.out.println( "\\u" + Integer.toHexString('÷' | 0x10000).substring(1) );
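In case the trick isn't obvious: OR-ing with 0x10000 forces a fifth hex digit, so the substring(1) always leaves exactly four zero-padded digits. Broken into steps (the variable names here are mine):

char c = '÷';                                   // U+00F7
String hex = Integer.toHexString(c | 0x10000);  // "100f7" – always 5 digits
System.out.println("\\u" + hex.substring(1));   // prints \u00f7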

But it's only going to work for Unicode characters up to Unicode 3.0, which is why I was careful to say it works for any Java char.

That's because Java was designed well before Unicode 3.1 came out, and so Java's char primitive is inadequate to represent Unicode 3.1 and up: there's no longer a "one Unicode character to one Java char" mapping (instead a monstrous hack, surrogate pairs, is used).

So you really have to check your requirements here: do you need to support Java char or any possible Unicode character?
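If you do need arbitrary code points, String.codePointAt (rather than charAt) sees through the hack. A minimal sketch, using a code point that appears later on this page:

String s = "\uD84C\uDDF4";                    // U+231F4, stored as a surrogate pair
System.out.println(s.length());               // 2 – two UTF-16 code units
System.out.printf("%X%n", (int) s.charAt(0)); // D84C – just the high surrogate
System.out.printf("%X%n", s.codePointAt(0));  // 231F4 – the real code point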

How can I get a Unicode character's code?

Just convert it to int:

char registered = '®';
int code = (int) registered;

In fact there's an implicit conversion from char to int, so you don't have to write the cast explicitly as I've done above, but in this case I would, to make it obvious what you're trying to do.

This will give the UTF-16 code unit - which is the same as the Unicode code point for any character defined in the Basic Multilingual Plane. (And only BMP characters can be represented as char values in Java.) As Andrzej Doyle's answer says, if you want the Unicode code point from an arbitrary string, use Character.codePointAt().

Once you've got the UTF-16 code unit or Unicode code points, both of which are integers, it's up to you what you do with them. If you want a string representation, you need to decide exactly what kind of representation you want. (For example, if you know the value will always be in the BMP, you might want a fixed 4-digit hex representation prefixed with U+, e.g. "U+0020" for space.) That's beyond the scope of this question though, as we don't know what the requirements are.
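For instance, a minimal sketch of that kind of formatting (the string and format here are just for illustration):

int cp = "€".codePointAt(0);                     // 0x20AC, a BMP character
System.out.println(String.format("U+%04X", cp)); // U+20AC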

How do I get a Unicode character from an id in a variable?

[ The OP said "I need to get it up to U+231F4", and I answered that. But what they meant was that they wanted to print the 143,859 Code Points defined by Unicode. See the other answer. I can't delete this now that it's been accepted. ]

Java strings aren't made of Unicode Code Points but of UTF-16 code units. You need to use surrogate pairs for Unicode Code Points above U+FFFF. For example,

U+0000   ⇒  0x0000          ⎫
U+0001   ⇒  0x0001          ⎪
   ⋮                        ⎬  Characters in the BMP result
U+D7FE   ⇒  0xD7FE          ⎪  in a single UTF-16 code unit.
U+D7FF   ⇒  0xD7FF          ⎭

U+D800   ⇒  ------          ⎫
U+D801   ⇒  ------          ⎪
   ⋮                        ⎬  Can't be encoded using UTF-16.
U+DFFE   ⇒  ------          ⎪  Illegal for interchange for this reason.
U+DFFF   ⇒  ------          ⎭

U+E000   ⇒  0xE000          ⎫
U+E001   ⇒  0xE001          ⎪
   ⋮                        ⎬  Characters in the BMP result
U+FFFE   ⇒  0xFFFE          ⎪  in a single UTF-16 code unit.
U+FFFF   ⇒  0xFFFF          ⎭

U+10000  ⇒  0xD800, 0xDC00  ⎫
U+10001  ⇒  0xD800, 0xDC01  ⎪
   ⋮                        ⎬  Code Points outside the BMP result in two.
U+231F2  ⇒  0xD84C, 0xDDF2  ⎪
U+231F3  ⇒  0xD84C, 0xDDF3  ⎭

U+231F4  ⇒  0xD84C, 0xDDF4  ⎫
U+231F5  ⇒  0xD84C, 0xDDF5  ⎪
   ⋮                        ⎬  We don't care about these.
U+10FFFE ⇒  0xDBFF, 0xDFFE  ⎪
U+10FFFF ⇒  0xDBFF, 0xDFFF  ⎭

For the details on surrogate pairs, you can consult the Wikipedia page for UTF-16.
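If you're curious, the arithmetic is simple enough to sketch yourself; this reproduces the U+231F4 row from the table above:

int cp = 0x231F4;
int offset = cp - 0x10000;                    // 20-bit offset into the supplementary planes
char hi = (char) (0xD800 + (offset >> 10));   // high surrogate: top 10 bits
char lo = (char) (0xDC00 + (offset & 0x3FF)); // low surrogate: bottom 10 bits
System.out.printf("0x%X, 0x%X%n", (int) hi, (int) lo); // 0xD84C, 0xDDF4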

Solution 1: printf %c

These details don't matter in practice, because we can use printf's %c to encode a Unicode Code Point into UTF-16 code units for us. (Kudos to @VGR.)

for (int cp = 0; cp < 0x231F4; ++cp) {
    if (cp < 0xD800 || cp >= 0xE000) {
        System.out.printf("%c%n", cp);
    }
}

Optimized:

for (int cp = 0; cp < 0xD800; ++cp) {
    System.out.println((char) cp);
}

for (int cp = 0xE000; cp < 0x10000; ++cp) {
    System.out.println((char) cp);
}

for (int cp = 0x10000; cp < 0x231F4; ++cp) {
    System.out.printf("%c%n", cp);
}

Solution 2: Character.toChars

Alternatively, we can use Character.toChars(codePoint) to produce a char[] containing the UTF-16 code units of a Unicode Code Point.

for (int cp = 0; cp < 0x231F4; ++cp) {
    if (cp < 0xD800 || cp >= 0xE000) {
        System.out.println(Character.toChars(cp));
    }
}

Optimized:

for (int cp = 0; cp < 0xD800; ++cp) {
    System.out.println((char) cp);
}

for (int cp = 0xE000; cp < 0x10000; ++cp) {
    System.out.println((char) cp);
}

for (int cp = 0x10000; cp < 0x231F4; ++cp) {
    System.out.println(Character.toChars(cp));
}

I believe the above still creates a lot of arrays. Implementing the conversion yourself avoids that and should thus be even faster.

// Up to but excluding U+231F4 ⇒ 0xD84C, 0xDDF4

for (int cp = 0; cp < 0xD800; ++cp) {
    System.out.println((char) cp);
}

for (int cp = 0xE000; cp < 0x10000; ++cp) {
    System.out.println((char) cp);
}

char[] pair = new char[2];
for (int hisurro = 0xD800; hisurro < 0xD84C; ++hisurro) {
    pair[0] = (char) hisurro;
    for (int losurro = 0xDC00; losurro < 0xE000; ++losurro) {
        pair[1] = (char) losurro;
        System.out.println(pair);
    }
}

pair[0] = (char) 0xD84C;
for (int losurro = 0xDC00; losurro < 0xDDF4; ++losurro) {
    pair[1] = (char) losurro;
    System.out.println(pair);
}

Note that the result is not going to be entirely readable in your terminal. The output includes non-printable characters (e.g. control characters), marks (which combine with other characters), unassigned Code Points, private use Code Points, etc.

How to get the Unicode code point for a character in JavaScript?

JavaScript strings have a codePointAt method that gives you the integer representing the Unicode code point value. You then need the base-16 (hexadecimal) representation of that number if you wish to format it as a four-hexadecimal-digit sequence (as in Nikolay Spasov's answer).

var hex = "▄".codePointAt(0).toString(16);
var result = "\\u" + "0000".substring(0, 4 - hex.length) + hex;

However, it would probably be easier to check directly whether your key's code point matches the expected code point:

oEvent.key.codePointAt(0) === '▄'.codePointAt(0);

Note that "symbol equality" can actually be trickier: some symbols are defined by surrogate pairs (you can see them as the combination of two halves, each written as a four-hexadecimal-digit sequence).

For this reason, I would recommend using a specialized library.

You'll find more details in the very relevant article by Mathias Bynens.
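To make the surrogate-pair caveat concrete (U+1D306 here is just an example of a non-BMP symbol):

var s = "\u{1D306}";                        // one symbol, two UTF-16 code units
console.log(s.length);                      // 2
console.log(s.charCodeAt(0).toString(16));  // "d834" – only the high surrogate
console.log(s.codePointAt(0).toString(16)); // "1d306" – the full code point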

How can I get the Unicode value of a character in Go?

Go strings are UTF-8 encoded, so to decode a character from a string and get the rune (Unicode code point), you can use the unicode/utf8 package.

Example:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    str := "AÅÄÖ"

    for len(str) > 0 {
        r, size := utf8.DecodeRuneInString(str)
        fmt.Printf("%d %v\n", r, size)

        str = str[size:]
    }
}

Result:

65 1
197 2
196 2
214 2
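Note that a plain range loop over a string also decodes runes for you; a minimal equivalent of the loop above (without the sizes):

for _, r := range "AÅÄÖ" {
    fmt.Printf("%d\n", r) // 65, 197, 196, 214
}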

Edit: (To clarify Michael's supplement)

A character such as Ä may be created using different Unicode code point sequences:

Precomposed: Ä (U+00C4)

Using combining diaeresis: A (U+0041) + ¨ (U+0308)

In order to get the precomposed form, one can use the normalization package, golang.org/x/text/unicode/norm. The NFC (Canonical Decomposition, followed by Canonical Composition) form will turn U+0041 + U+0308 into U+00C4:

// import "golang.org/x/text/unicode/norm"
c := "\u0041\u0308"
r, _ := utf8.DecodeRune(norm.NFC.Bytes([]byte(c)))
fmt.Printf("%+q", r) // '\u00c4'

How to get the Unicode character from a code point variable?

All you need is a \ before u05e2. The \u escape sequence in the string literal is what produces the Unicode character:

a = '\u05e2'
print(u'{}'.format(a))

#Output
ע

When you try the other approach, adding the \ inside the print() call instead, it doesn't work: Python only processes \u escapes inside string literals at parse time, so assembling the backslash and u05e2 at runtime just produces the six literal characters \u05e2.

a = 'u05e2'
print(u'\{}'.format(a))

#Output
\u05e2

A way to verify the validity of a Unicode escape is the ord() built-in function in the Python standard library. It returns the Unicode code point (an integer) of the character passed to it, and it expects a string representing exactly one Unicode character.

a = '\u05e2'
print(ord(a)) #1506, the Unicode code point for the Unicode string stored in a

To print the Unicode character for the above Unicode code point (1506), use the character presentation type c. This is explained in the Python docs.

print('{0:c}'.format(1506))

#Output
ע
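Equivalently, the chr() built-in goes straight from the code point back to the character:

print(chr(1506))

#Output
ע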

If we pass the plain string literal to ord(), we get an error, because a five-character string does not represent a single Unicode character.

a = 'u05e2'
print(ord(a))

#Error
TypeError: ord() expected a character, but string of length 5 found

C get unicode code point for character

First, there are a few corrections to make in your code:

#include <stdio.h>

int main()
{
    const char *a = "ā";
    int n = 0; // Initialize n with zero.
    while (a[n] != '\0')
    {
        // Cast to unsigned char so the byte isn't sign-extended.
        printf("%x", (unsigned char)a[n]);
        n += 1;
    }
    // \u will not work in a printf format string.
    // To print a hexadecimal value, use the %X conversion.
    printf("\n%X\n", 0xC481);
    return 0;
}

Here, you are printing the hex value of each byte. That will not be the Unicode value for characters beyond 0xFF.

unsigned short is a commonly used type for storing Unicode values, although it cannot hold all the code points. If you need to store any Unicode code point as-is, use a 32-bit integer type (for example uint_least32_t or char32_t, since plain int is not guaranteed to be 32 bits).

The Unicode value of a character is its numeric value when represented in UTF-32. Otherwise, you will have to compute it from the byte sequence if the encoding is UTF-8 or UTF-16.
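For example, a minimal sketch of that computation for UTF-8, assuming the source file is saved as UTF-8 and handling only the two-byte case that "ā" falls into:

#include <stdio.h>

int main()
{
    // "ā" (U+0101) is encoded in UTF-8 as the two bytes 0xC4 0x81.
    const unsigned char *s = (const unsigned char *)"ā";

    // Two-byte sequences have the form 110xxxxx 10xxxxxx: take 5 bits
    // from the lead byte and 6 bits from the continuation byte.
    unsigned int cp = ((s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu);

    printf("U+%04X\n", cp); // prints U+0101
    return 0;
}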


