Creating Unicode Character from Its Number

Creating Unicode character from its number

Just cast your int to a char. You can convert that to a String using Character.toString():

String s = Character.toString((char)c);

EDIT:

Just remember that the escape sequences in Java source code (the \u bits) are in HEX, so if you're trying to reproduce an escape sequence, you'll need something like int c = 0x2202.
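
For example, a complete round trip (0x2202 is U+2202, the partial differential sign ∂):

int c = 0x2202;                          // hex, matching the \u2202 escape
String s = Character.toString((char) c); // cast the int to a char, then to a String
System.out.println(s);                   // prints: ∂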

How can I get a Unicode character from a number?

Use wchar_t instead

for (wchar_t i = 0; i < 500; i++) {
    wcout << i;
}

You can also use char16_t and char32_t if you're using C++11 or newer.

However, you still need a capable terminal and to set the correct codepage to get the expected output. On Linux it's quite straightforward, but if you're using (an older) Windows it's much trickier. See Output unicode strings in Windows console app

How to put Unicode char in Java String?

The UTF-16 encoding of your character U+1F604 is 0xD83D 0xDE04, so it should be:

String s = "\uD83D\uDE04";

How to generate all possible unicode characters?

There may be easier ways to do this, but here goes. The Unicode package contains everything you need.

First we can get a list of unicode scripts and the block ranges:

library(Unicode)  

uranges <- u_scripts()

Check what we've got:

head(uranges, 3)

$Adlam
[1] U+1E900..U+1E943 U+1E944..U+1E94A U+1E94B U+1E950..U+1E959 U+1E95E..U+1E95F

$Ahom
[1] U+11700..U+1171A U+1171D..U+1171F U+11720..U+11721 U+11722..U+11725 U+11726 U+11727..U+1172B U+11730..U+11739 U+1173A..U+1173B U+1173C..U+1173E U+1173F
[11] U+11740..U+11746

$Anatolian_Hieroglyphs
[1] U+14400..U+14646

Next we can convert the ranges into their sequences.

expand_uranges <- lapply(uranges, as.u_char_seq)

To get a single vector of all characters we can unlist it. This won't be easy to work with so really it would be better to keep them as a list:

all_unicode_chars <- unlist(expand_uranges)

# The Wikipedia page linked states there are 144,697 characters
length(all_unicode_chars)
[1] 144762

So that seems to be all of them, and the page needs updating. They are stored as integers, so to print them (assuming the glyph is supported) we can, for example, print the Japanese katakana:

intToUtf8(expand_uranges$Katakana[[1]])

[1] "ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ"

How can I get a Unicode character's code?

Just convert it to int:

char registered = '®';
int code = (int) registered;

In fact there's an implicit conversion from char to int so you don't have to specify it explicitly as I've done above, but I would do so in this case to make it obvious what you're trying to do.

This will give the UTF-16 code unit - which is the same as the Unicode code point for any character defined in the Basic Multilingual Plane. (And only BMP characters can be represented as char values in Java.) As Andrzej Doyle's answer says, if you want the Unicode code point from an arbitrary string, use Character.codePointAt().

Once you've got the UTF-16 code unit or Unicode code points, both of which are integers, it's up to you what you do with them. If you want a string representation, you need to decide exactly what kind of representation you want. (For example, if you know the value will always be in the BMP, you might want a fixed 4-digit hex representation prefixed with U+, e.g. "U+0020" for space.) That's beyond the scope of this question though, as we don't know what the requirements are.
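
For instance, a minimal sketch of that fixed 4-digit representation:

char space = ' ';
int codeUnit = space;                                   // implicit char-to-int conversion
System.out.println(String.format("U+%04X", codeUnit)); // prints: U+0020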

Integer Object (Unicode) to Character Object

The problem is that a Character object represents a char; i.e. a number in the range 0 through 0xffff. Unicode code-points range up to U+10FFFF, and many cannot be represented as a single char value.

So this gives you a problem:

  • If the code-points that you want to represent are all between U+0000 and U+FFFF, then you can represent them as Character values.

  • If any are U+10000 or larger, then it won't work.

So, if you have an int that represents a Unicode code-point, you need to do something like this:

int value = ...

if (Character.isDefined(value)) {
    if (value <= 0xffff) {
        return Character.valueOf((char) value);
    } else {
        // code point not representable as a `Character`
    }
} else {
    // Not a valid code-point at all
}

Note:

  1. int values that are not valid code points include negative values, values greater than 0x10ffff, and the low and high surrogate code units.
  2. A number of commonly used Unicode code-points are greater than U+10000. For example, the code-points for Emojis! This means that using Character is a bad idea. It would be better to use either a String, a char[] or an Integer; see the sketch below.
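
A minimal sketch of the String alternative: Character.toChars() converts any valid code point into the right number of chars (one for the BMP, a surrogate pair otherwise):

int value = 0x1F604;                     // an Emoji, above U+FFFF
char[] units = Character.toChars(value); // length 2: a surrogate pair
String s = new String(units);            // works for any valid code point
System.out.println(s);                   // prints: 😄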


It seems to work so far.

I guess you haven't tried @Shawn's approach with an Emoji yet.

Is there a way around using a downcast?

No.

if (i == 0)
    throw new NullPointerException();

That is just wrong:

  1. Zero is a valid code-point.

  2. Even if it wasn't valid, it is NOT a null. So throwing NullPointerException is totally inappropriate.

  3. If you are concerned about the case where i is null, don't worry. Any operation that unboxes i will automatically throw NullPointerException if it is null. Just let it happen ...
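
To illustrate the last point, the unboxing itself already produces the exception:

Integer i = null;
int value = i;   // unboxing null throws NullPointerException automatically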

Java: Convert String \uFFFF into char

char c = "\uFFFF".toCharArray()[0];

The escape is interpreted directly in the string literal, so the whole sequence is realized as a single character.

Another way, if you are going to hard-code the value:

char c = '\uFFFF';

Note that \uFFFF isn't a proper Unicode character (U+FFFF is reserved as a noncharacter), but try with \u041f for example.

Read about Unicode escapes here

Get unicode value of a character

You can do it for any Java char using the one-liner here:

System.out.println( "\\u" + Integer.toHexString('÷' | 0x10000).substring(1) );

But it's only going to work for Unicode characters up to Unicode 3.0, which is why I specified that you could do it for any Java char.

Java was designed way before Unicode 3.1 came out, and hence Java's char primitive is inadequate to represent Unicode 3.1 and up: there's no longer a "one Unicode character to one Java char" mapping (instead, a monstrous hack is used).

So you really have to check your requirements here: do you need to support Java char or any possible Unicode character?
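
To make that concrete, here's a minimal sketch showing how a character above U+FFFF spans two Java chars (the "hack" in question: surrogate pairs), and how codePointAt() recovers the real code point:

String emoji = new String(Character.toChars(0x1F604));         // one character above U+FFFF
System.out.println(emoji.length());                            // prints: 2 (a surrogate pair)
System.out.println(Integer.toHexString(emoji.charAt(0)));      // d83d (just the high surrogate)
System.out.println(Integer.toHexString(emoji.codePointAt(0))); // 1f604 (the real code point)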


