Print Unicode Character String in R

Specify Unicode characters programmatically in R


library(stringi)
stri_unescape_unicode(paste0("\\u","00c3"))
#[1] "Ã"
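If you would rather not depend on stringi, a base R sketch of the same idea is to parse the hex digits with strtoi and pass the resulting code point to intToUtf8. The variable names here are illustrative only.

```r
# Hex digits of the code point as a plain string (illustrative input)
hex <- "00c3"

# strtoi() parses the string in base 16; intToUtf8() turns the
# resulting code point into a one-character string
ch <- intToUtf8(strtoi(hex, base = 16L))
print(ch)
```

This only handles a single code point per call; for a vector of hex strings you would need to loop or use intToUtf8(..., multiple = TRUE).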

You may also want to check out this function.

How to generate all possible unicode characters?

There may be easier ways to do this, but here goes. The Unicode package contains everything you need.

First we can get a list of Unicode scripts and the code point ranges that belong to each:

library(Unicode)  

uranges <- u_scripts()

Check what we've got:

head(uranges, 3)

$Adlam
[1] U+1E900..U+1E943 U+1E944..U+1E94A U+1E94B U+1E950..U+1E959 U+1E95E..U+1E95F

$Ahom
[1] U+11700..U+1171A U+1171D..U+1171F U+11720..U+11721 U+11722..U+11725 U+11726 U+11727..U+1172B U+11730..U+11739 U+1173A..U+1173B U+1173C..U+1173E U+1173F
[11] U+11740..U+11746

$Anatolian_Hieroglyphs
[1] U+14400..U+14646

Next we can convert the ranges into their sequences.

expand_uranges <- lapply(uranges, as.u_char_seq)

To get a single vector of all characters we can unlist it. This won't be easy to work with so really it would be better to keep them as a list:

all_unicode_chars <- unlist(expand_uranges)

# The Wikipedia page linked states there are 144,697 characters
length(all_unicode_chars)
[1] 144762

So this appears to cover all of them, and the page may need updating. They are stored as integers, so to print them (assuming your font supports the glyphs) we can use intToUtf8, for example to print the Japanese katakana:

intToUtf8(expand_uranges$Katakana[[1]])

[1] "ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ"
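If you want one element per character rather than a single string, intToUtf8 takes a multiple argument. A minimal sketch, assuming the first Katakana range is U+30A1..U+30FA (which matches the output above) and typing the range by hand so the Unicode package is not needed:

```r
# Code points of the first Katakana range, typed by hand (assumption:
# U+30A1..U+30FA, as in the printed string above)
katakana_codes <- 0x30A1:0x30FA

# multiple = TRUE returns one string per code point instead of one long string
katakana_chars <- intToUtf8(katakana_codes, multiple = TRUE)

length(katakana_chars)
head(katakana_chars, 5)
```

Keeping the characters as a vector makes it easier to subset or filter them before printing.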

Display unicode in R

You could use

as.character(parse(text=shQuote(gsub("<U\\+([A-Z0-9]+)>", "\\\\u\\1", "<U+9577><U+6D32>"))))

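The parse() trick works, but it evaluates text as R code. A sketch of an alternative that avoids parse(), assuming the input only contains <U+XXXX> tokens: pull out the hex digits with a regular expression and convert them with strtoi and intToUtf8. The variable names are illustrative.

```r
x <- "<U+9577><U+6D32>"

# Extract the hex digits between "<U+" and ">" (perl = TRUE enables lookarounds)
hex <- regmatches(x, gregexpr("(?<=<U\\+)[0-9A-Fa-f]+(?=>)", x, perl = TRUE))[[1]]

# Convert each hex string to a code point, then build one UTF-8 string
decoded <- intToUtf8(strtoi(hex, base = 16L))
print(decoded)
```

Note that this drops any ordinary text sitting between the tokens, so it is only a fit when the string consists of escapes alone.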

How to Convert String contains Unicode Shortcut into TRUE Unicode character in R?

There are various ways to do this. Perhaps the easiest is to convert the hexadecimal part of your string to an integer and use intToUtf8 from base R:

mystr <- c("\\U0001F48C", "\\U0001F48D")
mystr
#> [1] "\\U0001F48C" "\\U0001F48D"

mystr <- unlist(lapply(as.list(gsub("\\\\U", "0x", mystr)), intToUtf8))
mystr
#> [1] "\U0001f48c" "\U0001f48d"

This is probably best wrapped in a small utility function:

unescape <- function(x) unlist(lapply(as.list(gsub("\\\\U", "0x", x)), intToUtf8))
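A slightly more defensive variant is sketched below. It is not the function above, just one way to write it: it accepts both the \u and \U prefixes, parses the hex part explicitly with strtoi, and uses vapply so the result is always a character vector. The name unescape_hex is made up for this example.

```r
# Hypothetical helper: each element of x is expected to be a single
# escape such as "\\U0001F48C" or "\\u00E0" (a literal backslash, then u/U, then hex)
unescape_hex <- function(x) {
  hex <- sub("^\\\\[uU]", "", x)      # strip the leading \u or \U
  codes <- strtoi(hex, base = 16L)     # hex string -> integer code point
  vapply(codes, intToUtf8, character(1))
}

res <- unescape_hex(c("\\U0001F48C", "\\u00E0"))
print(res)
```

Whether the glyphs actually display still depends on the console and font, as with the original version.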

D Unicode string literals: can't print specific Unicode character

I confirmed it works on my Windows box, so gonna type this up as an answer now.

In the source code, if you copy/paste the characters directly, make sure your editor is saving it in utf8 encoding. The D compiler insists on it, so if it gives a compile error about a utf thing, that's probably why. I have never used c:b but an old answer on the web said edit->encodings... it is a setting somewhere in the editor regardless.

Or, you can replace the characters in your source code with \uxxxx in the strings. Do NOT use the hexstring thing, that is for binary bytes, but your example of "\u00E0" is good, and will work for any type of string (not just wstring like in your example).

Then, on the output side, it depends on your target because the program just outputs bytes, and it is up to the recipient program to interpret them correctly. Since you said you are on Windows, the key is to set the console code page to UTF-8 so it knows what you are trying to do. The same C function can be called from D too, leading to this program:

import core.sys.windows.windows;
import std.stdio;

void main() {
    // 65001 is the Windows code page identifier for UTF-8
    SetConsoleOutputCP(65001);
    writeln("Hi \u00E0");
}

printing it successfully. On older Windows versions, you might need to change your font to see the character too (as opposed to the generic box it shows because some fonts don't have all the characters), but on my Windows 10 box, it just worked with the default font.

BTW, technically the console code page is a shared setting (after the program runs and exits, you can still hit properties on your console window and see the change reflected there) and you should perhaps set it back when your program exits. You could get the current value at startup with the get function ( https://docs.microsoft.com/en-us/windows/console/getconsoleoutputcp ), store it in a local var, and set it back on exit. You could write auto ccp = GetConsoleOutputCP(); SetConsoleOutputCP(65001); scope(exit) SetConsoleOutputCP(ccp); right at startup - the scope(exit) will run when the function exits, so doing it in main would be kinda convenient. Just add some error checking if you want.

The Microsoft docs don't say anything about setting it back, so it probably doesn't actually matter, but still I wanna mention it just in case. Knowing that the setting is shared and persists can also help in debugging - if the program still works after you comment the call out, it isn't because the code is unnecessary, it is just because the code page was set previously and not unset yet!

Note that running it from an IDE might not be exactly the same, because IDEs often pipe the output instead of running it right out to the Windows console. If that happens, lemme know and we can type up some stuff about that for future readers too. But you can also open your own copy of the console (run the program outside the IDE) and it should show correctly for you.


