Unicode Characters Conversion in R

Convert unicode to readable characters in R

You could do something like this:

library(stringi)

string <- "<U+1042><U+1040><U+1042><U+1040> <U+1019><U+103D><U+102C>\n\n<U+1010><U+102D><U+102F><U+1004><U+1039><U+1038><U+103B><U+1015><U+100A><U+1039><U+1000><U+102D><U+102F><U+101C><U+1032> <U+1000><U+102C><U+1000><U+103C>"

cat(stri_unescape_unicode(gsub("<U\\+(....)>", "\\\\u\\1", string)))

Which results in:

၂၀၂၀ မွာ

တိုင္းျပည္ကိုလဲ ကာကြ
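
If you'd rather avoid stringi, here is a base-R sketch of the same idea (my own variant, not from the original answer): pull out each <U+xxxx> token, convert the hex digits with strtoi(), and substitute the characters produced by intToUtf8().

# Base-R sketch: replace each <U+xxxx> token with the character it names
unescape_u <- function(x) {
  m <- gregexpr("<U\\+[0-9A-Fa-f]{4,6}>", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    code_points <- strtoi(gsub("[<U+>]", "", tokens), base = 16L)
    intToUtf8(code_points, multiple = TRUE)
  })
  x
}

cat(unescape_u(string))

This should print the same Burmese text as the stringi version above.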

Can I convert Unicode into plain text in R?

When you type "<U+043C>" it is interpreted as a literal string of 8 characters. Whether that string is encoded as Latin-1 or UTF-8 doesn't matter, since both encodings represent these 8 ASCII characters the same way.
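
You can check this directly: the string really is 8 plain ASCII characters.

nchar("<U+043C>")
#> [1] 8
utf8ToInt("<U+043C>")
#> [1] 60 85 43 48 52 51 67 62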

What you need to do is unescape the unicode strings. The stringi package can do this for you, but you need to do a bit of conversion first to get it in the right format. The following function should take care of it:


f <- function(x) {
  x <- gsub(">", "", gsub("<U\\+", "\\\\u", x))
  stringi::stri_unescape_unicode(x)
}

So you can do:

example <- c("<U+041F><U+043E><U+043A><U+0430><U+0437><U+044B>: 58025")
www <- c("<U+043C>")

f(example)
#> [1] "Показы: 58025"

f(www)
#> [1] "м"

How to generate all possible unicode characters?

There may be easier ways to do this, but here goes. The Unicode package contains everything you need.

First we can get a list of unicode scripts and the block ranges:

library(Unicode)  

uranges <- u_scripts()

Check what we've got:

head(uranges, 3)

$Adlam
[1] U+1E900..U+1E943 U+1E944..U+1E94A U+1E94B U+1E950..U+1E959 U+1E95E..U+1E95F

$Ahom
[1] U+11700..U+1171A U+1171D..U+1171F U+11720..U+11721 U+11722..U+11725 U+11726 U+11727..U+1172B U+11730..U+11739 U+1173A..U+1173B U+1173C..U+1173E U+1173F
[11] U+11740..U+11746

$Anatolian_Hieroglyphs
[1] U+14400..U+14646

Next we can convert the ranges into their sequences.

expand_uranges <- lapply(uranges, as.u_char_seq)

To get a single vector of all characters we can unlist it. This won't be easy to work with, so really it would be better to keep them as a list, but it does let us count them:

all_unicode_chars <- unlist(expand_uranges)

# The Wikipedia page linked states there are 144,697 characters
length(all_unicode_chars)
[1] 144762

So that seems to be all of them, and the Wikipedia page needs updating. The characters are stored as integer code points, so to print them (assuming the glyphs are supported by your font) we can do, for example, the Japanese katakana:

intToUtf8(expand_uranges$Katakana[[1]])

[1] "ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ"

Converting from NCR to Unicode in R

A numeric character reference such as &#1123; gives the code point in decimal: 1123 is the decimal equivalent of the hexadecimal 0463, and Unicode escapes use hexadecimal. So to get a conversion, you need to strip out the non-digit characters, convert the resulting number to hex, stick a "\u" in front of it, then use stri_unescape_unicode.
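
Walking through those steps for a single reference such as "&#1123;":

x <- "&#1123;"
gsub("\\D", "", x)                          # "1123"  (digits only)
format(as.hexmode(1123))                    # "463"   (decimal -> hex)
stringi::stri_pad_left("463", 4, "0")       # "0463"  (pad to 4 digits)
stringi::stri_unescape_unicode("\\u0463")   # "ѣ"     (unescape)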

This function will do all that:

ncr2uni <- function(x) {
  # Strip out non-digits and convert the remaining number to hex
  x <- as.hexmode(as.numeric(gsub("\\D", "", x)))

  # Left pad with zeros to length 4 so the escape sequence is recognised as Unicode
  x <- stringi::stri_pad_left(x, 4, "0")

  # Convert to Unicode
  stringi::stri_unescape_unicode(paste0("\\u", x))
}

Now you can do

ncr2uni(c("&#1123;", "&#1124;", "&#1125;"))
# [1] "ѣ" "Ѥ" "ѥ"

Specify Unicode characters programmatically in R

library(stringi)
stri_unescape_unicode(paste0("\\u","00c3"))
#[1] "Ã"

You may also want to check out this function.
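
A base-R equivalent (my own note, not from the original answer) is to convert the hex string to an integer and pass it to intToUtf8():

intToUtf8(strtoi("00c3", base = 16L))
#[1] "Ã"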


