Convert Unicode to Readable Characters in R

You could do something like this:

library(stringi)

string <- "<U+1042><U+1040><U+1042><U+1040> <U+1019><U+103D><U+102C>\n\n<U+1010><U+102D><U+102F><U+1004><U+1039><U+1038><U+103B><U+1015><U+100A><U+1039><U+1000><U+102D><U+102F><U+101C><U+1032> <U+1000><U+102C><U+1000><U+103C>"

cat(stri_unescape_unicode(gsub("<U\\+(....)>", "\\\\u\\1", string)))

Which results in:

၂၀၂၀ မွာ

တိုင္းျပည္ကိုလဲ ကာကြ
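Note that the `<U\+(....)>` pattern assumes exactly four hex digits, which covers the Myanmar text above but would miss supplementary-plane codepoints such as `<U+1F600>`. A slightly more general variant (the `unescape_u` name is just illustrative) can use the brace form of the hex escape, assuming `stri_unescape_unicode()` accepts `\x{...}` escapes as its documentation describes:

```r
library(stringi)

# Convert <U+XXXX> tokens with any number of hex digits into \x{...}
# escapes, then let stringi unescape them; a sketch, not battle-tested.
unescape_u <- function(x) {
  stri_unescape_unicode(gsub("<U\\+([0-9A-Fa-f]+)>", "\\\\x{\\1}", x))
}

unescape_u("<U+0041><U+0042>")  # "AB"
unescape_u("<U+1F600>")         # U+1F600, a single emoji character
```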

Can I convert Unicode into plain text in R?

When you type "<U+043C>", it is interpreted as a literal string of 8 characters. Whether that string is encoded as Latin-1 or UTF-8 doesn't matter, since both encode these 8 ASCII characters the same way.
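You can verify that nothing but those literal characters is stored:

```r
x <- "<U+043C>"
nchar(x)  # 8 -- just the characters <, U, +, 0, 4, 3, C, >
```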

What you need to do is unescape the Unicode escape sequences. The stringi package can do this for you, but you first need a bit of conversion to get the input into the right format. The following function should take care of it:


f <- function(x) {
  x <- gsub(">", "", gsub("<U\\+", "\\\\u", x))
  stringi::stri_unescape_unicode(x)
}

So you can do:

example <- c("<U+041F><U+043E><U+043A><U+0430><U+0437><U+044B>: 58025")
www <- c("<U+043C>")

f(example)
#> [1] "Показы: 58025"

f(www)
#> [1] "м"

Decoding to Chinese characters in R

This code will convert the string to the appropriate Chinese characters:

library(stringi)
string <- '<U+5ECA><U+574A><U+5E02>'
cat(stri_unescape_unicode(gsub("<U\\+(....)>", "\\\\u\\1", string)))
# Output: 廊坊市

Source: Convert unicode to readable characters in R

Convert unicode to a readable string

When you assign hex codes like \xe0\xae\xa8\xe0... to a string, R doesn't know how they are meant to be interpreted, so it assumes the encoding of your computer's current locale. On most modern Unix-based systems that is UTF-8, so on a Mac, for example, your string displays as

> x
[1] "நடிகர் சூர்யா செய்த உதவி மிக முக்கியமானது!' - டெல்டாவில் நெகிழ்ந்த சீமான்"

which I assume is the correct display. Google Translate recognizes it as being written in Tamil.

However, on Windows it displays unreadably. On my Windows 10 system, I see

> x
[1] "நடிகர௠சூரà¯à®¯à®¾ செயà¯à®¤ உதவி மிக à®®à¯à®•à¯à®•à®¿à®¯à®®à®¾à®©à®¤à¯!' - டெலà¯à®Ÿ

because it uses the code page corresponding to the Latin1 encoding, which is wrong for that string. To get it to display properly on Windows, you need to tell R that it is encoded in UTF-8 by declaring its encoding:

Encoding(x) <- "UTF-8"

Then it will display properly in Windows as well, which solves your problem.
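The same declare-the-encoding step can be shown in a minimal, self-contained sketch using a two-byte Cyrillic example instead of the Tamil string above:

```r
# "\xd0\xbc" is the UTF-8 byte sequence for the Cyrillic letter "м"
x <- "\xd0\xbc"
Encoding(x)            # typically "unknown": R hasn't been told what the bytes mean
Encoding(x) <- "UTF-8" # declare, don't convert -- the bytes are unchanged
Encoding(x)            # "UTF-8"
x                      # now prints as "м" in a UTF-8-capable console
```

Note that `Encoding<-` only marks the string; it never re-encodes the underlying bytes, which is why it only works when the bytes really are in the declared encoding.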

For others trying to do this, it's important to know that only a few values work this way. You can declare the encoding to be "UTF-8", "latin1", "bytes" or "unknown": "unknown" means the local encoding on the machine, and "bytes" means the string shouldn't be interpreted as characters at all. If your string is in a different encoding, you need a different approach: convert it to one of the encodings that R knows about.

For example, the string

x <- "\xb4\xde\xd1\xe0\xde\xd5 \xe3\xe2\xe0\xde" 

is Russian encoded in ISO 8859-5. On a system where that was the local encoding it would display properly, but on mine it displays using the hex codes. To get it to display properly I need to convert it to UTF-8 using

y <- iconv(x, from="ISO8859-5", to="UTF-8")

Then it will display properly as [1] "Доброе утро". You can see the full list of encodings that iconv() knows about using iconvlist().
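Since encoding names vary somewhat between platforms, it can be worth checking the exact name you intend to pass before calling iconv():

```r
# Look up encoding names known to this build of R; exact spellings
# (e.g. "ISO8859-5" vs "ISO-8859-5") differ between platforms
"UTF-8" %in% iconvlist()
grep("8859-5", iconvlist(), value = TRUE)
```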


