Convert unicode to readable characters in R
You could do something like this:
library(stringi)
string <- "<U+1042><U+1040><U+1042><U+1040> <U+1019><U+103D><U+102C>\n\n<U+1010><U+102D><U+102F><U+1004><U+1039><U+1038><U+103B><U+1015><U+100A><U+1039><U+1000><U+102D><U+102F><U+101C><U+1032> <U+1000><U+102C><U+1000><U+103C>"
cat(stri_unescape_unicode(gsub("<U\\+(....)>", "\\\\u\\1", string)))
Which results in:
၂၀၂၀ မွာ
တိုင္းျပည္ကိုလဲ ကာကြ
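The gsub() pattern above matches exactly four characters after "U+", so markers for code points outside the Basic Multilingual Plane (for example <U+1F600>) would be left untouched. As a minimal sketch, assuming every marker holds 4 to 6 hexadecimal digits, the same conversion can be done in base R. The helper name decode_u_markers is hypothetical and not part of stringi:

```r
# Hypothetical helper (not from stringi): decode "<U+XXXX>" markers in base R.
# Assumes each marker holds 4 to 6 hexadecimal digits.
decode_u_markers <- function(x) {
  m <- gregexpr("<U\\+[0-9A-Fa-f]{4,6}>", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    # Strip the leading "<U+" and the trailing ">" to leave the hex digits
    hex <- substr(tokens, 4, nchar(tokens) - 1)
    # Convert each hex code point to its UTF-8 character
    vapply(hex, function(h) intToUtf8(strtoi(h, base = 16L)),
           character(1), USE.NAMES = FALSE)
  })
  x
}

decode_u_markers("<U+5ECA><U+574A><U+5E02>")
decode_u_markers("<U+043C> and <U+1F600>")
```

Text that contains no markers passes through unchanged, and the function is vectorised over x.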
Can I convert Unicode into plain text in R?
When you type "<U+043C>", it is interpreted as a literal string of 8 characters. Whether this string is stored as Latin-1 or UTF-8 doesn't matter, since both encode these 8 ASCII characters the same way.
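A quick way to see the difference, as a small illustration using only base R, is to compare the marker text with the character it stands for:

```r
marker <- "<U+043C>"   # eight ASCII characters, not a Cyrillic letter
letter <- "\u043c"     # the actual character U+043C

nchar(marker)                  # 8
nchar(letter)                  # 1
nchar(letter, type = "bytes")  # 2, because U+043C takes two bytes in UTF-8
```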
What you need to do is unescape the Unicode markers. The stringi package can do this for you, but you first need to convert the markers into the \uXXXX format it expects. The following function should take care of it:
f <- function(x) {
  # Rewrite each "<U+XXXX>" marker as "\uXXXX"; any other ">" in the text is left alone
  x <- gsub("<U\\+([0-9A-Fa-f]{4})>", "\\\\u\\1", x)
  stringi::stri_unescape_unicode(x)
}
So you can do:
example <- c("<U+041F><U+043E><U+043A><U+0430><U+0437><U+044B>: 58025")
www <- c("<U+043C>")
f(example)
#> [1] "Показы: 58025"
f(www)
#> [1] "м"
Decoding to Chinese characters in R
This code will convert the string to the appropriate Chinese characters:
library(stringi)
string <- '<U+5ECA><U+574A><U+5E02>'
cat(stri_unescape_unicode(gsub("<U\\+(....)>", "\\\\u\\1", string)))
# Output: 廊坊市
Source: Convert unicode to readable characters in R
Convert unicode to a readable string
When you assign the hex codes like \xe0\xae\xa8\xe0... to a string, R doesn't know how they are intended to be interpreted, so it assumes the encoding of the current locale on your computer. On most modern Unix-based systems that is UTF-8, so for example on a Mac your string displays as
> x
[1] "நடிகர் சூர்யா செய்த உதவி மிக முக்கியமானது!' - டெல்டாவில் நெகிழ்ந்த சீமான்"
which I assume is the correct display. Google Translate recognizes it as being written in Tamil.
However, on Windows it displays unreadably. On my Windows 10 system, I see
> x
[1] "நடிகர௠சூரà¯à®¯à®¾ செயà¯à®¤ உதவி மிக à®®à¯à®•à¯à®•à®¿à®¯à®®à®¾à®©à®¤à¯!' - டெலà¯à®Ÿ
because it uses the code page corresponding to the Latin1 encoding, which is wrong for that string. To get it to display properly on Windows, you need to tell R that it is encoded in UTF-8 by declaring its encoding:
Encoding(x) <- "UTF-8"
Then it will display properly in Windows as well, which solves your problem.
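As a small sketch of what the declaration does, the example below uses only the first three bytes quoted above (\xe0\xae\xa8, which are the UTF-8 bytes of the Tamil letter U+0BA8). Declaring the encoding does not change the stored bytes; it only tells R how to read them:

```r
x <- "\xe0\xae\xa8"            # raw bytes, no encoding declared by us
bytes_before <- charToRaw(x)

Encoding(x) <- "UTF-8"         # declare how the bytes should be read
bytes_after <- charToRaw(x)

identical(bytes_before, bytes_after)  # TRUE: the bytes are untouched
x == "\u0ba8"                         # TRUE: now read as one Tamil letter
```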
For others trying to do this, it's important to know that there are only a few values that work this way. You can declare the encoding to be "UTF-8", "latin1", "bytes" or "unknown". "unknown" means the local encoding on the machine, "bytes" means it shouldn't be interpreted as characters at all. If your string has a different encoding, you need to use a different approach: convert to one of the encodings that R knows about.
For example, the string
x <- "\xb4\xde\xd1\xe0\xde\xd5 \xe3\xe2\xe0\xde"
is Russian encoded in ISO 8859-5. On a system where that was the local encoding it would display properly, but on mine it displays using the hex codes. To get it to display properly I need to convert it to UTF-8 using
y <- iconv(x, from="ISO8859-5", to="UTF-8")
Then it will display properly as [1] "Доброе утро". You can see the full list of encodings that iconv() knows about using iconvlist().
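One way to check such a conversion, sketched here with base R only, is to test the bytes with validUTF8() before and after calling iconv(). The encoding name "ISO8859-5" is taken from the answer above; the names iconv() accepts vary by platform, so treat it as an assumption and confirm it against iconvlist() on your system:

```r
x <- "\xb4\xde\xd1\xe0\xde\xd5 \xe3\xe2\xe0\xde"  # ISO 8859-5 bytes

validUTF8(x)   # FALSE: these bytes are not valid UTF-8 as they stand

# Assumes the local iconv recognises the name "ISO8859-5"
y <- iconv(x, from = "ISO8859-5", to = "UTF-8")

validUTF8(y)   # TRUE after a successful conversion
Encoding(y)    # "UTF-8"
```

If iconv() cannot convert an element, it returns NA for it, so checking is.na(y) is a reasonable guard before using the result.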