Convert unicode to readable characters in R
You could do something like this:
library(stringi)
string <- "<U+1042><U+1040><U+1042><U+1040> <U+1019><U+103D><U+102C>\n\n<U+1010><U+102D><U+102F><U+1004><U+1039><U+1038><U+103B><U+1015><U+100A><U+1039><U+1000><U+102D><U+102F><U+101C><U+1032> <U+1000><U+102C><U+1000><U+103C>"
cat(stri_unescape_unicode(gsub("<U\\+(....)>", "\\\\u\\1", string)))
Which results in:
၂၀၂၀ မွာ
တိုင္းျပည္ကိုလဲ ကာကြ
Can I convert Unicode into plain text in R?
When you type "<U+043C>", it is interpreted as a literal string of 8 characters. Whether this string is stored as Latin-1 or UTF-8 doesn't matter, since both encode these 8 ASCII characters the same way.
What you need to do is unescape the Unicode markers. The stringi package can do this for you, but you first need to convert the markers into the \uXXXX escape format it expects. The following function should take care of it:
f <- function(x) {
  # Rewrite each "<U+XXXX>" marker as "\uXXXX"; any other ">" in the text is left alone
  x <- gsub("<U\\+([0-9A-Fa-f]{4})>", "\\\\u\\1", x)
  # Turn the escape sequences into the actual characters
  stringi::stri_unescape_unicode(x)
}
So you can do:
example <- c("<U+041F><U+043E><U+043A><U+0430><U+0437><U+044B>: 58025")
www <- c("<U+043C>")
f(example)
#> [1] "Показы: 58025"
f(www)
#> [1] "м"
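If you would rather not depend on stringi, the same idea can be sketched in base R. This is only a sketch under stated assumptions: decode_u_markers is a hypothetical helper name, and it assumes every marker has the form <U+XXXX> with four to six hex digits.

```r
# Hypothetical helper: decode "<U+XXXX>" markers using base R only.
decode_u_markers <- function(x) {
  # Assumption: markers carry 4 to 6 hex digits, e.g. <U+043C> or <U+1F600>
  pattern <- "<U\\+[0-9A-Fa-f]{4,6}>"
  m <- gregexpr(pattern, x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    # Drop the leading "<U+" and the trailing ">", then parse the hex digits
    code_points <- strtoi(substr(tokens, 4, nchar(tokens) - 1), base = 16L)
    # Convert each code point to its own one-character string
    vapply(code_points, intToUtf8, character(1))
  })
  x
}

decode_u_markers("<U+041F><U+043E><U+043A><U+0430><U+0437><U+044B>: 58025")
```

Text that contains no markers passes through unchanged, and the call above should give the same string as f(example).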
How to generate all possible unicode characters?
There may be easier ways to do this, but here goes. The Unicode package contains everything you need.
First, we can get a list of Unicode scripts and their code point ranges:
library(Unicode)
uranges <- u_scripts()
Check what we've got:
head(uranges, 3)
$Adlam
[1] U+1E900..U+1E943 U+1E944..U+1E94A U+1E94B U+1E950..U+1E959 U+1E95E..U+1E95F
$Ahom
[1] U+11700..U+1171A U+1171D..U+1171F U+11720..U+11721 U+11722..U+11725 U+11726 U+11727..U+1172B U+11730..U+11739 U+1173A..U+1173B U+1173C..U+1173E U+1173F
[11] U+11740..U+11746
$Anatolian_Hieroglyphs
[1] U+14400..U+14646
Next we can convert the ranges into their sequences.
expand_uranges <- lapply(uranges, as.u_char_seq)
To get a single vector of all characters, we can unlist it. A flat vector is harder to work with, though, so in practice it is better to keep them as a list:
all_unicode_chars <- unlist(expand_uranges)
# The Wikipedia page linked states there are 144,697 characters
length(all_unicode_chars)
[1] 144762
So this seems to cover all of them, and the page needs updating. The characters are stored as integers, so to print them (assuming your font supports the glyphs) we can use intToUtf8(), for example on the Japanese katakana:
intToUtf8(expand_uranges$Katakana[[1]])
[1] "ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ"
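To make the integer-to-character step concrete, here is a small base R sketch that does not need the Unicode package. It assumes the range printed above runs from U+30A1 (ァ) to U+30FA (ヺ); the variable names are only illustrative.

```r
# Assumed code point range of the katakana string printed above
katakana_cp <- 0x30A1:0x30FA

# multiple = TRUE returns one string per code point instead of one long string
katakana <- intToUtf8(katakana_cp, multiple = TRUE)

length(katakana)          # 90 code points in this range
utf8ToInt(katakana[1])    # 12449, which is 0x30A1 in decimal

# Format a code point in the usual U+XXXX notation
sprintf("U+%04X", utf8ToInt(katakana[1]))   # "U+30A1"
```

Keeping one string per code point makes it easy to filter or index individual characters before pasting them back together.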
Converting from NCR to Unicode in R
1123 is the decimal equivalent of the hexadecimal 0463, and Unicode escape sequences use hexadecimal. So to do the conversion, you need to strip out the non-digit characters, convert the remaining number to hexadecimal, put "\u" in front of it, and then use stri_unescape_unicode.
This function will do all that:
ncr2uni <- function(x)
{
  # Strip out non-digits and convert the remaining numbers to hex
  x <- as.hexmode(as.numeric(gsub("\\D", "", x)))
  # Left pad with zeros to length 4 so the escape sequence is recognised as Unicode
  x <- stringi::stri_pad_left(x, 4, "0")
  # Convert to Unicode
  stringi::stri_unescape_unicode(paste0("\\u", x))
}
Now you can do:
ncr2uni(c("&#1123;", "&#1124;", "&#1125;"))
# [1] "ѣ" "Ѥ" "ѥ"
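The function above expects each element to be a single decimal NCR. If the references are embedded in running text, or use the hexadecimal form (&#x463;), a base R variant along these lines might help. This is a sketch only: decode_ncr is a hypothetical name, and it assumes well-formed references of the form &#NNNN; or &#xHHHH;.

```r
# Hypothetical helper: replace numeric character references inside text.
decode_ncr <- function(x) {
  # Match decimal (&#1123;) and hexadecimal (&#x463;) references
  m <- gregexpr("&#([0-9]+|[xX][0-9A-Fa-f]+);", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    body <- substr(tokens, 3, nchar(tokens) - 1)   # drop "&#" and ";"
    is_hex <- grepl("^[xX]", body)
    digits <- sub("^[xX]", "", body)
    # Parse with base 16 or base 10 depending on the "x" prefix
    code_points <- ifelse(is_hex, strtoi(digits, 16L), strtoi(digits, 10L))
    vapply(code_points, intToUtf8, character(1))
  })
  x
}

decode_ncr("Old Cyrillic yat: &#1123; (also written &#x463;)")
```

Both references in the example point at the same code point, U+0463, so they should decode to the same character.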
Specify unicode characters programmatically R
library(stringi)
stri_unescape_unicode(paste0("\\u","00c3"))
#[1] "Ã"
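A base R alternative, sketched here under the assumption that the input is a vector of hex code point strings without any prefix, is to parse the hex digits with strtoi() and pass the result to intToUtf8(). The name hex_to_char is hypothetical.

```r
# Hypothetical helper: turn hex code point strings into characters.
hex_to_char <- function(hex) {
  code_points <- strtoi(hex, base = 16L)   # "00c3" becomes 195
  # One string per code point, so the result has the same length as the input
  vapply(code_points, intToUtf8, character(1))
}

hex_to_char(c("00c3", "0463", "1F600"))
```

Unlike a \uXXXX escape, this approach is not limited to four hex digits, so code points above U+FFFF can be handled the same way.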