gsub in R with Unicode replacement gives different results under Windows compared with Unix?
If you're not seeing the right character on Windows, try explicitly setting the encoding:
x <- gsub("<U\\+[0-9A-F]{4}>", "\u03B2", "<U+03B2>")
Encoding(x) <- "UTF-8"
x
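A quick way to confirm the encoding mark took effect (a hedged check, reusing the same x as above) is to look at nchar() and Encoding():

```r
# Build the string and mark it as UTF-8, as above
x <- gsub("<U\\+[0-9A-F]{4}>", "\u03B2", "<U+03B2>")
Encoding(x) <- "UTF-8"

nchar(x)      # 1 character once the bytes are interpreted as UTF-8
Encoding(x)   # "UTF-8"
```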
As far as replacing all such symbols with Unicode characters goes, I've adapted this answer to do a similar thing: we build the Unicode character as a raw vector. Here's a helper function:
trueunicode <- function(x) {
  # Convert a single codepoint to its UTF-8 bytes
  packuni <- Vectorize(function(cp) {
    bv <- intToBits(cp)
    maxbit <- tail(which(bv != as.raw(0)), 1)
    if (maxbit < 8) {
      # Single-byte (ASCII) character
      rawToChar(as.raw(cp))
    } else if (maxbit < 12) {
      # Two-byte sequence: 110xxxxx 10xxxxxx
      rawToChar(rev(packBits(c(bv[1:6], as.raw(c(0,1)), bv[7:11], as.raw(c(0,1,1))), "raw")))
    } else if (maxbit < 17) {
      # Three-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
      rawToChar(rev(packBits(c(bv[1:6], as.raw(c(0,1)), bv[7:12], as.raw(c(0,1)), bv[13:16], as.raw(c(0,1,1,1))), "raw")))
    } else {
      stop("too many bits")
    }
  })
  m <- gregexpr("<U\\+[0-9a-fA-F]{4}>", x)
  codes <- regmatches(x, m)
  chars <- lapply(codes, function(x) {
    codepoints <- strtoi(paste0("0x", substring(x, 4, 7)))
    packuni(codepoints)
  })
  regmatches(x, m) <- chars
  Encoding(x) <- "UTF-8"
  x
}
and then we can use it like so:
x <- c("beta <U+03B2>", "flipped e <U+018F>!", "<U+2660> <U+2663> <U+2665> <U+2666>")
trueunicode(x)
# [1] "beta β" "flipped e Ə!" "♠ ♣ ♥ ♦"
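As a sanity check on the bit-packing above: β is U+03B2, whose two-byte UTF-8 encoding is 0xCE 0xB2, so building the character directly from those raw bytes should match "\u03B2":

```r
# U+03B2 (beta) encodes in UTF-8 as the two bytes 0xCE 0xB2
beta <- rawToChar(as.raw(c(0xCE, 0xB2)))
Encoding(beta) <- "UTF-8"

identical(beta, "\u03B2")   # TRUE
charToRaw("\u03B2")         # ce b2
```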
Convert byte encoding to Unicode
How about this:
x <- "bi<df>chen Z<fc>rcher hello world <c6>"
m <- gregexpr("<[0-9a-f]{2}>", x)
codes <- regmatches(x, m)
chars <- lapply(codes, function(x) {
  rawToChar(as.raw(strtoi(paste0("0x", substr(x, 2, 3)))), multiple = TRUE)
})
regmatches(x, m) <- chars
x
# [1] "bi\xdfchen Z\xfcrcher hello world \xc6"
Encoding(x) <- "latin1"
x
# [1] "bißchen Zürcher hello world Æ"
Note that you can't make an escaped character by pasting a "\x" onto the front of a number. That "\x" really isn't in the string at all; it's just how R chooses to represent it on screen. Here we use rawToChar() to turn a number into the character we want.
I tested this on a Mac, so I had to set the encoding to "latin1" to see the correct symbols in the console. A lone high byte like that isn't valid UTF-8.
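If you'd rather end up with UTF-8 than latin1 (say, for portability across platforms), one follow-up sketch is to re-encode with iconv() after marking the encoding:

```r
x <- "bi\xdfchen Z\xfcrcher hello world \xc6"
Encoding(x) <- "latin1"

# Re-encode the latin1 bytes as UTF-8
x_utf8 <- iconv(x, from = "latin1", to = "UTF-8")
Encoding(x_utf8)   # "UTF-8"
```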
Using awk to detect UTF-8 multibyte characters
You seem to be targeting UTF-8 specifically. Indeed, the first byte of a multibyte character in UTF-8 has the form 0b11xxxxxx, and each continuation byte has the form 0b10xxxxxx, where x represents any value (from Wikipedia). So you can detect such a sequence with sed by matching the hex ranges and exiting with a nonzero status if one is found:
LC_ALL=C sed -n '/[\xC0-\xFF][\x80-\xBF]/q1'
I.e. match bytes in the ranges [0b11000000-0b11111111][0b10000000-0b10111111]. I think the \x?? escapes and q with an exit status are both GNU extensions to sed.
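To see the exit-status behaviour end to end (a sketch assuming GNU sed and a printf that understands \x escapes, as in bash):

```shell
# Pure ASCII input: no multibyte sequence matches, sed reads to EOF, exits 0
printf 'hello\n' | LC_ALL=C sed -n '/[\xC0-\xFF][\x80-\xBF]/q1'
echo $?   # 0

# Input containing a UTF-8 multibyte character (é = 0xC3 0xA9): exits 1
printf 'caf\xc3\xa9\n' | LC_ALL=C sed -n '/[\xC0-\xFF][\x80-\xBF]/q1'
echo $?   # 1
```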