Gsub in R with Unicode Replacement Gives Different Results Under Windows Compared with Unix

Why does gsub in R with a Unicode replacement give different results under Windows compared with Unix?

If you're not seeing the right character on Windows, try explicitly setting the encoding:

x <- gsub("<U\\+[0-9A-F]{4}>", "\u03B2", "<U+03B2>")
Encoding(x) <- "UTF-8"
x
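A quick sanity check of the snippet above (a sketch; the explicit Encoding() call is what matters on Windows, where the marked encoding may otherwise be lost):

```r
## Replace the "<U+....>" token and mark the result as UTF-8 so it
## prints correctly on Windows consoles.
x <- gsub("<U\\+[0-9A-F]{4}>", "\u03B2", "<U+03B2>")
Encoding(x) <- "UTF-8"
Encoding(x)  # "UTF-8"
nchar(x)     # 1: a single beta character, not the 8-character token
```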

As far as replacing all such symbols with Unicode characters goes, I've adapted this answer to do a similar thing. Here we build the Unicode character as a raw vector, packing the code point's bits into a UTF-8 byte sequence. Here's a helper function:

trueunicode <- function(x) {
  packuni <- Vectorize(function(cp) {
    bv <- intToBits(cp)
    maxbit <- tail(which(bv != as.raw(0)), 1)
    if (maxbit < 8) {
      # 1-byte sequence (ASCII)
      rawToChar(as.raw(cp))
    } else if (maxbit < 12) {
      # 2-byte sequence: 110xxxxx 10xxxxxx
      rawToChar(rev(packBits(c(bv[1:6], as.raw(c(0, 1)), bv[7:11], as.raw(c(0, 1, 1))), "raw")))
    } else if (maxbit < 17) {
      # 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
      rawToChar(rev(packBits(c(bv[1:6], as.raw(c(0, 1)), bv[7:12], as.raw(c(0, 1)), bv[13:16], as.raw(c(0, 1, 1, 1))), "raw")))
    } else {
      stop("too many bits")
    }
  })
  m <- gregexpr("<U\\+[0-9a-fA-F]{4}>", x)
  codes <- regmatches(x, m)
  chars <- lapply(codes, function(x) {
    codepoints <- strtoi(paste0("0x", substring(x, 4, 7)))
    packuni(codepoints)
  })
  regmatches(x, m) <- chars
  Encoding(x) <- "UTF-8"
  x
}

and then we can use it like

x <- c("beta <U+03B2>", "flipped e <U+018F>!", "<U+2660> <U+2663> <U+2665> <U+2666>")
trueunicode(x)
# [1] "beta β" "flipped e Ə!" "♠ ♣ ♥ ♦"
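As a simpler sketch of the same idea, base R's intToUtf8() converts code points to characters directly, avoiding the manual bit-packing (tounicode is a hypothetical helper name; it assumes the same "<U+xxxx>" token format as above):

```r
## Replace each "<U+xxxx>" token with the character for that code point.
tounicode <- function(x) {
  m <- gregexpr("<U\\+[0-9a-fA-F]{4}>", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(codes) {
    # characters 4-7 of "<U+03B2>" are the hex digits "03B2"
    vapply(strtoi(substring(codes, 4, 7), 16L), intToUtf8, character(1))
  })
  x
}
tounicode("beta <U+03B2>")  # "beta β"
```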

Convert Byte Encoding to Unicode

How about this:

x <- "bi<df>chen Z<fc>rcher hello world <c6>"

m <- gregexpr("<[0-9a-f]{2}>", x)
codes <- regmatches(x, m)
chars <- lapply(codes, function(x) {
  rawToChar(as.raw(strtoi(paste0("0x", substr(x, 2, 3)))), multiple = TRUE)
})

regmatches(x, m) <- chars

x
# [1] "bi\xdfchen Z\xfcrcher hello world \xc6"

Encoding(x) <- "latin1"
x
# [1] "bißchen Zürcher hello world Æ"

Note that you can't make an escaped character by pasting a "\x" onto the front of a number. That "\x" really isn't in the string at all; it's just how R chooses to represent the byte on screen. Here we use rawToChar() to turn the number into the character we want.

I tested this on a Mac so I had to set the encoding to "latin1" to see the correct symbols in the console. Just using a single byte like that isn't proper UTF-8.
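If downstream code expects UTF-8 rather than latin1, iconv() can re-encode the repaired string afterwards (a side note, not part of the answer above):

```r
## Start with raw latin1 bytes, declare their encoding, then convert.
x <- "bi\xdfchen Z\xfcrcher"
Encoding(x) <- "latin1"
y <- iconv(x, from = "latin1", to = "UTF-8")
Encoding(y)  # "UTF-8"
```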

Using awk to detect UTF-8 multibyte character

You seem to be targeting UTF-8 specifically. Indeed, in UTF-8 the first byte of a multibyte character has the form 0b11xxxxxx and each continuation byte has the form 0b10xxxxxx, where x represents any value (from Wikipedia).

So you can detect such a sequence with sed by matching the hex ranges and exiting with a nonzero exit status if one is found:

LC_ALL=C sed -n '/[\xC0-\xFF][\x80-\xBF]/q1'

I.e., match bytes in the ranges [0b11000000-0b11111111][0b10000000-0b10111111].

I think \x?? and q are both GNU extensions to sed.
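A usage sketch with GNU sed (the exit status drives the branch: q1 quits with status 1 when a match is found, so the else branch fires for multibyte input; "é" here is the two bytes \xc3\xa9):

```shell
if printf 'caf\xc3\xa9\n' | LC_ALL=C sed -n '/[\xC0-\xFF][\x80-\xBF]/q1'; then
  echo "pure ASCII"
else
  echo "multibyte found"
fi
```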


