Force Character Vector Encoding from "Unknown" to "Utf-8" in R

The Encoding function returns unknown if a character string has a "native encoding" mark (CP-1250 in your case) or if it's in ASCII.
To discriminate between these two cases, call:

library(stringi)
stri_enc_mark(poli.dt$word)
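
For comparison, here is a minimal base R sketch of the two cases in which Encoding reports (or does not report) "unknown". The variable names and sample strings are made up for illustration and are not taken from the question's data.

```r
# Hypothetical sample strings, base R only.
ascii_str <- "abc"            # pure ASCII: never carries an encoding mark
utf8_str  <- "\xc5\xbc"       # the two bytes that encode U+017C in UTF-8
Encoding(utf8_str) <- "UTF-8" # declare those bytes to be UTF-8

Encoding(ascii_str)  # "unknown"
Encoding(utf8_str)   # "UTF-8"
```

Encoding alone cannot tell ASCII apart from unmarked native strings, which is why the stringi call above is the more informative check.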

To check whether each string is a valid UTF-8 byte sequence, call:

all(stri_enc_isutf8(poli.dt$word))

If this returns FALSE, your file is definitely not in UTF-8.

I suspect that you haven't forced the UTF-8 mode in the data read function (try inspecting the contents of poli.dt$word to verify this statement). If my guess is true, try:

read.csv2(file("filename", encoding="UTF-8"))

or

poli.dt$word <- stri_encode(poli.dt$word, "", "UTF-8") # re-mark encodings

If data.table still complains about the "mixed" encodings, you may want to transliterate the non-ASCII characters, e.g.:

stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")
## [1] "Zazolc gesla jazn"

R String Encoding from unknown / ASCII to UTF-8

Thanks to @MrFlick I was able to solve the problem. Essentially, given a data frame with character columns of mixed encodings, the easiest workaround was to:

df %>%
  mutate_if(is.character, function(x) {
    x %>%
      sapply(function(y) {
        y %>%
          charToRaw %>%
          rawToChar
      })
  })

This strips the encoding marks, so every string is treated as being in the same native encoding (the underlying bytes are unchanged). It solved the issue where I was unable to load the data into Elasticsearch due to encoding inconsistencies.
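
To see what the round trip through charToRaw and rawToChar does to a single string, here is a small base R sketch. The sample string is an assumption chosen for illustration; the point is only that the bytes survive while the declared encoding does not.

```r
# Hypothetical sample string: the UTF-8 bytes of U+017C, marked as UTF-8.
marked <- "\xc5\xbc"
Encoding(marked) <- "UTF-8"

# Round trip through raw bytes, as in the workaround above.
unmarked <- rawToChar(charToRaw(marked))

Encoding(marked)    # "UTF-8"
Encoding(unmarked)  # "unknown": the mark is gone
identical(charToRaw(marked), charToRaw(unmarked))  # TRUE: same bytes
```

Because no conversion happens, this only helps when the bytes are already in the encoding the downstream tool expects.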

In R convert character encoding to UTF-8 (not using stringi)

The iconv function may be an option.
For example, if the current encoding is latin1:

iconv(test_string, "latin1", "UTF-8")
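
A slightly fuller sketch of the same call, using only base R. The value of test_string here is an assumed example (a latin1 string containing the single byte 0xE7), and validUTF8 requires R 3.3.0 or later.

```r
# Assumed example value: "facile" with a c-cedilla stored as the latin1 byte 0xE7.
test_string <- "fa\xE7ile"
Encoding(test_string) <- "latin1"

validUTF8(test_string)   # FALSE: 0xE7 followed by "i" is not valid UTF-8

converted <- iconv(test_string, from = "latin1", to = "UTF-8")

validUTF8(converted)     # TRUE
Encoding(converted)      # "UTF-8"
charToRaw(converted)     # 66 61 c3 a7 69 6c 65
```

Note that iconv needs the correct source encoding; passing the wrong "from" value converts without error but produces the wrong characters.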

Trouble trying to clean a character vector in R data frame (UTF-8 encoding issue)

The following removes the no-break space, assuming it appears in the data as the literal text <U+00A0>:

head2007decathlon$Athlete <- gsub(pattern="<U\\+00A0>",replacement="",x=head2007decathlon$Athlete) 
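
As a quick check of what that pattern does, here is a sketch on a made-up vector (the athlete values below are illustrative, not the question's data). It also shows a variant that substitutes an ordinary space, which may be preferable if the marker sits between two words.

```r
# Hypothetical sample values containing the literal marker text.
athlete <- c("Roman<U+00A0>Sebrle", "Bryan Clay")

# Regex form: the plus sign must be escaped.
removed <- gsub("<U\\+00A0>", "", athlete)

# Fixed-string form: no escaping needed; here the marker becomes a plain space.
spaced <- gsub("<U+00A0>", " ", athlete, fixed = TRUE)

removed  # "RomanSebrle" "Bryan Clay"
spaced   # "Roman Sebrle" "Bryan Clay"
```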

Not sure how to convert the other characters. One problem could be that the codes are not exactly in a format that R sees as UTF-8.

One example:

iconv('\u008A', from="UTF-8", to="LATIN1")

This conversion does have an effect, unlike trying to convert the literal text U+008A, although the output is:

[1] "\x8a"

not the character you want. Hope this helps somehow.

R- Changing encoding of column in dataframe?

It appears that we can use the iconv() function to convert the encoding after we convert the vector into a factor and then back to a character vector. It is a bit strange, to be honest.


