Force character vector encoding from unknown to UTF-8 in R
The Encoding function returns "unknown" if a character string has a "native encoding" mark (CP-1250 in your case) or if it's in ASCII.
To discriminate between these two cases, call:
library(stringi)
stri_enc_mark(poli.dt$word)
To check whether each string is a valid UTF-8 byte sequence, call:
all(stri_enc_isutf8(poli.dt$word))
If it's not the case, your file is definitely not in UTF-8.
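If stringi is not available, a rough base R equivalent of the two checks above can be sketched with Encoding() and validUTF8(). The sample vector x below is hypothetical and stands in for poli.dt$word:

```r
# Hypothetical sample: an ASCII string, a UTF-8 string, and a latin1 byte string
x <- c("abc", "\u017c\u00f3\u0142w", rawToChar(as.raw(c(0x66, 0xe9))))

# Declared encoding marks; ASCII and native strings both show up as "unknown"
marks <- Encoding(x)

# Byte-level check: is each element a valid UTF-8 sequence?
valid <- validUTF8(x)

print(marks)
print(valid)
```

The third element is reported as invalid because the lone byte 0xE9 cannot stand on its own in UTF-8, which is the typical symptom of latin1 or CP-125x data read without an explicit encoding.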
I suspect that you haven't forced the UTF-8 mode in the data read function (try inspecting the contents of poli.dt$word to verify this statement). If my guess is true, try:
read.csv2(file("filename", encoding="UTF-8"))
or
poli.dt$word <- stri_encode(poli.dt$word, "", "UTF-8") # re-mark encodings
If data.table still complains about the "mixed" encodings, you may want to transliterate the non-ASCII characters, e.g.:
stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")
## [1] "Zazolc gesla jazn"
R String Encoding from unknown / ASCII to UTF-8
Thanks to @MrFlick I was able to solve the problem. Essentially, given a data frame with character columns of mixed encodings, the easiest workaround was to:
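library(dplyr)

df %>%
  mutate_if(is.character, function(x) {
    # Round-trip each string through its raw bytes, which drops the declared encoding mark
    sapply(x, function(y) {
      y %>%
        charToRaw %>%
        rawToChar
    })
  })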
This makes sure that all the strings carry the same native encoding mark. This solved the issue where I was unable to load the data into Elasticsearch due to encoding inconsistencies.
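To see what the charToRaw / rawToChar round trip does to a single string, here is a minimal sketch. The variable names are made up for illustration; the point is that the bytes stay the same and only the declared mark changes:

```r
s <- "\u017c"                       # a \u escape yields a string marked as UTF-8
before <- Encoding(s)

r <- rawToChar(charToRaw(s))        # same bytes, rebuilt without an encoding mark
after <- Encoding(r)

same_bytes <- identical(charToRaw(s), charToRaw(r))

print(c(before = before, after = after))
print(same_bytes)
```

Because nothing is actually converted, this only makes the marks uniform. If the underlying bytes are in different encodings, they will still differ after the round trip.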
In R convert character encoding to UTF-8 (not using stringi)
The iconv function may be a choice.
For example, if the current encoding is latin1:
iconv(test_string, "latin1", "UTF-8")
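A small self-contained sketch of that call, assuming test_string really holds latin1 bytes (here it is built by hand from the bytes of "café"):

```r
# "café" as latin1: the é is the single byte 0xE9
test_string <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

utf8_string <- iconv(test_string, from = "latin1", to = "UTF-8")

print(Encoding(utf8_string))    # the result is marked as UTF-8
print(charToRaw(utf8_string))   # é is now the two bytes c3 a9
print(validUTF8(utf8_string))
```

If the from encoding is guessed wrongly, iconv may return NA or mojibake, so it is worth checking a few converted values by eye.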
Trouble trying to clean a character vector in R data frame (UTF-8 encoding issue)
The following would remove the no-break space:
head2007decathlon$Athlete <- gsub(pattern="<U\\+00A0>",replacement="",x=head2007decathlon$Athlete)
Not sure how to convert the other characters. One problem could be that the codes are not exactly in a format that R sees as UTF-8.
One example:
iconv('\u008A', from="UTF-8", to="LATIN1")
This seems to have an effect, contrary to trying to convert U+008A, although the output is:
[1] "\x8a"
not the character you want. Hope this helps somehow.
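The gsub call above targets the literal text <U+00A0> as it is printed in the console. If the column instead holds the actual no-break space character, a sketch along these lines may work. The athlete vector is a hypothetical stand-in for head2007decathlon$Athlete:

```r
# Hypothetical value containing a real no-break space (U+00A0) between the names
athlete <- "John\u00a0Smith"

# fixed = TRUE treats the pattern as a plain string rather than a regular expression
cleaned <- gsub("\u00a0", "", athlete, fixed = TRUE)

print(nchar(athlete))
print(nchar(cleaned))
```

Replacing with " " instead of "" would keep the names separated by an ordinary space.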
R- Changing encoding of column in dataframe?
It appears that we can use the iconv() function to convert the encoding after we convert the vector into a factor and then back to a character vector. It is a bit strange, to be honest.