R on Windows: Character Encoding Hell

The simple answer:

Sys.setlocale(locale = "Russian")

If you just want the Russian language (not formats or currency):

Sys.setlocale(category = "LC_COLLATE", locale = "Russian")

Sys.setlocale(category = "LC_CTYPE", locale = "Russian")
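
A minimal check that the setting took effect, assuming a Windows machine with the Russian locale installed (the exact name reported can vary by Windows version):

Sys.setlocale(category = "LC_CTYPE", locale = "Russian")
Sys.getlocale("LC_CTYPE")  # typically something like "Russian_Russia.1251"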

If you happen to be using Revolution R Open 3.2.2, you may need to set the locale in the Control Panel as well; otherwise, if you use RStudio, you'll see Cyrillic text in the Viewer but garbage in the console. For example, if you type a random Cyrillic string and hit Enter, you'll get garbled output. Interestingly, Revolution R does not have the same problem with, say, Arabic text. If you use regular R, Sys.setlocale() seems to be enough.

Sys.setlocale() was suggested by user G. Grothendieck in the answer to: R, Windows and foreign language characters.

Hebrew Encoding Hell in R and writing a UTF-8 table in Windows

Many people have similar problems working with UTF-8 text on platforms whose system encodings are 8-bit (such as Windows). Encoding in R can be tricky, because different methods handle encoding and conversions differently, and what appears to work fine on one platform (OS X or Linux) can work poorly on another.

The problem has to do with your output connection and how Windows handles encodings and text connections. I've tried to replicate the problem using some Hebrew texts in both UTF-8 and an 8-bit encoding. We'll walk through the file reading issues as well, since there could be some snags there too.

Test files

  • Created a short Hebrew language text file, encoded as UTF-8: hebrew-utf8.txt

  • Created a short Hebrew language text file, encoded as ISO-8859-8: hebrew-iso-8859-8.txt. (Note: you might need to tell your browser about the encoding to view this one properly; that's the case for Safari, for instance.)

Ways to read the files

Now let's experiment. I am using Windows 7 for these tests (everything works without these workarounds in OS X, my usual OS).

lines <- readLines("http://kenbenoit.net/files/hebrew-utf8.txt")
lines
## [1] "העברי ×”×•× ×—×‘×¨ בקבוצה ×”×›× ×¢× ×™×ª של שפות שמיות."
## [2] "זו היתה ×©×¤×ª× ×©×œ ×”×™×”×•×“×™× ×ž×•×§×“×, ×בל מן 586 ×œ×¤× ×”\"ס ×–×” התחיל להיות מוחלף על ידי ב×רמית."

That failed because readLines() assumed the text was in your system encoding, Windows-1252. But because no conversion occurred when the file was read, you can fix this just by setting the Encoding bit to UTF-8:

# this sets the bit for UTF-8
Encoding(lines) <- "UTF-8"
lines
## [1] "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."
## [2] "זו היתה שפתם של היהודים מוקדם, אבל מן 586 לפנה\"ס זה התחיל להיות מוחלף על ידי בארמית."

But it is better to do this when you read the file:

# this does it in one pass
lines2 <- readLines("http://kenbenoit.net/files/hebrew-utf8.txt", encoding = "UTF-8")
lines2[1]
## [1] "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."
Encoding(lines2)
## [1] "UTF-8" "UTF-8"

Now look at what happens if we try to read the same text, but encoded as the 8-bit ISO Hebrew code page.

lines3 <- readLines("http://kenbenoit.net/files/hebrew-iso-8859-8.txt")
lines3[1]
## [1] "äòáøé äåà çáø á÷áåöä äëðòðéú ùì ùôåú ùîéåú."

Setting the Encoding bit is of no help here, because what was read does not map to the Unicode code points for Hebrew, and Encoding() performs no actual conversion; it merely sets a flag telling R which of a few possible encodings a string is in. Nor would adding encoding = "ISO-8859-8" to the readLines() call have solved this: that argument only marks the input, it does not re-encode it. To convert while reading, open the connection with file(..., encoding = "ISO-8859-8") (a sketch follows the next code block). We can also convert the text after loading, using iconv():

# this will not fix things
Encoding(lines3) <- "UTF-8"
lines3[1]
## [1] "\xe4\xf2\xe1\xf8\xe9 \xe4\xe5\xe0 \xe7\xe1\xf8 \xe1\xf7\xe1\xe5\xf6\xe4 \xe4\xeb\xf0\xf2\xf0\xe9\xfa \xf9\xec \xf9\xf4\xe5\xfa \xf9\xee\xe9\xe5\xfa."
# but this will
iconv(lines3, "ISO-8859-8", "UTF-8")[1]
## [1] "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."
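
Here is a sketch of the connection-based approach mentioned above, which re-encodes while reading (same file URL as before):

con <- file("http://kenbenoit.net/files/hebrew-iso-8859-8.txt", encoding = "ISO-8859-8")
lines4 <- readLines(con)
close(con)
lines4[1]  # should match the iconv() result above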

Overall I think the method used above for lines2 is the best approach.

How to output the files, preserving encoding

Now to the question of how to write this out: the safest way is to control your connection at a low level, where you can specify the encoding. Otherwise, the default is for R on Windows to choose your system encoding, which will lose the UTF-8. I expected the following to work, and it does work absolutely fine in OS X, where calling writeLines() with just a file name, and no explicit connection, also works fine.

## to write lines, use the encoding option of a connection object
f <- file("hebrew-output-UTF-8.txt", open = "wt", encoding = "UTF-8")
writeLines(lines2, f)
close(f)

But it does not work on Windows. You can see the Windows 7 results here: hebrew-output-UTF-8-file_encoding.txt.

So, here is how to do it in Windows: Once you are sure your text is encoded as UTF-8, just write it as raw bytes, without using any encoding, like this:

writeLines(lines2, "hebrew-output-UTF-8-useBytesTRUE.txt", useBytes = TRUE)

You can see the results at hebrew-output-UTF-8-useBytesTRUE.txt, which is now UTF-8 and looks correct.
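
As a quick sanity check, you can read the file back and compare (a sketch, assuming the file was written as above):

check <- readLines("hebrew-output-UTF-8-useBytesTRUE.txt", encoding = "UTF-8")
identical(check, lines2)  # should be TRUE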

Addendum: write.csv

Note that the only reason you would want to do this is to make the .csv file available for import into other software, such as Excel. (And good luck working with UTF-8 in Excel/Windows...) Otherwise, you should just save the data frame in R's binary format, as sketched below.
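
A minimal sketch of the binary round trip (myDataFrame is a placeholder name):

save(myDataFrame, file = "myDataFrame.RData")  # writes R's binary format
load("myDataFrame.RData")  # later restores the object under its original name

But if you really need to output .csv, then: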

How to write UTF-8 .csv files from a data frame in Windows

The problem with writing UTF-8 files using write.table() and write.csv() is that they open text-mode connections, and Windows has limitations regarding UTF-8 on text connections. (This post offers a helpful explanation.) Following an SO answer posted here, we can work around this by writing our own function to output UTF-8 .csv files.

This assumes that you have already set the Encoding() for any character elements to "UTF-8" (which happens upon import above for lines2).

df <- data.frame(int = 1:2, text = lines2, stringsAsFactors = FALSE)

write_utf8_csv <- function(df, file) {
  firstline <- paste('"', names(df), '"', sep = "", collapse = " , ")
  data <- apply(df, 1, function(x) paste('"', x, '"', sep = "", collapse = " , "))
  writeLines(c(firstline, data), file, useBytes = TRUE)
}

write_utf8_csv(df, "df_csv.txt")

When we now look at that file in a non-Unicode-challenged OS, it looks fine:

KBsMBP15-2:Desktop kbenoit$ cat df_csv.txt 
"int" , "text"
"1" , "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."
"2" , "זו היתה שפתם של היהודים מוקדם, אבל מן 586 לפנה"ס זה התחיל להיות מוחלף על ידי בארמית."
KBsMBP15-2:Desktop kbenoit$ file df_csv.txt
df_csv.txt: UTF-8 Unicode text, with CRLF line terminators
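
To round-trip the file back into R, here is a hedged sketch of a matching reader (read_utf8_csv is a name introduced here; it assumes the exact " , " separator and quoting produced above):

read_utf8_csv <- function(file) {
  lines <- readLines(file, encoding = "UTF-8")
  lines <- gsub('(^"|"$)', "", lines)  # strip the outermost quotes
  fields <- strsplit(lines, '" , "', fixed = TRUE)
  df <- as.data.frame(do.call(rbind, fields[-1]), stringsAsFactors = FALSE)
  names(df) <- fields[[1]]
  df
}

df2 <- read_utf8_csv("df_csv.txt")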

RStudio not picking up the encoding I'm telling it to use when reading a file

This problem is caused by the wrong locale being set, whether inside RStudio or command-line R:

  1. If the problem only happens in RStudio, not command-line R, go to RStudio -> Preferences -> General, tell us what 'Default text encoding:' is set to, click 'Change' and try Windows-1252, UTF-8 or ISO8859-1 ('latin1') (or else 'Ask' if you always want to be prompted). A screenshot is attached at the bottom. Let us know which one worked!

  2. If the problem also happens in command-line R, do the following:

Run locale -m on your Mac and tell us whether it supports CP1252 or else ISO8859-1 ('latin1'); dump the list of supported locales if you need to. (You might as well tell us your version of macOS while you're at it.)

For each of those locales, try switching to it:

# first try Windows CP1252, although that's almost surely not supported on Mac:
Sys.setlocale("LC_ALL", "pt_PT.1252")   # don't omit the "LC_ALL" first argument, or it will fail
Sys.setlocale("LC_ALL", "pt_PT.CP1252") # the name might need to be 'CP1252'

# next try ISO8859-1 ('latin1'); this works for me:
Sys.setlocale("LC_ALL", "pt_PT.ISO8859-1")

# Try "pt_PT.UTF-8" too...

# in your program, make sure the Sys.setlocale() worked; sprinkle this
# assertion in your code before attempting read.csv():
stopifnot(Sys.getlocale("LC_CTYPE") == "pt_PT.ISO8859-1")

That should work.
Strictly, the Sys.setlocale() command should go in your ~/.Rprofile for startup, not inside your R session or source code.
However, Sys.setlocale() can fail, so be aware of that. Also, assert Sys.getlocale() inside your setup code early and often, as I do. (Really, read.csv() should figure out whether the encoding it uses is compatible with the locale, and warn or error if not.)
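
A minimal sketch of what that might look like in ~/.Rprofile, assuming the ISO8859-1 locale is the one that worked for you (Sys.setlocale() returns an empty string, with a warning, when it fails):

# in ~/.Rprofile; the locale name is whichever one worked in the tests above
if (Sys.setlocale("LC_ALL", "pt_PT.ISO8859-1") == "") {
  warning("could not set locale pt_PT.ISO8859-1")
}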

Let us know which fix worked! I'm trying to document this more generally so we can figure out the correct enhancement.

  1. Screenshot of the RStudio Preferences 'Change default text encoding' menu (image not reproduced here).

Character encoding in R

I found the answer myself. The problem was with the transformation from UTF-8 to the system locale (the default encoding in R) through fileEncoding. Since I use RStudio, I just changed the default text encoding to UTF-8 and removed fileEncoding = "UTF-8-BOM" from read.csv(). Then the entire csv file was read, and RStudio displayed all characters correctly.
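
In other words, a sketch of the change (data.csv stands in for the actual file name):

# before: declaring the BOM-marked encoding explicitly, converting to the system locale
df <- read.csv("data.csv", fileEncoding = "UTF-8-BOM")
# after: with RStudio's default text encoding set to UTF-8
df <- read.csv("data.csv")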

R, Windows and foreign language characters

Try setting the locale. For example,

Sys.setlocale(locale = "Russian")

See ?Sys.setlocale for more information.

write.csv strange encoding in R

It may be difficult to provide a precise answer without sample data or reproducible code. Having said that, as a first step you can try forcing the export of your data with a specific encoding, for example UTF-8:

con <- file("filename", encoding = "UTF-8")
write.csv(..., file = con, ...)
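
For instance, with a small hypothetical data frame (when given an unopened connection, write.csv() opens and closes it for you):

con <- file("out.csv", encoding = "UTF-8")
write.csv(data.frame(x = 1:2, y = c("a", "b")), file = con)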

You can also run l10n_info(), which reports the local encoding you currently have:

> l10n_info()
$MBCS
[1] FALSE

$`UTF-8`
[1] FALSE

$`Latin-1`
[1] TRUE

$codepage
[1] 1252

