RStudio Not Picking the Encoding I'm Telling It to Use When Reading a File

This problem is caused by the wrong locale being set, whether inside RStudio or command-line R:

  1. If the problem only happens in RStudio and not in command-line R, go to RStudio->Preferences:General, tell us what 'Default text encoding:' is set to, click 'Change' and try Windows-1252, UTF-8 or ISO8859-1 ('latin1') (or else 'Ask' if you always want to be prompted). Screenshot attached at bottom. Let us know which one worked!

  2. If the problem also happens in command-line R, do the following:

Run locale -m in a terminal on your Mac and tell us whether it lists CP1252 or else ISO8859-1 ('latin1'). Dump the list of supported locales (locale -a) if you need to. (You might as well tell us your version of macOS while you're at it.)

For each of those encodings, try switching to the matching locale:

# first try Windows CP1252, although that's almost surely not supported on Mac:
Sys.setlocale("LC_ALL", "pt_PT.1252") # Don't omit the "LC_ALL" first argument, or the call will fail.
Sys.setlocale("LC_ALL", "pt_PT.CP1252") # the name might need to be 'CP1252'

# next try ISO8859-1 ('latin1'), this works for me:
Sys.setlocale("LC_ALL", "pt_PT.ISO8859-1")

# Try "pt_PT.UTF-8" too...

# In your program, make sure Sys.setlocale() worked: put this assertion before any read.csv call:
stopifnot(Sys.getlocale('LC_CTYPE') == "pt_PT.ISO8859-1")

That should work.
Strictly speaking, the Sys.setlocale() command should go in your ~/.Rprofile so it runs at startup, not inside your R session or source code.
However, Sys.setlocale() can fail, so be aware of that. Also, assert Sys.getlocale() early and often in your setup code, as I do. (Really, read.csv should figure out whether the encoding it uses is compatible with the locale, and warn or error if not.)
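Since Sys.setlocale() can fail, the try-each-locale-then-assert routine can be wrapped in a small helper. This is only a sketch: set_first_available_locale is a hypothetical name (not part of base R or of the original answer), and it sets LC_CTYPE rather than LC_ALL to keep the check simple.

```r
# Hypothetical helper (not from the original answer): try each candidate
# locale in order and return the first one the OS accepts.
set_first_available_locale <- function(candidates, category = "LC_CTYPE") {
  for (loc in candidates) {
    # Sys.setlocale() warns and returns "" when the request cannot be honored.
    result <- suppressWarnings(Sys.setlocale(category, loc))
    if (length(result) == 1L && nzchar(result)) {
      return(result)
    }
  }
  NA_character_  # none of the candidates is installed on this machine
}

chosen <- set_first_available_locale(
  c("pt_PT.CP1252", "pt_PT.ISO8859-1", "pt_PT.UTF-8")
)
if (is.na(chosen)) {
  message("No candidate locale available; check `locale -a` in a terminal.")
} else {
  # Same style of assertion as above, placed before any read.csv() call.
  stopifnot(Sys.getlocale("LC_CTYPE") == chosen)
}
```

Returning NA instead of stopping lets the calling script decide whether a missing locale is fatal.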

Let us know which fix worked! I'm trying to document this more generally so we can figure out the correct enhancement.

Screenshot of the RStudio Preferences 'Change default text encoding' menu (image not available).

How can I detect non UTF-8 encoding in RStudio

I realized the answer is really simple: go to Edit => Find (Ctrl + F), enable the Regex option in the search bar, and search for [^\x00-\x7F]+. This matches any run of non-ASCII characters.
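The same check can be done from R code instead of the editor's search bar. The sketch below is a byte-level alternative to the regex search, not the method from the answer: has_non_ascii is a hypothetical helper, and the sample strings are made up for illustration. validUTF8() is base R and reports whether each string is a valid UTF-8 byte sequence.

```r
# Hypothetical helper: TRUE for every string containing a byte above 0x7F.
has_non_ascii <- function(x) {
  vapply(
    x,
    function(s) any(as.integer(charToRaw(s)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
}

# Made-up sample lines: plain ASCII, valid UTF-8, and a latin1 byte (0xE7).
latin1_word <- rawToChar(as.raw(c(0x66, 0x61, 0xE7, 0x69, 0x6C, 0x65)))
lines <- c("plain ascii", "caf\u00e9", latin1_word)

non_ascii  <- has_non_ascii(lines)  # which lines contain non-ASCII bytes
valid_utf8 <- validUTF8(lines)      # which lines are valid UTF-8

# Lines that are non-ASCII and not valid UTF-8 are the suspicious ones.
suspicious <- which(non_ascii & !valid_utf8)
```

With readLines() supplying the lines, the same two vectors point at the line numbers worth inspecting.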

How to read a file with unknown encoding (FDF)

The encoding of the file is mixed.

Most of the PDF seems to be in latin1, as the first characters should be "%âãÏÓ". (See: PDF File header sequence: Why '25 e2 e3 cf d3' bits stream used in many document?)

However, the text within the "/V" command is encoded in UTF-16 big endian. The "fe ff" bytes are actually the byte order mark of the text.

You will probably need to resort to using readBin and converting the bytes to the right encoding. PDFs are horrible to parse.

See the post at http://stat545.com/block034_useR-encoding-case-study.html on how to read files with mixed encoding using readBin. The iconv function may also be useful for encoding conversion.
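A minimal sketch of the readBin-plus-iconv approach follows. The file is synthetic (a latin1 header followed directly by one UTF-16 value with an "fe ff" byte order mark) and the byte offsets are hard-coded for illustration; in a real FDF file you would first have to locate each "/V" value in the byte stream.

```r
# Build a small synthetic file: latin1 header bytes, then BOM + UTF-16BE text.
path   <- tempfile(fileext = ".fdf")
header <- as.raw(c(0x25, 0xE2, 0xE3, 0xCF, 0xD3))            # "%" + 4 latin1 bytes
value  <- as.raw(c(0xFE, 0xFF,                               # byte order mark
                   0x00, 0x4F, 0x00, 0x6C, 0x00, 0xE1))      # "Ol" + a-acute
writeBin(c(header, value), path)

# Read the whole file as raw bytes, with no encoding assumptions.
bytes <- readBin(path, what = "raw", n = file.size(path))

# The latin1 part has no embedded NULs, so rawToChar() is safe here.
header_txt <- iconv(rawToChar(bytes[1:5]), from = "latin1", to = "UTF-8")

# The UTF-16 part contains NUL bytes, so pass the raw vector to iconv()
# inside a list instead of going through rawToChar().
value_raw <- bytes[6:length(bytes)]
if (identical(value_raw[1:2], as.raw(c(0xFE, 0xFF)))) {
  value_raw <- value_raw[-(1:2)]  # drop the byte order mark
}
value_txt <- iconv(list(value_raw), from = "UTF-16BE", to = "UTF-8")

unlink(path)
```

Converting each piece to UTF-8 gives one consistent encoding to work with afterwards.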

Using special characters in Rstudio

This is not an exclusively RStudio problem.

Typing print("Safarzyńska2013") on the console of RGui also converts the special characters to plain letters. Running this code from a UTF-8 encoded script in RGui returns [1] "Safarzy?ska2013".

I don't think that it is a good idea to type such special chars on the console. x <- "SomeString"; Encoding(x) returns "unknown" and that is probably the problem: R has no idea what encoding you are using on the console and probably has no chance to get your original encoding.

I put "Safarzyńska2013\nMāori\n" in a text file encoded with UTF-8. Then the following works fine:

tbl <- read.table('c:/test1.txt', encoding = 'UTF-8', stringsAsFactors = FALSE)
tbl[1,1]
tbl[2,1]
Encoding(tbl[1,1]) # returns "UTF-8"

If you really want to use the console, you will probably have to escape the special chars. In ?Encoding we find the following example to create a word with special chars:

x <- "fa\xE7ile"
Encoding(x)

Actually, I don't know at the moment how to get these codes for your special chars, and ?Encoding has no hints either...
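One possible way to get such codes, offered as a sketch rather than as part of the original answer, is base R's utf8ToInt(), which returns the Unicode code points of a UTF-8 string. The code points can then be written as \u escapes, which R marks as UTF-8. The variable names below are illustrative.

```r
# The same word written with a \u escape instead of a typed special char.
x <- "Safarzy\u0144ska2013"
enc <- Encoding(x)            # strings built from \u escapes are marked UTF-8

# Look up the code points and keep the ones outside ASCII.
code_points <- utf8ToInt(x)
non_ascii   <- code_points[code_points > 127L]

# Format them the way they would be typed inside an R string literal.
escapes <- sprintf("\\u%04X", non_ascii)
print(escapes)
```

The printed value shows the escape sequence to type in place of the special character.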

RStudio character encoding issue: quotation marks replaced by \x92

I just found a solution so I am answering my own question:

Somehow my attempts to set the encoding via the global options menu in RStudio Server did not have any impact on read.csv. (I thought it was supposed to use the encoding specified in the global options by default, i.e. getOption("encoding"), but that does not always seem to be the case...)

Anyway, by specifying the encoding directly in read.csv with the fileEncoding argument and inspecting the data, I could see that this time my different encoding selections had an impact. After a couple of trials, I found that "Windows-1252" gave me what I wanted.
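The fix can be reproduced on a small synthetic file. This is a sketch assuming the R session runs in a UTF-8 locale; the file contents and the column name text are made up for illustration.

```r
# Write a tiny CSV whose apostrophe is the Windows-1252 byte 0x92.
path <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("text\nIt"), as.raw(0x92), charToRaw("s here\n")),
  path
)

# Name the file's encoding explicitly instead of relying on global options.
df <- read.csv(path, fileEncoding = "Windows-1252", stringsAsFactors = FALSE)

# In a UTF-8 session the 0x92 byte should come back as a right single
# quotation mark (U+2019) rather than as a literal \x92.
print(df$text)

unlink(path)
```

Trying a few fileEncoding values on a file like this and inspecting the result mirrors the trial-and-error described above.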

Cannot change encoding in a data frame in R

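This code might be useful for your problem. It sets the character set of the MySQL connection before querying, so the text arrives in a known encoding:

library(RMySQL) # also loads DBI

con <- dbConnect(MySQL(),
                 user = 'user',
                 password = 'password',
                 host = 'url',
                 dbname = 'dbName')
# tell the server which character set to use for this connection
m <- dbGetQuery(con, "SET NAMES 'latin1'")
sqlcmd <- "SELECT * FROM dbName.`users`"
result <- dbGetQuery(con, sqlcmd)
dbDisconnect(con)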

