Manipulating Files with Non-English Names in R

R Reticulate - Reading a file path with Arabic (Persian) UTF-8 characters in the file name

Issue resolved, as per JosefZ's comment, by setting Sys.setlocale("LC_ALL", "persian.65001").
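
For reference, a minimal sketch of that locale switch; the Persian file name below is a hypothetical example, and code page 65001 is Windows' UTF-8 code page:

old <- Sys.getlocale("LC_CTYPE")          # remember the current character-type locale
Sys.setlocale("LC_ALL", "persian.65001")  # Persian language, code page 65001 = UTF-8
x <- readLines("گزارش.txt")               # hypothetical file with a Persian name
Sys.setlocale("LC_CTYPE", old)            # restore LC_CTYPE (the category that matters for file names)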

How to save non-English characters in a text file in R?

Found a workaround in the Write file as UTF-8 encoding in R for Windows article (which explains it thoroughly):

BOM <- charToRaw('\xEF\xBB\xBF')  # the UTF-8 byte order mark: bytes EF BB BF

writeUtf8 <- function(x, file, bom = FALSE) {
  x <- enc2utf8(x)  # convert from the native encoding to UTF-8 if needed
  con <- file(file, "wb")
  if (bom) writeBin(BOM, con, endian = "little")
  writeBin(charToRaw(x), con, endian = "little")  # x must be a single string
  close(con)
}

data <- c("चौपाई")
writeUtf8(x = data, file = "data.txt")

Explanation (copied from the above-mentioned article, partially truncated):

Difference between Windows and other OSs

I will put it as simply as I can. Windows picks one of many
language-specific character sets, whereas Linux and Mac OS pick a
language as a subset of the single UTF-8 set. Because of this
difference, Windows forgets the characters of unselected languages,
while the other OSs remember the characters of all languages.

Problem on Windows

When text is written to a file, characters of languages outside the
selected locale cannot be handled. Some of them are converted into a
similar (but incorrect!) character, and others are written in an
escaped format such as <U+222D>.

Mind that R itself is not responsible for this problem; it is caused
by the OS's architecture for switching languages.

… when R writes UTF-8 text into a file on Windows, characters of
unsupported languages are modified. In contrast, all characters are
written correctly on Mac OS.

Using binary

There is a solution for this problem: write a binary file instead of
a text file. All applications that handle UTF-8 files on Windows use
the same trick.
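
A hedged illustration of the difference (the file names are mine): a text-mode connection pushes the string through the native locale encoding, while a binary-mode connection writes the raw UTF-8 bytes untouched:

txt <- enc2utf8("चौपाई")

con <- file("text-mode.txt", "wt")    # text mode: bytes pass through the locale
writeLines(txt, con)                  # may be mangled on Windows
close(con)

con <- file("binary-mode.txt", "wb")  # binary mode: bytes written verbatim
writeBin(charToRaw(txt), con)
close(con)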

BOM

A BOM should not be used in UTF-8 files, and that is what Linux and
Mac OS do. But Windows Notepad and some other applications do use a
BOM, so it has to be handled, even though it is formally wrong.

The BOM is a 3-byte marker put at the beginning of a text file;
because R does not use the BOM, it should be removed on reading.

BOM <- charToRaw('\xEF\xBB\xBF')

Write UTF-8 file

writeUtf8 <- function(x, file, bom = FALSE) {
  con <- file(file, "wb")
  if (bom) writeBin(BOM, con, endian = "little")
  writeBin(charToRaw(x), con, endian = "little")
  close(con)
}

Specify a UTF-8 string as x= and a file name to write as file=. If
you only need to read the file with Windows Notepad, adding a BOM via
the bom=TRUE option is a good choice. Note that this is a minimal
script and is not meant for writing very large files.
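
For example (my own usage sketch), to produce a Notepad-friendly file:

writeUtf8(x = "चौपाई", file = "data-bom.txt", bom = TRUE)  # BOM helps Notepad detect UTF-8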

Edit

Please note the encoding step (Encoding(result) <- "UTF-8") added to both the readUtf8 and readUtf8Text functions:

Reading UTF-8 is easy, because functions like readLines have an
encoding= option that accepts UTF-8.

readUtf8Text <- function(file) {
  con <- file(file, 'rt')
  # note: readLines only marks strings when encoding is spelled exactly
  # "UTF-8", so the lowercase 'utf-8' here has no effect on its own
  result <- readLines(con, encoding = 'utf-8')
  close(con)
  Encoding(result) <- "UTF-8"  # important: mark the bytes as UTF-8
  result
}

If you want to read a UTF-8 file saved by standard Windows
applications like Notepad, you may run into trouble: Windows Notepad
prepends a BOM when writing a UTF-8 file, so you must remove the BOM
in R, or it will appear as a corrupted character at the beginning of
the string.

R 3.0.0 and later support the "UTF-8-BOM" encoding, which removes the
BOM automatically (a sketch follows the function below). If you want
to keep using R 2.15.3 for a while, however, you must remove the BOM
manually. The following code reads a UTF-8 file as binary and removes
the BOM.

Note that this is a minimum script, and not meant to read a very large
file.

readUtf8 <- function(file) {
  size <- file.info(file)$size
  con <- file(file, "rb")
  x <- readBin(con, raw(), size, endian = "little")
  close(con)
  # skip the 3 BOM bytes if the file starts with them
  pstart <- if (size >= 3 && all(x[1:3] == BOM)) 4 else 1
  result <- rawToChar(x[pstart:length(x)])
  Encoding(result) <- "UTF-8"  # important: mark the bytes as UTF-8
  result
}
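
As an alternative on R 3.0.0 or later, a hedged sketch that lets the connection strip the BOM itself via the "UTF-8-BOM" encoding mentioned above:

con <- file("data.txt", encoding = "UTF-8-BOM")  # connection drops a leading BOM
result <- readLines(con)
close(con)
result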

Result

Tested in RStudio 1.3 as well as in RGui 4.0.1 (Windows 10/64-bit, i.e. platform x86_64-w64-mingw32):

> data <- c("चौपाई")
> writeUtf8(x=data, file="data.txt")
>
> data
[1] "चौपाई"
>
> readUtf8Text(file="data.txt")
[1] "चौपाई"
>
> readUtf8(file="data.txt")
[1] "चौपाई"

To demonstrate the importance of Encoding(result) <- "UTF-8" in both read functions for preventing mojibake:

> file <- "data.txt"
> con <- file(file, 'rt')
> result <- readLines(con, encoding='utf-8')
> close(con)
> result # mojibake
[1] "चौपाई"
> Encoding(result) <- "UTF-8"
> result
[1] "चौपाई"
>

RStudio does not read non-English characters in paths

This is a known bug in RStudio; see https://github.com/rstudio/rstudio/issues/10451. If you're willing to try a fix, we have one in the dailies as of last week:

https://dailies.rstudio.com/


