R Reticulate - Reading file path having arabic (persian) UTF-8 characters in file name
Issue resolved, as per JosefZ's comment, by setting Sys.setlocale("LC_ALL", "persian.65001").
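A minimal sketch of that workaround (saving and restoring the locale, and the restriction to LC_CTYPE on restore, are my own additions for illustration, not part of the original comment):

```r
# Hypothetical sketch: switch to the Persian UTF-8 code page (65001)
# before touching a file whose name contains Persian characters,
# then restore the previous locale. Only effective on Windows.
old <- Sys.getlocale("LC_CTYPE")           # remember the current locale
Sys.setlocale("LC_ALL", "persian.65001")   # the fix from JosefZ's comment
# ... read the file with the Persian name here ...
Sys.setlocale("LC_CTYPE", old)             # simplified restore (LC_CTYPE only)
```

On non-Windows systems the Sys.setlocale() call is simply refused with a warning, so the sketch is harmless to run elsewhere.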
How to save non-English characters in text file in R?
Found a workaround in the article Write file as UTF-8 encoding in R for Windows (which gives a thorough explanation):
BOM <- charToRaw('\xEF\xBB\xBF')
writeUtf8 <- function(x, file, bom=FALSE) {
  Encoding(x) <- "UTF-8"  # arguably redundant, but makes the intent explicit
  con <- file(file, "wb")
  if (bom) writeBin(BOM, con, endian="little")
  writeBin(charToRaw(x), con, endian="little")
  close(con)
}
data <- c("चौपाई")
writeUtf8(x=data, file="data.txt")
Explanation (copied from the above-mentioned article, partially truncated):
Difference between Windows and other OSs
To put it as simply as possible: Windows chooses one of many language sets (code pages), whereas Linux and Mac OS work within one all-language set, UTF-8. Because of this difference, Windows forgets the characters of unselected languages, while the other OSs remember the characters of all languages.

Problem on Windows
When text is written to a file, characters of unselected locale languages cannot be handled. Some of them are converted into a similar (but incorrect!) character, and others are written in an escaped format such as <U+222D>. Mind that R is not responsible for this problem; the OS's architecture of switching languages is what generates it. … when R writes UTF-8 text into a file on Windows, characters of an unsupported language are modified. In contrast, all characters are written correctly on Mac OS.

Using binary
There is a solution for this problem: writing a binary file instead of a text file. All applications handling UTF-8 files on Windows use the same trick.

BOM
The BOM should not be used in UTF-8 files; this is what Linux and Mac OS do. But Windows Notepad and some other applications use the BOM, so handling it is needed, despite being formally wrong. The BOM is a 3-byte marker put at the beginning of a text file; because R does not use the BOM, it should be removed on reading.
BOM <- charToRaw('\xEF\xBB\xBF')
Write UTF-8 file
writeUtf8 <- function(x, file, bom=FALSE) {
  con <- file(file, "wb")
  if (bom) writeBin(BOM, con, endian="little")
  writeBin(charToRaw(x), con, endian="little")
  close(con)
}
Specify a UTF-8 string as x= and a file name to write as file=. If you want to read the file only with Windows Notepad, adding a BOM with the bom=TRUE option is a good choice. Note that this is a minimal script, not meant to write a very large file.
Edit
Please note the encoding fix (Encoding(result) <- "UTF-8") added to both the readUtf8 and readUtf8Text functions below.
Reading a UTF-8 file is easy, because functions like readLines have an encoding= option that accepts UTF-8.
readUtf8Text <- function(file) {
  con <- file(file, 'rt')
  result <- readLines(con, encoding='utf-8')
  close(con)
  Encoding(result) <- "UTF-8"  # important
  result
}
If you want to read a UTF-8 file saved by a standard Windows application such as Notepad, you may run into trouble: because Windows Notepad prepends a BOM when writing a UTF-8 file, you must remove the BOM in R, or it will appear as a corrupted character at the beginning of the string. Since R 3.0.0, the UTF-8-BOM encoding is supported and removes the BOM for you. However, if you want to use R 2.15.3 for a while, you must remove the BOM manually. The following code reads a UTF-8 file as binary and removes the BOM. Note that this is a minimal script, not meant to read a very large file.
readUtf8 <- function(file) {
  size <- file.info(file)$size
  con <- file(file, "rb")
  x <- readBin(con, raw(), size, endian="little")
  close(con)
  pstart <- ifelse(all(x[1:3] == BOM), 4, 1)
  pend <- length(x)
  result <- rawToChar(x[pstart:pend])
  Encoding(result) <- "UTF-8"  # important
  result
}
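The R >= 3.0.0 alternative mentioned above can be sketched as follows; the function name readUtf8Bom is my own, and it relies on the documented "UTF-8-BOM" connection encoding, which strips a leading Byte Order Mark automatically:

```r
# Sketch: let the connection strip the BOM itself (R >= 3.0.0).
# "UTF-8-BOM" is a documented encoding value for file() that removes
# a leading Byte Order Mark, if present, while reading.
readUtf8Bom <- function(file) {
  con <- file(file, "rt", encoding = "UTF-8-BOM")
  result <- readLines(con)
  close(con)
  result
}
```

With this, a file saved by Windows Notepad reads cleanly without the manual x[1:3] == BOM check.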
Result
Tested in RStudio 1.3 as well as in RGui 4.0.1 (Windows 10/64-bit, i.e. platform x86_64-w64-mingw32):
> data <- c("चौपाई")
> writeUtf8(x=data, file="data.txt")
>
> data
[1] "चौपाई"
>
> readUtf8Text(file="data.txt")
[1] "चौपाई"
>
> readUtf8(file="data.txt")
[1] "चौपाई"
To demonstrate the importance of Encoding(result) <- "UTF-8" in both read functions for preventing mojibake:
> file <- "data.txt"
> con <- file(file, 'rt')
> result <- readLines(con, encoding='utf-8')
> close(con)
> result # mojibake
[1] "à¤šà¥Œà¤ªà¤¾à¤ˆ"
> Encoding(result) <- "UTF-8"
> result
[1] "चौपाई"
>
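The mechanism behind that fix can be shown in isolation; the bytes below are my own example (the UTF-8 encoding of the Devanagari letter च), not taken from the answer:

```r
# Sketch: Encoding() does not change the stored bytes, it only declares
# how R should interpret them when counting and printing characters.
x <- rawToChar(as.raw(c(0xE0, 0xA4, 0x9A)))  # UTF-8 bytes of Devanagari "cha"
Encoding(x)               # "unknown": interpreted via the native locale
nchar(x, type = "bytes")  # 3 bytes either way
Encoding(x) <- "UTF-8"    # declare the bytes as UTF-8
nchar(x, type = "chars")  # now counted as a single character
```

This is why readLines(con, encoding='utf-8') alone is not enough: the strings still carry no valid UTF-8 declaration, and the console falls back to the native code page, producing mojibake.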
RStudio does not read non-English characters in paths
This is a known bug in RStudio; see https://github.com/rstudio/rstudio/issues/10451. If you're willing to try a fix, we have one in the dailies as of last week:
https://dailies.rstudio.com/