How Does R Handle Unicode/UTF-8

How to handle UTF-8 characters in R

I had some non-ASCII characters in my tweets.
Using this code

# Convert to ASCII; sub = "" drops every character that cannot be represented.
tweet_corpus <- tm_map(tweet_corpus, function(x) iconv(x, "latin1", "ASCII", sub = ""))

I was able to solve the issue.
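
To see what that iconv() call does outside of a corpus, here is a minimal sketch on a plain character vector. The sample strings are made up, and the sketch assumes the input is UTF-8 rather than latin1, so adjust the "from" argument to whatever your data really is.

```r
# Made-up sample text; \u escapes keep this script itself ASCII-only.
tweets <- c("caf\u00e9 time", "na\u00efve idea", "plain ascii")

# sub = "" silently drops every byte that has no ASCII equivalent.
ascii_only <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")
print(ascii_only)

# sub = "byte" leaves a visible <xx> marker instead, which helps when debugging.
with_markers <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "byte")
print(with_markers)
```

Note that dropping characters is lossy. If you only need to find the affected tweets, the "byte" variant makes them easy to spot first.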

UTF-8 and unicode in modelsummary's modelplot() figures

I cannot replicate this problem on my Linux or Mac machines, so this appears to be a Windows-specific issue. UTF-8 and unicode support in R is notoriously finicky on Windows.

That said, on my own Windows machine at least, the code below produces the graph you want. The trick is to assign model names to the list after creating the list.

library(modelsummary)

model <- list(
  lm(mpg ~ ., data = mtcars),
  lm(Sepal.Length ~ ., data = iris))
names(model) <- c("モデル2", "モデル1")

modelplot(model)

The code above does produce a warning, and I'm not sure how to get rid of it. Frankly, I am not an encoding expert, so if anyone has insight into this issue, please join the discussion here:

https://github.com/vincentarelbundock/modelsummary/issues/345

Read in file with UTF-8 character in path in R

At first I thought your locale was the problem; Windows-1252 doesn't contain "Ń". But I couldn't reproduce your error, even with filenames like "Ń.rds" under latin1 encoding and a German locale.

But the amount of whitespace in your error was more than I got for files that didn't exist... Then I spotted the leading space in your example output.

[1] " G:/Users/SomeUser/Documents/University/2021/Project_M/data/procyclingstats/BORA_hansgrohe /POLJAŃSKI_Paweł_sprinter_point.rds"

That could explain why it prints "okay" (we don't see whitespace), but trying to read would fail. It does leave me puzzled about why your other files read without problem.
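
A quick way to check for this kind of stray whitespace is sketched below. The file name is a hypothetical ASCII stand-in written to a temporary directory, not your real path, so the example runs anywhere.

```r
# Create a small file to read back; the name is a made-up stand-in.
good_path <- file.path(tempdir(), "example_sprinter_point.rds")
saveRDS(mtcars, good_path)

# Simulate the problem: the same path with an accidental leading space.
bad_path <- paste0(" ", good_path)

# Quoting the path makes otherwise invisible whitespace visible.
print(sQuote(bad_path, q = FALSE))

exists_before <- file.exists(bad_path)

# trimws() strips leading and trailing whitespace only.
clean_path <- trimws(bad_path)
exists_after <- file.exists(clean_path)

print(c(before = exists_before, after = exists_after))
```

trimws() will not touch a space in the middle of a path, such as the one before the last slash in the output above, so that would need to be fixed where the path is built.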

If that isn't the problem, then it may be the relatively recent support for UTF-8 in Windows. Historically, Windows has used UCS-2 and UTF-16 internally. "Turning on" UTF-8 support requires a different C runtime. There is an experimental build of R that uses that runtime, which you could try out. But that requires you to rebuild your libraries (readr!) with that runtime too.

Before messing up your whole R installation, I'd test whether the experimental build can read a file called Ń.csv.

How to source() .R file saved using UTF-8 encoding?

We talked about this a lot in the comments on my previous post, but I don't want it to get lost on page 3 of the comments: you have to set the locale. It works both with input from the R console (see the screenshot in the comments) and with input from a file.

The file "myfile.r" contains:

russian <- function() print("Американские с...")

The console contains:

source("myfile.r", encoding="utf-8")
> Error in source(".....
Sys.setlocale("LC_CTYPE","ru")
> [1] "Russian_Russia.1251"
russian()
[1] "Американские с..."

Note that reading the file fails, and the error points to the same character as the original poster's error (the one after "R). I cannot do this with Chinese because I would have to install "Microsoft Pinyin IME 3.0", but the process is the same: you just replace the locale with "chinese" (the naming is a bit inconsistent; consult the documentation).
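
Before changing anything, it can help to look at what locale the session is actually using. This is a small sketch using base R only; the message at the end is my own suggestion, not output from R.

```r
# Report the character-type locale of the current session.
ctype <- Sys.getlocale("LC_CTYPE")
cat("LC_CTYPE:", ctype, "\n")

# l10n_info() reports whether the session is running in a UTF-8 locale.
is_utf8 <- isTRUE(l10n_info()[["UTF-8"]])
cat("UTF-8 session:", is_utf8, "\n")

# Only consider the locale switch described above when the session is not
# already UTF-8. Locale names differ by platform, so none is hard-coded here.
if (!is_utf8) {
  message("Not a UTF-8 session; consider Sys.setlocale(\"LC_CTYPE\", ...) before source().")
}
```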

convert utf8 code point strings like U+0161 to utf8

Perhaps:

library(stringi)
library(magrittr)

"foo<U+0161>bar and cra<U+017E>y" %>%
  stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>%
  stri_unescape_unicode() %>%
  stri_enc_toutf8()
## [1] "foošbar and cražy"

may work (I don't need the last conversion on macOS but you may on Windows).
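
If you would rather avoid the extra packages, something like the following base R sketch should do the same job. The function name decode_codepoints is made up for this example, and it assumes every marker has the form <U+XXXX> with hexadecimal digits.

```r
# Hypothetical helper: replace <U+XXXX> markers with the characters they name.
decode_codepoints <- function(x) {
  # Locate every <U+...> marker in each element of x.
  m <- gregexpr("<U\\+[0-9A-Fa-f]+>", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    # Strip the "<U+" prefix and ">" suffix, leaving only the hex digits.
    hex <- gsub("^<U\\+|>$", "", tokens)
    # Convert each hex code point to its UTF-8 character.
    vapply(strtoi(hex, 16L), intToUtf8, character(1))
  })
  x
}

decoded <- decode_codepoints("foo<U+0161>bar and cra<U+017E>y")
print(decoded)
```

Strings without any markers pass through unchanged, so it is safe to run over a whole column.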


