Why Do Some Unicode Characters Display in Matrices, But Not Data Frames in R

I hate to answer my own question, but although the comments and answers helped, they weren't quite right. In Windows, it doesn't seem like you can set a generic 'UTF-8' locale. You can, however, set country-specific locales, which will work in this case:

Sys.setlocale("LC_CTYPE", locale = "Chinese")
q2  # Works fine
#   q
# 1 天

But it does make me wonder why exactly format() seems to use the locale; I wonder if there is a way to have it ignore the locale on Windows. I also wonder if there is some generic UTF-8 locale on Windows that I don't know about.
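A quick way to poke at the locale dependence is to check the declared encoding of a string and what format() hands back; a minimal sketch (the final console rendering still depends on your OS and locale):

```r
x <- "\u5929"            # 天, created with a Unicode escape
Encoding(x)              # strings built from \u escapes are declared "UTF-8"
# data.frame printing ultimately runs each column through format(),
# which is where the locale-dependent conversion can bite on Windows
identical(format(x), x)  # TRUE in a UTF-8 locale
```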

Tibetan characters in data.frames cannot be displayed in R console even after locale is set (Chinese is fine, Tibetan in matrices is fine)

This appears to be fixed in R 4.2.0:

> Sys.setlocale("LC_CTYPE", "Tibetan")
[1] "Tibetan_China.utf8"
>
> data.frame(a="བོད་")
     a
1 བོད་

There was no satisfactory answer at the time I posted this, but it certainly works well now!
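Since the fix depends on running under a native UTF-8 locale, a script that also has to work on older R can check for that up front. A minimal sketch (l10n_info() reports whether the current locale is UTF-8):

```r
# TRUE on R >= 4.2.0, where R on Windows switched to a native UTF-8 locale
getRversion() >= "4.2.0"

# TRUE when the session's current locale is UTF-8
l10n_info()[["UTF-8"]]
```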

Differences in Unicode character output with print()

So, apparently this character conversion issue is unlikely to resolve itself in the near future, and will probably only be solved at the OS level. But based on the excellent suggestions made by @YihuiXie in the comments, there are two ways this issue can be worked around. The best solution will depend on the context in which you are creating the tables.

Scenario 1: Tables Only

If the only type of object you need to output from inside your for-loop are tables, then you can accumulate the kable objects in a list inside the loop, then collapse the list of kables into a single character vector at the conclusion of the loop, and display it using knitr::asis_output.

```{r, results="asis"}
library(knitr)
character_list <- list(eta="\U03B7", sigma="\U03C3")
kable_list <- vector(mode="list", length = length(character_list))

for (i in seq_along(character_list)) {
  kable_list[[i]] <- knitr::kable(as.data.frame(character_list[i]),
                                  format = "html")
}

knitr::asis_output(paste(kable_list, collapse = '\n'))
```

This produces the expected tables in the HTML document (sample image in the original post; both η and σ render correctly).

Scenario 2: Tables and other objects (e.g. Plots)

If you're outputting both tables and other objects (e.g., plots) on each iteration of your for-loop, then the above solution won't work - you can't coerce your plots to a character vector! At this point, we have to resort to some post-processing of the kable output by writing a customized knitr output hook.

The basic approach will be to replace the busted sequences in the table cells with the equivalent HTML entities. Note that because the table is created in a results="asis" chunk, we have to override the chunk-level output hook, not the output-level output hook (confusing, I know).

```{r hook_override}
library(knitr)
default_hook <- knit_hooks$get("chunk")

knit_hooks$set(chunk = function(x, options) {
  # Only attempt substitution if output is a character vector,
  # which I *think* it always should be
  if (is.character(x)) {
    # Match the <U+XXXX> pattern in the output
    match_data <- gregexpr("<U\\+[0-9A-F]{4,8}>", x)
    # If a match is found, proceed with HTML entity substitution
    if (length(match_data[[1]]) >= 1 && match_data[[1]][1] != -1) {
      # Extract the matched strings from the output
      match_strings <- unlist(regmatches(x, match_data))
      # Extract the hexadecimal Unicode sequences from inside the <U+ > bracketing
      code_sequences <- unlist(regmatches(match_strings,
                                          gregexpr("[0-9A-F]{4,8}", match_strings)))
      # Strip any leading zeros, then build hex HTML entities (&#xXXXX;)
      code_sequences <- sub("^0+", "", code_sequences)
      regmatches(x, match_data) <- list(paste0("&#x", code_sequences, ";"))
    }
  }
  # "Print" the output
  default_hook(x, options)
})
```

```{r tables, results="asis"}
character_list <- list(eta="\U03B7", sigma="\U03C3")
for (i in seq_along(character_list)) {
  x <- knitr::kable(as.data.frame(character_list[i]),
                    format = "html")
  print(x)
}
```

```{r hook_reset}
knit_hooks$set(chunk = default_hook)
```

This produces the tables in the HTML document (sample image in the original post).

Note that this time, the sigma doesn't display as σ like it did in the first example, it displays as s! This is because the sigma gets converted to an s before it gets to the chunk output hook! I have no idea how to stop that from happening. Feel free to leave a comment if you do =)

I also realize that using regular expressions to do the substitutions within the HTML table is probably fragile. If this approach happens to fail for your use case, perhaps using the rvest package to parse out each table cell individually would be more robust.
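For reference, the substitution pipeline can also be exercised outside of knitr; this standalone sketch (the function name is mine) converts R's <U+XXXX> escape notation into hex HTML entities:

```r
# Convert <U+XXXX> escapes in a string to HTML hex entities (&#xXXXX;)
to_html_entity <- function(x) {
  m <- gregexpr("<U\\+[0-9A-F]{4,8}>", x)
  hits <- regmatches(x, m)[[1]]
  if (length(hits) > 0) {
    # strip the "<U+" prefix (with any leading zeros) and the ">" suffix
    codes <- sub("^<U\\+0*", "", sub(">$", "", hits))
    regmatches(x, m) <- list(paste0("&#x", codes, ";"))
  }
  x
}

to_html_entity("<td><U+03B7></td>")   # "<td>&#x3B7;</td>"
```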

Generic way to avoid special characters in R

This should help you:

Encoding(x) <- "UTF-8"   # declare that x is UTF-8 encoded

iconv(dtm, "UTF-8", "ASCII", sub="")   # convert to ASCII, dropping unrepresentable characters
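To illustrate what the iconv() call does (a minimal example; dtm above stands for the asker's own object):

```r
x <- "Kory\u010dany"    # "Koryčany", declared UTF-8
# Drop every character that cannot be represented in ASCII
iconv(x, from = "UTF-8", to = "ASCII", sub = "")       # "Koryany"
# Or show the lost bytes explicitly instead of dropping them
iconv(x, from = "UTF-8", to = "ASCII", sub = "byte")   # "Kory<c4><8d>any"
```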

utf-8 characters get lost when converting from list to data.frame in R

This is not a perfect answer, but the following workaround solved the problem for me. I tried to understand the behavior of R, and made the example so that my R script produces the same results on both the Windows and Linux platforms:

(1) Get XML data in UTF-8 from the Internet

library(XML)
url <- "http://hydrodata.info/chmi-h/cuahsi_1_1.asmx/GetSiteInfoObject?site=CHMI-H:1263&authToken="
doc <- xmlRoot(xmlTreeParse(url, getDTD=FALSE, useInternalNodes = TRUE))
infoList <- xmlToList(doc[[2]][[1]])
siteName <- infoList$siteName

(2) Print out the text from the Internet: Encoding is UTF-8, display in the R console is also correct using both the Czech and the English locale on Windows:

> Sys.getlocale(category="LC_CTYPE")
[1] "English_United States.1252"
> print(siteName)
[1] "Koryčany nad přehradou"
> Encoding(siteName)
[1] "UTF-8"
>

(3) Try to create and view a data.frame. This has a problem. The data.frame displays incorrectly both in the RStudio view and in the console:

df <- data.frame(name=siteName, id=1)
df
                    name id
1 Korycany nad prehradou  1

(4) Try to use a matrix instead. Surprisingly the matrix displays correctly in the R console.

m <- as.matrix(df)
View(m) #this shows incorrectly in RStudio
m #however, this shows correctly in the R console.
     name                     id 
[1,] "Koryčany nad přehradou" "1"

(5) Change the locale. If I'm on Windows, set locale to Czech. If I'm on Unix or Mac, set locale to UTF-8. NOTE: This has some problems when I run the script in RStudio, apparently RStudio doesn't always react immediately to the Sys.setlocale command.

#remember the original locale.
original.locale <- Sys.getlocale(category="LC_CTYPE")

#for Windows set locale to Czech. Otherwise set locale to UTF-8
new.locale <- ifelse(.Platform$OS.type=="windows", "Czech_Czech Republic.1250", "en_US.UTF-8")
Sys.setlocale("LC_CTYPE", new.locale)

(7) Write the data to a text file. IMPORTANT: don't use write.csv but instead use write.table. When my locale is Czech on my English Windows, I must use fileEncoding="UTF-8" in write.table. Now the text file shows up correctly in Notepad++ and also in Excel.

write.table(m, "test-czech-utf8.txt", sep="\t", fileEncoding="UTF-8")

(8) Set the locale back to original

Sys.setlocale("LC_CTYPE", original.locale)

(9) Try to read the text file back into R. NOTE: when reading the file, I had to set the encoding parameter (NOT fileEncoding!). The display of a data.frame read from the file is still incorrect, but when I convert the data.frame to a matrix the Czech UTF-8 characters are preserved:

data.from.file <- read.table("test-czech-utf8.txt", sep="\t", encoding="UTF-8")

# the data.frame still has the display problem, "č" and "ř" get "lost"
> data.from.file
                    name id
1 Korycany nad prehradou  1

# see if a matrix displays correctly: YES it does!
matrix.from.file <- as.matrix(data.from.file)
> matrix.from.file
  name                     id 
1 "Koryčany nad přehradou" "1"

So the lesson learnt is that I need to convert my data.frame to a matrix and set my locale to Czech (on Windows) or to UTF-8 (on Mac and Linux) before I write my data with Czech characters to a file. When I write the file, I must make sure fileEncoding is set to UTF-8. On the other hand, when I later read the file, I can keep working in the English locale, but in read.table I must set encoding="UTF-8".

If anybody has a better solution, I'll welcome your suggestions.
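The write/read round trip described above can be condensed into a self-contained sketch (using a temp file instead of the fixed filename):

```r
# Write a data.frame with Czech characters to a UTF-8 file, then read it back
df <- data.frame(name = "Kory\u010dany nad p\u0159ehradou", id = 1)
tf <- tempfile(fileext = ".txt")

# fileEncoding= controls the encoding of the file being written
write.table(as.matrix(df), tf, sep = "\t", fileEncoding = "UTF-8")

# encoding= (NOT fileEncoding=) declares the encoding of what was read
data.from.file <- read.table(tf, sep = "\t", encoding = "UTF-8")
data.from.file$name == df$name   # TRUE: the characters survived the round trip

unlink(tf)
```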

Displaying the contents in local language : R

You likely have the text that you want; it is simply being displayed incorrectly.

I can reproduce your problem. Your example data had the same strings 10 times; to keep the display reasonable, I am only repeating them 3 times.

## Hex codes from your example
S1 = c("0926", "094B", "0932", "0916", "093E")
S2 = c("0915", "093E", "0932", "093F", "0928", "094D", "091A", "094B", "0915")
S3 = c("0917", "093E", "0909", "0901", "092A", "093E", "0932", "093F", "0915", "093E")

## Convert to Devanagari strings
X1 = rep(intToUtf8(strtoi(S1, base=16L)), 3)
X2 = rep(intToUtf8(strtoi(S2, base=16L)), 3)
X3 = rep(intToUtf8(strtoi(S3, base=16L)), 3)

df = data.frame(X1, X2, X3, stringsAsFactors=FALSE)

Now X1 will display correctly, but df will not.

Bizarrely, df$X1 and df[, 1] will display the Unicode, but df[1, ] will not.

A workaround is that as.matrix(df) will display the whole thing as Unicode characters.

This is apparently a known bug in the Windows version of RGui. Some other explorations of this can be found at this earlier SO question and this mailing list post.
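The hex-to-string step used above can be checked in isolation, since utf8ToInt() is the inverse of intToUtf8():

```r
# Code points (hex) for the first example string
S1 <- c("0926", "094B", "0932", "0916", "093E")
codepoints <- strtoi(S1, base = 16L)   # 2342 2379 2354 2326 2366
X1 <- intToUtf8(codepoints)

nchar(X1)                              # 5 code points
identical(utf8ToInt(X1), codepoints)   # TRUE
```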

Addendum

Writing these strings to a readable Unicode file requires some care. The following creates a CSV file for my example.

Mat = as.matrix(df)
F <- file("Test1.csv", "wb", encoding="UTF-8")
BOM <- charToRaw('\xEF\xBB\xBF')   # UTF-8 byte-order mark, so Excel detects the encoding
writeBin(BOM, F)
for (r in 1:nrow(Mat)) {
  Line = paste(Mat[r, ], collapse=",")
  writeLines(Line, F, useBytes=TRUE)
}
close(F)

