How to Use a Non-ASCII Symbol (e.g. £) in an R Package Function


Looks like "Writing R Extensions" covers this in Section 1.7.1 "Encoding Issues".


One of the recommendations on that page is to use the Unicode escape \uxxxx. Since £ is U+00A3, you can write:

formatPound <- function(x, digits = 2, nsmall = 2, symbol = "\u00A3") {
  paste(symbol, format(x, digits = digits, nsmall = nsmall))
}

formatPound(123.45)
[1] "£ 123.45"

Using non-ASCII characters inside functions for packages

For the \uxxxx escapes, you need the hexadecimal code point of your character. utf8ToInt() gives it directly:

sprintf("%04X", utf8ToInt("£"))
[1] "00A3"

(charToRaw() also works in a latin1 locale, where it returns the single byte A3, but in a UTF-8 locale it returns the two bytes C2 A3 rather than the code point, so utf8ToInt() is the safer choice.)

Now you can use this to specify your non-ASCII character. Both \u00A3 and £ represent the same character.
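A quick check in an R session confirms that the two spellings produce the identical string:

```r
# The \uxxxx escape and the literal character are the same string
identical("\u00A3", "£")
# [1] TRUE

nchar("\u00A3")  # one character, however many bytes the encoding uses
# [1] 1
```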

Another option is to use stringi::stri_escape_unicode:

library(stringi)
stringi::stri_escape_unicode("➛")
# "\\u279b"

This informs you that "\u279b" represents the character "➛".
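stringi also provides the inverse, stri_unescape_unicode(), which turns the escaped form back into the character:

```r
library(stringi)

# Unescaping recovers the original character from the \uxxxx form
stri_unescape_unicode("\\u279b")
# [1] "➛"
```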

Encoding problem when your package contains functions with non-English characters

The key trick is replacing the non-ASCII characters with their Unicode code points, using the \uxxxx escapes.

These can be generated via the stringi::stri_escape_unicode() function.

Note that, to pass R CMD check, the Korean characters must be removed from your code entirely. In practice this means a manual round trip for every R script in the package: copy each string out, re-encode it with {stringi} at the command line, and paste the escaped version back in.

I am not aware of an available automated solution for this problem.

For the example in question, the escaped version reads like this:

sampleprob <- function(url) {
  # stringi::stri_escape_unicode("연결재무제표 주석") gives the \uxxxx codes
  # (the string means "notes to the consolidated financial statements")
  result <- grepl("\uc5f0\uacb0\uc7ac\ubb34\uc81c\ud45c \uc8fc\uc11d",
                  rvest::html_text(xml2::read_html(url)))
  return(result)
}
sampleprob("http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20200330003851")
[1] TRUE

This will be a hassle, but it seems to be the only way to make your code platform neutral (which is a key CRAN requirement, and thus subject to R CMD check).
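While there is no automated rewriter, base R's tools package can at least locate the offending lines for you: tools::showNonASCII() prints the elements of a character vector that contain non-ASCII characters (and tools::showNonASCIIfile() does the same for a whole file). The script lines below are made up for illustration:

```r
# showNonASCII() prints the elements containing non-ASCII characters,
# marking the offending bytes, and invisibly returns those elements
lines <- c('msg <- "\uc8fc\uc11d"',  # hypothetical line with Korean text
           "y <- 1 + 1")            # pure-ASCII line
bad <- tools::showNonASCII(lines)
length(bad)
# [1] 1
```

Running this over readLines() output for each file in R/ narrows the manual re-encoding work down to the lines that actually need it.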

Issue with Non-ASCII Characters in R

Upon further investigation, the answer is that the '�' (the Unicode replacement character, U+FFFD) already is decoded. At some point the original characters were not decoded, so Windows defaults to saying "I don't know what this is", and it does that for any non-ASCII character.

For example, there is no way to distinguish á from ¿ once this point is reached. Crosswalks exist for these kinds of characters, but they would not work here, as the replacement would need to happen at the language level, which is an entirely different issue.

Essentially, one would have to either replace or remove the '�' and run a spell checker in multiple languages.

Detect non-ASCII characters in a string

Another possible way is to convert your string to ASCII and then detect the non-printable control characters that could not be converted:

x <- c("fa\xe7ade", "facade", "Z\xfcrich", "Zurich")  # example input
Encoding(x) <- "latin1"
grepl("[[:cntrl:]]", stringi::stri_enc_toascii(x))
## [1]  TRUE FALSE  TRUE FALSE

Though it seems stringi has a built-in function for this type of thing too:

stringi::stri_enc_mark(x)
# [1] "latin1" "ASCII" "latin1" "ASCII"
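Either test can then be used to keep only the offending elements. A minimal sketch, with a made-up input vector:

```r
library(stringi)

x <- c("fa\xe7ade", "facade")  # hypothetical input
Encoding(x) <- "latin1"

# Keep only the elements stringi does not mark as pure ASCII
x[stri_enc_mark(x) != "ASCII"]
# [1] "façade"
```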

Replacement of non-ASCII character in R function without warning in devtools::check()

You can use utf8ToInt() to get the character's code point, format it as hex, and then use the \uxxxx escape in the R code. For example, I have code that uses ã in the word Não. To get through devtools::check(), this is needed:

 sprintf("%04X", utf8ToInt("ã"))  # answer is "00E3"

Then, Não becomes N\u00e3o in my code, and problem solved. (Beware that charToRaw("ã") returns raw bytes, not a code point — in a UTF-8 session it gives the two bytes c3 a3 — and ã is U+00E3, not U+00A3; \u00a3 is £.)

Removing non-ASCII characters from data files

To simply remove the non-ASCII characters, you could use base R's iconv(), setting sub = "". Something like this should work:

x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher") # e.g. from ?iconv
Encoding(x) <- "latin1" # (just to make sure)
x
# [1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"

iconv(x, "latin1", "ASCII", sub="")
# [1] "Ekstrm" "Jreskog" "bichen Zrcher"
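If you would rather approximate the characters than drop them, iconv() also accepts a //TRANSLIT target, which substitutes ASCII look-alikes (e.g. ø becomes o). The exact output depends on the platform's iconv implementation, so no result is shown here:

```r
x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x) <- "latin1"

# //TRANSLIT asks iconv to approximate non-ASCII characters with
# ASCII equivalents instead of deleting them; results vary by platform
iconv(x, "latin1", "ASCII//TRANSLIT")
```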

To locate non-ASCII characters, or to find if there were any at all in your files, you could likely adapt the following ideas:

## Do *any* lines contain non-ASCII characters? 
any(grepl("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII")))
[1] TRUE

## Find which lines (e.g. read in by readLines()) contain non-ASCII characters
grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII"))
[1] 1 2 3

