How to use a non-ASCII symbol (e.g. £) in an R package function?
Looks like "Writing R Extensions" covers this in Section 1.7.1 "Encoding Issues".
One of the recommendations on that page is to use the Unicode escape \uxxxx. Since £ is U+00A3, you can use:
formatPound <- function(x, digits = 2, nsmall = 2, symbol = "\u00A3") {
  paste(symbol, format(x, digits = digits, nsmall = nsmall))
}
formatPound(123.45)
[1] "£ 123.45"
Using non-ASCII characters inside functions for packages
For the \uxxxx escapes, you need to know the hexadecimal number of your character. You can determine it using charToRaw:
sprintf("%X", as.integer(charToRaw("£")))
[1] "A3"
Now you can use this to specify your non-ASCII character. Both \u00A3 and £ represent the same character.
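A related point, not from the original answer: charToRaw() returns raw bytes, which depend on the string's encoding, while base R's utf8ToInt() returns the Unicode code point directly, so it also works for characters beyond Latin-1:

```r
# utf8ToInt() gives the code point, which maps straight onto \uxxxx:
sprintf("\\u%04x", utf8ToInt("\u00a3"))  # pound sign
# [1] "\\u00a3"
sprintf("\\u%04x", utf8ToInt("\u279b"))  # an arrow outside Latin-1
# [1] "\\u279b"
```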
Another option is to use stringi::stri_escape_unicode:
library(stringi)
stringi::stri_escape_unicode("➛")
# "\\u279b"
This informs you that "\u279b" represents the character "➛".
Encoding problem when your package contains functions with non-English characters
The key trick is replacing the non-ASCII characters with their Unicode code points, i.e. the \uxxxx encoding. These can be generated via the stringi::stri_escape_unicode() function.
Note that since you will have to remove the Korean characters from your code completely to pass R CMD check, you will need to manually copy each string, re-encode it via {stringi} on the command line, and paste the result back into every R script included in the package.
I am not aware of an available automated solution for this problem.
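That said, a rough automation sketch is possible; escape_non_ascii and escape_file below are hypothetical helpers of my own, not part of any package. The point is to escape only the non-ASCII characters, leaving quotes, backslashes, and other code syntax untouched (unlike stri_escape_unicode(), which also escapes some ASCII special characters):

```r
# Hypothetical helper: replace each non-ASCII character in a line with
# its \uxxxx escape, leaving all ASCII characters untouched.
escape_non_ascii <- function(line) {
  chars <- strsplit(line, "")[[1]]
  is_non_ascii <- vapply(chars, function(ch) utf8ToInt(ch) > 127L, logical(1))
  chars[is_non_ascii] <- sprintf("\\u%04x",
                                 vapply(chars[is_non_ascii], utf8ToInt, integer(1)))
  paste(chars, collapse = "")
}

# Hypothetical wrapper: rewrite an .R file in place (work on a copy or
# under version control, since this overwrites the file).
escape_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8")
  writeLines(vapply(lines, escape_non_ascii, character(1)), path)
}
```

For example, escape_non_ascii('grepl("연결", x)') escapes only the Korean characters inside the string literal while leaving the surrounding grepl() call intact.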
In the specific use case of the example provided the unicode would read like this:
sampleprob <- function(url) {
  # stringi::stri_escape_unicode("연결재무제표 주석") to get the \uxxxx codes
  result <- grepl("\uc5f0\uacb0\uc7ac\ubb34\uc81c\ud45c \uc8fc\uc11d",
                  rvest::html_text(xml2::read_html(url)))
  return(result)
}
sampleprob("http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20200330003851")
[1] TRUE
This will be a hassle, but it seems to be the only way to make your code platform neutral (which is a key CRAN requirement, and thus subject to R CMD check).
Issue with Non-ASCII Characters in R
Upon further investigation, the answer is that the '�' is already decoded. At some point the original characters were not decoded, so Windows defaults to essentially saying "I don't know what this is", and it does that for any non-ASCII character.
For example, there is no distinguishing between á and ¿ once this point is reached. There are crosswalks available for these types of characters, but they would not work here, as the replacement would need to happen at the language level, which is an entirely different issue.
Essentially, one would either have to replace or remove the '�' and run a spell checker in multiple languages.
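One concrete detail worth adding (my note, not from the answer above): the '�' glyph is U+FFFD, the Unicode replacement character, so once the text is in UTF-8 it can be flagged or stripped directly:

```r
# U+FFFD is what the '�' decodes to, so it can be matched literally.
x <- c("caf\ufffd", "plain")
grepl("\ufffd", x)     # flag damaged strings
# [1]  TRUE FALSE
gsub("\ufffd", "", x)  # strip the replacement characters
# [1] "caf"   "plain"
```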
Detect non-ASCII characters in a string
Another possible approach is to convert your string to ASCII and then detect the non-printable control characters generated for everything that could not be converted:
grepl("[[:cntrl:]]", stringi::stri_enc_toascii(x))
## [1] TRUE FALSE TRUE FALSE
Though it seems stringi has a built-in function for this type of thing too:
stringi::stri_enc_mark(x)
# [1] "latin1" "ASCII" "latin1" "ASCII"
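If you would rather avoid the stringi dependency, a base-R alternative (my suggestion, not part of the answer above) is a character-class regex, plus tools::showNonASCII() for eyeballing the offenders:

```r
# A regex flagging any character outside the ASCII range:
x <- c("Ekstr\u00f8m", "plain ASCII")
grepl("[^\x01-\x7f]", x)
# [1]  TRUE FALSE

# tools::showNonASCII() prints only the elements containing non-ASCII bytes:
tools::showNonASCII(x)
```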
Replacement of non-ascii character in R function without warning in devtools::check()
You can use sprintf("%X", utf8ToInt("°")) to get the code point for the \uxxxx escape, and then use that in the R code. (Note that charToRaw() returns raw bytes rather than a code point, and the bytes depend on the string's encoding, so it is not a reliable way to build \uxxxx escapes.) For example, I have code that uses ã in the word Não. To get through devtools::check(), this is needed:
sprintf("%X", utf8ToInt("ã")) # answer is "E3"
Then, Não becomes N\u00e3o in my code, and problem solved.
Removing non-ASCII characters from data files
To simply remove the non-ASCII characters, you could use base R's iconv(), setting sub = "". Something like this should work:
x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher") # e.g. from ?iconv
Encoding(x) <- "latin1" # (just to make sure)
x
# [1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"
iconv(x, "latin1", "ASCII", sub="")
# [1] "Ekstrm" "Jreskog" "bichen Zrcher"
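A variant worth knowing (documented in ?iconv, though not mentioned in the answer above): sub = "byte" replaces each untranslatable byte with its hex code instead of dropping it, which helps diagnose what the original character was:

```r
# sub = "byte" keeps a trace of what was removed:
x <- c("Ekstr\xf8m", "J\xf6reskog")
Encoding(x) <- "latin1"
iconv(x, "latin1", "ASCII", sub = "byte")
# [1] "Ekstr<f8>m"  "J<f6>reskog"
```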
To locate non-ASCII characters, or to find if there were any at all in your files, you could likely adapt the following ideas:
## Do *any* lines contain non-ASCII characters?
any(grepl("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII")))
[1] TRUE
## Find which lines (e.g. read in by readLines()) contain non-ASCII characters
grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII"))
[1] 1 2 3