How to Identify/Delete Non-UTF-8 Characters in R

How to identify/delete non-UTF-8 characters in R

Another solution uses iconv and its sub argument, a character string. If sub is not NA (here I set it to ''), it is used to replace any non-convertible bytes in the input.

x <- "fa\xE7ile"
Encoding(x) <- "UTF-8"
iconv(x, "UTF-8", "UTF-8", sub = '') ## replace any non-UTF-8 bytes with ''
# [1] "faile"

Note that if we instead declare the correct source encoding, the character is converted rather than dropped:

x <- "fa\xE7ile"
Encoding(x) <- "latin1"
xx <- iconv(x, "latin1", "UTF-8", sub = '')
xx
# [1] "façile"
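
To identify the offending elements before deleting anything, a minimal sketch along these lines may help. It assumes R >= 3.3.0 for validUTF8(); the second vector element is a made-up control case, not part of the original example.

```r
# Sample input: the first element contains a lone 0xE7 byte (invalid UTF-8);
# the second is a plain ASCII control case added for illustration.
x <- c("fa\xE7ile", "facile")

# validUTF8() reports which elements are valid UTF-8
valid <- validUTF8(x)
valid
# [1] FALSE  TRUE
which(!valid)  # positions of the problem strings

# With the default sub = NA, iconv() returns NA for non-convertible input
converted <- iconv(x, "UTF-8", "UTF-8")
is.na(converted)

# sub = "byte" shows each offending byte as its hex code in the form <xx>
marked <- iconv(x, "UTF-8", "UTF-8", sub = "byte")
marked
# [1] "fa<e7>ile" "facile"
```

Seeing the raw byte value first (here e7, which is ç in latin1) can help decide whether to delete the byte or to re-declare the encoding.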

How to remove non UTF-8 characters from text

The signature of gsub is:

gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

Not sure what you wanted to do with

gsub("’","‘","",txt)

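but that line is probably not doing what you want it to do: with four positional arguments, "‘" is taken as the replacement, "" as x, and txt as ignore.case, so txt itself is never searched.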

See also the previous SO question on gsub and non-ASCII symbols.
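
If the goal was to strip both curly quotes, one possible call is sketched below. This is an assumption about the intent, since the original line does not make it clear; the quotes are written as \u escapes so the example does not depend on the encoding of the script file.

```r
# Assumed intent: delete both curly quotes from txt
txt <- "\u2019xxx\u2018"  # the same string as "’xxx‘"

# Arguments in order: pattern, replacement, x.
# The bracket expression matches either curly quote.
cleaned <- gsub("[\u2018\u2019]", "", txt)
cleaned
# [1] "xxx"
```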

Edit:

Suggested solution using iconv:

Removing all non-ASCII characters:

txt <- "’xxx‘"

iconv(txt, "latin1", "ASCII", sub="")

Returns:

[1] "xxx"    

Removing non-ASCII characters from data files

To simply remove the non-ASCII characters, you could use base R's iconv(), setting sub = "". Something like this should work:

x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher") # e.g. from ?iconv
Encoding(x) <- "latin1" # (just to make sure)
x
# [1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"

iconv(x, "latin1", "ASCII", sub="")
# [1] "Ekstrm" "Jreskog" "bichen Zrcher"

To locate non-ASCII characters, or to find if there were any at all in your files, you could likely adapt the following ideas:

## Do *any* lines contain non-ASCII characters? 
any(grepl("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII")))
# [1] TRUE

## Find which lines (e.g. read in by readLines()) contain non-ASCII characters
grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII"))
# [1] 1 2 3
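
As an alternative to the marker string, the default sub = NA can serve as the detector, since iconv() then returns NA for any element it cannot convert. A small sketch, with a plain-ASCII element added as a made-up control:

```r
# Two latin1 strings from the example above plus an ASCII-only control
x <- c("Ekstr\xf8m", "plain ascii", "Z\xfcrcher")
Encoding(x) <- "latin1"

# With sub left at its default (NA), non-convertible elements come back as NA
has_non_ascii <- is.na(iconv(x, "latin1", "ASCII"))
has_non_ascii
# [1]  TRUE FALSE  TRUE

which(has_non_ascii)
# [1] 1 3
```

One caveat: elements that were already NA in the input are also reported, so this works best on data without missing values.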

Add \\ to escape non-UTF 8 characters within a string using regex

You may use

library(stringr)
str_replace_all(x, "[~!@#$%^&*(){}_\\\\<>?\\[\\]|-]", "\\\\\\0")

A base R approach:

gsub("([]\\~!@#$%^&*(){}_<>?[|-])", "\\\\\\1", "~!@#$%^&*(){}_\\<>?[]|-")

Details

  • [ - start of a character class matching any of the following chars:

    • ~ - ~
    • ! - !
    • @ - @
    • # - #
    • $ - $
    • % - %
    • ^ - ^ (if you put it at the start, escape with \\)
    • & - &
    • * - * (no need to escape inside a character class)
    • ( - (
    • ) - )
    • { - {
    • } - }
    • _ - _ (note it is a word char, and \W would not match it)
    • \\\\ - a \ char (a literal \ escaped with another literal \)
    • < - a <
    • > - >
    • ? - ?
    • \\[ - a [ char (in ICU regex, it must be escaped inside a character class)
    • \\] - a ] char (ibid.)
    • | - a | char (it is not an OR operator inside a character class)
    • - - a - char
  • ] - end of the character class.

In the "\\\\\\0" replacement string, the R parser first reduces the six backslashes to three, giving \\\0. The replacement engine then reads \\ as a single literal backslash and \0 as a backreference to the whole match, which is how stringr (ICU regex) refers to it in R.

Note that the TRE regex used by gsub is a bit trickier: ] must be the first char in the character class, [ should not be escaped, a literal \ should be single in the regex (written \\ in the R string, since no regex escape sequences are supported inside TRE bracket expressions), and - must be at the end. Also, there is no support for a whole-match backreference, so you need to wrap the whole pattern in a capturing group and replace with the \1 backreference.
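
To check that the TRE pattern behaves as described, the sketch below wraps it in a small helper and compares it with a PCRE variant. The helper names and the perl = TRUE version are illustrative additions, not part of the original answer.

```r
# Hypothetical helper: base R (TRE) version, using the pattern from above
escape_tre <- function(s) {
  gsub("([]\\~!@#$%^&*(){}_<>?[|-])", "\\\\\\1", s)
}

# Hypothetical helper: PCRE version; here \, [ and ] are escaped inside the class
escape_pcre <- function(s) {
  gsub("([~!@#$%^&*(){}_\\\\<>?\\[\\]|-])", "\\\\\\1", s, perl = TRUE)
}

input <- "a_b[c]"

# writeLines() shows the actual characters rather than R's escaped print form
writeLines(escape_tre(input))
# a\_b\[c\]

# Both engines should agree on this input
identical(escape_tre(input), escape_pcre(input))
```

Using writeLines() avoids confusion with print(), which would display each backslash doubled.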


