How to Identify/Delete Non-Utf-8 Characters in R

How to identify/delete non-UTF-8 characters in R

Another solution uses iconv() and its sub argument, a character string. If it is not NA (here I set it to ''), it is used to replace any non-convertible bytes in the input.

x <- "fa\xE7ile"
Encoding(x) <- "UTF-8"
iconv(x, "UTF-8", "UTF-8", sub = '')  ## replace any non-UTF-8 bytes with ''
# [1] "faile"

Note that if we declare the right encoding instead, the character is converted rather than dropped:

x <- "fa\xE7ile"
Encoding(x) <- "latin1"
xx <- iconv(x, "latin1", "UTF-8", sub = '')
xx
# [1] "façile"
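To identify the offending elements before deleting anything, base R's validUTF8() (available since R 3.3.0) can be combined with the iconv() call above. The following is only a sketch: the vector x and the names bad and cleaned are made up for illustration, and the bytes are built with rawToChar() so the example does not depend on how the script file itself is encoded.

```r
# Second element holds the bytes of "fa\xE7ile": 0xE7 is not valid UTF-8 here.
x <- c("facile", rawToChar(as.raw(c(0x66, 0x61, 0xE7, 0x69, 0x6C, 0x65))))

# validUTF8() returns FALSE for elements that are not valid UTF-8.
bad <- !validUTF8(x)
which(bad)
# [1] 2

# Strip the non-convertible bytes only from the elements that need it.
cleaned <- x
cleaned[bad] <- iconv(x[bad], "UTF-8", "UTF-8", sub = "")
all(validUTF8(cleaned))
```

If the real encoding of the input is known (latin1 in the example above), converting from that encoding is usually preferable to deleting bytes.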

How to remove non UTF-8 characters from text

The signature of gsub is:

gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

Not sure what you wanted to do with

gsub("â€™","â€˜","",txt)

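but that line is probably not doing what you want it to do: the arguments are matched by position, so "â€˜" is taken as the replacement, "" as the string to search (x), and txt as ignore.case.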

See here for a previous SO question on gsub and non-ASCII symbols.
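Assuming the intent was to delete both sequences from txt (a guess, since the question does not say), a single gsub() call with an alternation pattern might look like the sketch below. The sample text is invented, and the sequences are written with \u escapes so the script itself stays ASCII.

```r
# "â€™" and "â€˜" spelled out as Unicode escapes.
right_quote <- "\u00e2\u20ac\u2122"  # â€™
left_quote  <- "\u00e2\u20ac\u02dc"  # â€˜

# Hypothetical sample text containing both sequences.
txt <- paste0("It", right_quote, "s a ", left_quote, "test", right_quote)

# pattern, replacement, x: exactly three positional arguments.
cleaned <- gsub(paste(right_quote, left_quote, sep = "|"), "", txt)
cleaned
# [1] "Its a test"
```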

Removing non-ASCII characters from data files

To simply remove the non-ASCII characters, you could use base R's iconv(), setting sub = "". Something like this should work:

x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher") # e.g. from ?iconv
Encoding(x) <- "latin1"  # (just to make sure)
x
# [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"

iconv(x, "latin1", "ASCII", sub="")
# [1] "Ekstrm"        "Jreskog"       "bichen Zrcher"

To locate non-ASCII characters, or to find if there were any at all in your files, you could likely adapt the following ideas:

## Do *any* lines contain non-ASCII characters? 
any(grepl("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII")))
# [1] TRUE

## Find which lines (e.g. read in by readLines()) contain non-ASCII characters
grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII"))
# [1] 1 2 3
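The same idea can be applied to a file read with readLines(). The helper name non_ascii_lines() and the temporary file below are hypothetical, and the sketch assumes the file is UTF-8 encoded; pass a different from value if it is not.

```r
# Hypothetical helper: line numbers of a file that contain non-ASCII bytes.
non_ascii_lines <- function(path, from = "UTF-8") {
  lines <- readLines(path, warn = FALSE)
  # Every non-convertible byte is replaced by the marker string.
  flagged <- iconv(lines, from, "ASCII", sub = "I_WAS_NOT_ASCII")
  grep("I_WAS_NOT_ASCII", flagged, fixed = TRUE)
}

# Small UTF-8 test file; useBytes = TRUE writes the bytes unchanged.
path <- tempfile(fileext = ".txt")
writeLines(c("plain ascii", "J\u00f6reskog", "also plain"), path, useBytes = TRUE)

hits <- non_ascii_lines(path)
hits
# [1] 2
```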

Add \\ to escape non-UTF 8 characters within a string using regex

With stringr, you may use

str_replace_all(x, "[~!@#$%^&*(){}_\\\\<>?\\[\\]|-]", "\\\\\\0")

A base R approach:

gsub("([]\\~!@#$%^&*(){}_<>?[|-])", "\\\\\\1", "~!@#$%^&*(){}_\\<>?[]|-")

Details

[ - start of a character class matching any of the following chars:
- ~ - ~
- ! - !
- @ - @
- # - #
- $ - $
- % - %
- ^ - ^ (if you put it at the start, escape with \\)
- & - &
- * - * (no need to escape inside a character class)
- ( - (
- ) - )
- { - {
- } - }
- _ - _ (note it is a word char, and \W would not match it)
- \\\\ - a \ char (a literal \ escaped with another literal \)
- < - a <
- > - >
- ? - ?
- \\[ - a [ char (in ICU regex, it must be escaped inside a character class)
- \\] - a ] char (ibid.)
- | - a | char (it is not an OR operator inside a character class)
- - - a - char
] - end of the character class.

The "\\\\\\0" string literal holds four characters, \\\0. In the ICU regex replacement used by stringr, the first two backslashes define a single literal backslash, and \0 is a backreference to the whole match.

Note that the TRE regex used by gsub is a bit trickier: ] must be the first char in the character class, [ should not be escaped, a literal \ needs only a single backslash in the pattern (regex escape sequences are not interpreted inside a TRE bracket expression), and - must be at the end. Also, there is no backreference to the whole match, so you need to wrap the whole pattern in a capturing group and replace with the \1 backreference.
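As a rough illustration of the base R variant, the gsub() call from above can be wrapped in a small function and applied to an arbitrary string. The name escape_specials() and the sample input are made up for this sketch.

```r
# Hypothetical wrapper around the base R (TRE) call shown above.
escape_specials <- function(s) {
  gsub("([]\\~!@#$%^&*(){}_<>?[|-])", "\\\\\\1", s)
}

# The replacement literal "\\\\\\1" holds four characters: \\\1
nchar("\\\\\\1")
# [1] 4

# writeLines() shows the actual characters rather than R's escaped form.
writeLines(escape_specials("a*b_c"))
# a\*b\_c
```

Printing the result with print() instead would show each backslash doubled, because print() displays the string in its escaped form.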
