How to identify/delete non-UTF-8 characters in R
Another solution uses iconv()
and its argument sub
: a character string. If not NA (here I set it to ''), it is used to replace any non-convertible bytes in the input.
x <- "fa\xE7ile"
Encoding(x) <- "UTF-8"
iconv(x, "UTF-8", "UTF-8",sub='') ## replace any non UTF-8 by ''
[1] "faile"
Here note that if we choose the right encoding:
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
xx <- iconv(x, "latin1", "UTF-8", sub='')
xx
[1] "façile"
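If the goal is to identify the offending bytes rather than delete them, iconv() also accepts sub = "byte", which renders each non-convertible byte as its hex code (a small sketch reusing the string from above):

```r
x <- "fa\xE7ile"
Encoding(x) <- "UTF-8"
# sub = "byte" shows every invalid byte as <xx> instead of dropping it
iconv(x, "UTF-8", "UTF-8", sub = "byte")
# [1] "fa<e7>ile"
```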
How to remove non UTF-8 characters from text
The signature of gsub is:
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
Not sure what you wanted to do with
gsub("’","‘","",txt)
but that call passes four positional arguments (pattern = "’", replacement = "‘", x = "", ignore.case = txt), so it is probably not doing what you want it to do...
See here for a previous SO question on gsub and non-ascii symbols.
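If the intent was simply to strip the curly quotes, gsub() with a character class does it in one call; a minimal sketch (the txt value here is made up for illustration):

```r
txt <- "fancy \u2018quotes\u2019 here"
# a character class removes both curly quote characters in a single pass
gsub("[\u2018\u2019]", "", txt)
# [1] "fancy quotes here"
```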
Edit: suggested solution using iconv():
Removing all non-ASCII characters:
txt <- "’xxx‘"
iconv(txt, "latin1", "ASCII", sub="")
Returns:
[1] "xxx"
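If deleting accented characters outright is too lossy, iconv() can instead transliterate them to close ASCII equivalents via the "ASCII//TRANSLIT" target; note that //TRANSLIT support and its exact output depend on the platform's underlying iconv implementation:

```r
x <- "Ekstr\xf8m"
Encoding(x) <- "latin1"
# //TRANSLIT approximates (e.g. ø -> o on glibc) instead of deleting
iconv(x, "latin1", "ASCII//TRANSLIT")
```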
Removing non-ASCII characters from data files
To simply remove the non-ASCII characters, you could use base R's iconv(), setting sub = "". Something like this should work:
x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher") # e.g. from ?iconv
Encoding(x) <- "latin1" # (just to make sure)
x
# [1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"
iconv(x, "latin1", "ASCII", sub="")
# [1] "Ekstrm" "Jreskog" "bichen Zrcher"
To locate non-ASCII characters, or to find if there were any at all in your files, you could likely adapt the following ideas:
## Do *any* lines contain non-ASCII characters?
any(grepl("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII")))
[1] TRUE
## Find which lines (e.g. read in by readLines()) contain non-ASCII characters
grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII"))
[1] 1 2 3
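Base R also ships helpers for exactly this check: validUTF8() (R >= 3.3.0) reports whether each string's bytes form valid UTF-8, and tools::showNonASCII() prints only the elements that contain non-ASCII characters:

```r
x <- c("Ekstr\xf8m", "J\xf6reskog", "ascii only")
# the latin1 bytes \xf8 and \xf6 are not valid UTF-8 sequences
validUTF8(x)
# [1] FALSE FALSE  TRUE
# print only the offending elements, with non-ASCII bytes escaped
tools::showNonASCII(x)
```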
Add \\ to escape non-UTF-8 characters within a string using regex
You may use
str_replace_all(x, "[~!@#$%^&*(){}_\\\\<>?\\[\\]|-]", "\\\\\\0")
A base R approach:
gsub("([]\\~!@#$%^&*(){}_<>?[|-])", "\\\\\\1", "~!@#$%^&*(){}_\\<>?[]|-")
Details

[    - start of a character class matching any of the following chars:
~    - a ~
!    - a !
@    - a @
#    - a #
$    - a $
%    - a %
^    - a ^ (if you put it at the start, escape with \\)
&    - a &
*    - a * (no need to escape inside a character class)
(    - a (
)    - a )
{    - a {
}    - a }
_    - a _ (note it is a word char, and \W would not match it)
\\\\ - a \ char (a literal \ escaped with another literal \)
<    - a <
>    - a >
?    - a ?
\\[  - a [ char (in ICU regex, it must be escaped inside a character class)
\\]  - a ] char (ibid.)
|    - a | char (it is not an OR operator inside a character class)
-    - a - char
]    - end of the character class.
The "\\\\\\0"
string replacement pattern is parsed as two literal backslashes (which define a single literal backslash in the output) followed by \0, a backreference to the whole match in the ICU regex flavor that stringr uses in R.
Note that the gsub
TRE regex is a bit trickier: ]
must be the first char in the character class, [
should not be escaped, a literal \
must be single (no regex escape sequences are supported inside TRE bracket expressions), and -
must be at the end. Also, there is no support for a whole-match backreference, hence you need to wrap the whole pattern in a capturing group and replace with the \1
backreference.
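Putting the TRE caveats together, the base R call from above can be sanity-checked with cat() (print() would double every backslash in the display):

```r
s <- "~!@#$%^&*(){}_\\<>?[]|-"
# ] first, [ unescaped, a single literal \, and - last; \1 restores the match
cat(gsub("([]\\~!@#$%^&*(){}_<>?[|-])", "\\\\\\1", s))
# \~\!\@\#\$\%\^\&\*\(\)\{\}\_\\\<\>\?\[\]\|\-
```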