Remove all special characters from a string in R?
You need to use regular expressions to identify the unwanted characters. For the most easily readable code, you want the str_replace_all function from the stringr package, though gsub from base R works just as well.
The exact regular expression depends upon what you are trying to do. You could just remove those specific characters that you gave in the question, but it's much easier to remove all punctuation characters.
library(stringr)
x <- "a1~!@#$%^&*(){}_+:\"<>?,./;'[]-=" #or whatever
str_replace_all(x, "[[:punct:]]", " ")
(The base R equivalent is gsub("[[:punct:]]", " ", x).)
An alternative is to swap out all non-alphanumeric characters.
str_replace_all(x, "[^[:alnum:]]", " ")
Note that the definition of what constitutes a letter, a number, or a punctuation mark varies slightly depending upon your locale, so you may need to experiment a little to get exactly what you want.
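To make the locale caveat concrete, here is a minimal base R sketch (the variable names are only illustrative). It contrasts the POSIX class [:alnum:], whose membership can depend on the locale and on the regex engine, with explicit ranges. As far as I know, stringr's ICU engine also treats [[:punct:]] more narrowly than base R and leaves symbols such as $, + and ~ in place, so it is worth checking the result on your own data.

```r
x <- "a1~!@#$%^&*(){}_+:\"<>?,./;'[]-="

# POSIX class: what counts as alphanumeric may vary with the locale.
keep_alnum <- gsub("[^[:alnum:]]", "", x)

# Explicit ranges: intended to keep only the letters A-Z, a-z and the digits 0-9.
keep_ascii <- gsub("[^A-Za-z0-9]", "", x)

print(keep_alnum)
print(keep_ascii)
```

For a plain ASCII input like this one, both calls should agree. They can diverge once the text contains accented letters.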
How to remove specific special characters in R
gsub("[^[:alnum:][:blank:]+?&/\\-]", "", c)
# [1] "In Acid-base reaction page4 why does it create water and not H+?"
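The pattern above is a negated character class used as a whitelist: everything not listed inside [^...] is deleted. Below is a small sketch of the same idea on a made-up string (txt is a hypothetical input, not the one from the question). It uses a slight variant of the pattern with the hyphen placed last, which is the unambiguous way to include a literal "-" in a class.

```r
# Hypothetical input; the original question's text is not shown here.
txt <- "Why (H+) & not *water*? See page#4, acid-base!"

# Keep letters, digits, blanks and the symbols + ? & / -; drop everything else.
kept <- gsub("[^[:alnum:][:blank:]+?&/-]", "", txt)

print(kept)
# [1] "Why H+ & not water? See page4 acid-base"
```

To allow further characters, add them inside the brackets, keeping the hyphen at the end.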
Remove special characters from entire dataframe in R
Another solution is to convert the data frame to a matrix first, run gsub on it, and then convert the result back to a data frame:
as.data.frame(gsub("[[:punct:]]", "", as.matrix(df)))
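One thing to watch with the matrix route: as.matrix turns every column into character, so numeric columns lose their type and even their decimal points, since "." counts as punctuation. A sketch under the assumption that you only want to clean the character columns (df here is a made-up example):

```r
# Made-up data frame with one numeric and one character column.
df <- data.frame(id = c(1.5, 2.5), name = c("a.b!", "c?d"),
                 stringsAsFactors = FALSE)

# Matrix route: everything becomes text, and "1.5" turns into "15".
via_matrix <- as.data.frame(gsub("[[:punct:]]", "", as.matrix(df)))

# Column-wise route: only touch the character columns, keep the rest as is.
is_chr <- vapply(df, is.character, logical(1))
cleaned <- df
cleaned[is_chr] <- lapply(df[is_chr], function(col) gsub("[[:punct:]]", "", col))

str(via_matrix)
str(cleaned)
```

If every column is text anyway, the one-line matrix version is fine.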
R how to remove VERY special characters in strings?
So, I'm going to go ahead and make an answer, because I believe this is what you're looking for:
> s = "who are í ½í¸€ bringing?"
> rmSpec <- "í|½|€" # The "|" designates a logical OR in regular expressions.
> s.rem <- gsub(rmSpec, "", s) # gsub finds every match of rmSpec and replaces it with "".
> s.rem
[1] "who are ¸ bringing?"
Now, this does have the caveat that you have to manually define the special characters in the rmSpec variable. Not sure if you know what special characters to remove or if you're looking for a more general solution.
EDIT:
So it appears you almost had it with iconv; you were just missing the sub argument. See below:
> s
[1] "who are í ½í¸€ bringing?"
> s2 <- iconv(s, "UTF-8", "ASCII", sub = "")
> s2
[1] "who are bringing?"
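For reference, a small sketch of how the sub argument changes what iconv does with bytes it cannot convert. The input string is made up, and the exact behaviour of iconv can differ between platforms, so treat this as an illustration rather than a guarantee.

```r
# Made-up UTF-8 input with two accented letters.
s <- "caf\u00e9 na\u00efve"

# sub = "" silently drops every byte that has no ASCII equivalent.
dropped <- iconv(s, from = "UTF-8", to = "ASCII", sub = "")

# sub = "byte" keeps a visible trace of each dropped byte as <xx> hex codes.
traced <- iconv(s, from = "UTF-8", to = "ASCII", sub = "byte")

print(dropped)
print(traced)
```

Note that sub = "" removes the characters outright; it does not turn "é" into "e".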
R: Replace Special Characters
We can match one or more characters that are not alphabetic and replace them with "S"
df$Q2 <- sub("[^A-Za-z]+", "S", df$Q2)
df$Q2
#[1] "aSk" "aSk" "aSk"
Or we capture only the alphabetic characters as a group (([A-Za-z]*)) from the start (^) of the string, match the following non-alphabetic characters, and replace them with the backreference to the captured group followed by "S"
sub("^([A-Za-z]*)[^A-Za-z]+", "\\1S", df$Q2)
#[1] "aSk" "aSk" "aSk"
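Both versions use sub, which only replaces the first match in each string. A quick sketch on made-up values (q2 is hypothetical, since the original df is not shown) of how that differs from gsub:

```r
# Hypothetical values; the third one has two separate runs of non-letters.
q2 <- c("a-k", "a--k", "a-k-z")

first_only <- sub("[^A-Za-z]+", "S", q2)                      # first run only
every_run  <- gsub("[^A-Za-z]+", "S", q2)                     # every run
anchored   <- sub("^([A-Za-z]*)[^A-Za-z]+", "\\1S", q2)       # backreference form

print(first_only)
# [1] "aSk"   "aSk"   "aSk-z"
print(every_run)
# [1] "aSk"   "aSk"   "aSkSz"
```

The "+" quantifier is what collapses "--" into a single "S".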
Remove special characters and numbers from column R
Remove X. and digits
str_remove_all(df$c, "[X.]|[:digit:]")
#> [1] "Int" "BI" "Int" "BI" "Int"
inside mutate:
df %>%
mutate(c = str_remove_all(c, "[X.]|[:digit:]"))
#> c d
#> 1 Int 4
#> 2 BI 1
#> 3 Int 2
#> 4 BI 3
#> 5 Int 5
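If you would rather not depend on stringr, a rough base R equivalent is sketched below. The input values are made up to resemble the column above. One difference to keep in mind: base R needs the POSIX class wrapped in an outer pair of brackets, [[:digit:]], whereas stringr accepts [:digit:] on its own.

```r
# Made-up values resembling the column in the question.
codes <- c("X1.Int", "X2.BI", "X10.Int")

# Alternation, mirroring the stringr pattern.
cleaned <- gsub("[X.]|[[:digit:]]", "", codes)

# The same thing as a single character class.
cleaned_one_class <- gsub("[X.[:digit:]]", "", codes)

print(cleaned)
# [1] "Int" "BI"  "Int"
```

Either pattern removes every "X" and ".", not just the leading ones, so it would also strip an "X" in the middle of a value.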
Remove special characters in R from .docx
There are several things that make this hard:
- You want to replace characters with something that's generally the same, not just convert the encoding. In your example, "<e1><b8><9d>" does not stand for an "e", it stands for a complicated version of an "e", meaning R won't just change it. But there are functions to do that.
- It looks like qdap.transcript tries to be helpful. What you show here, and your results, are consistent with these not being special characters at all but the literal text "<e1><b8><9d>". So if you try to remove special characters, gsub happily complies and removes the "<" and ">", leaving "e1" and so forth alone.
To solve your problem, I think you want to convert back to special characters, and then use stri_trans_general from the stringi package. I'm sure there are other similar functions out there, but this one works for me. It turns out converting back to the special characters is the hard part, but I've got some working code:
library(stringi)
mystring <- 'If anyone knows how to simply change these special characters (i.e <e1><b8><9d> to e), again please feel free to update!'
pos <- gregexpr('(<[A-Fa-f0-9]{2}>)+', mystring)[[1]]
replace <- substring(mystring, pos, pos+attr(pos, 'match.length')-1)
replace <- sapply(replace, function(r) {
  eval(parse(text=paste0('\'', gsub('>', '', gsub('<', '\\\\x', r)), '\'')))
})
for(i in seq_along(replace)) {
  mystring <- sub('(<[A-Fa-f0-9]{2}>)+', replace[i], mystring)
}
mystring <- stri_trans_general(mystring, 'latin-ascii')
We first extract everything that looks like hexadecimals between "<" and ">", then convert them to literal "\xe1\xb8\x9d", and then ask R to process that, and replace the old values with those replacements.
Only in the last line do we replace the special characters with (in this example) "e".
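If the eval(parse(...)) step feels uncomfortable, the hex codes can also be turned into bytes directly. This is only a sketch of an alternative, not the code above: hex_tags_to_utf8 is a hypothetical helper name, and it assumes the tags really are UTF-8 bytes.

```r
# Hypothetical helper: turn a run like "<e1><b8><9d>" into the character it encodes.
hex_tags_to_utf8 <- function(tagged) {
  # Pull out the two-digit hex codes, e.g. "e1" "b8" "9d".
  hex <- regmatches(tagged, gregexpr("[A-Fa-f0-9]{2}", tagged))[[1]]
  # Convert the codes to raw bytes and then to a string.
  out <- rawToChar(as.raw(strtoi(hex, base = 16L)))
  # Assumption: the bytes are UTF-8, so mark the string accordingly.
  Encoding(out) <- "UTF-8"
  out
}

decoded <- hex_tags_to_utf8("<e1><b8><9d>")
print(utf8ToInt(decoded))  # code point of the decoded character
```

The decoded character can then go through stri_trans_general(decoded, 'latin-ascii') as in the last line of the code above.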
Removing special characters from a dataframe in R
We can loop over the columns and use gsub to match characters that are not -, /, ., or digits and replace them with empty strings (""), assign the result back to the dataset, and convert the second column to numeric
df1[] <- lapply(df1, function(x) gsub("[^-0-9/.]+", "", x))
df1[,2] <- as.numeric(df1[,2])
df1
# Date NAV
#1 03/08/2017 209.0537
#2 02/08/2017 208.7831
#3 01/08/2017 208.7373
If this needs to be converted to xts
library(xts)
xts(df1[-1], order.by = as.Date(df1$Date, "%m/%d/%Y"))
# NAV
#2017-01-08 208.7373
#2017-02-08 208.7831
#2017-03-08 209.0537
data
df1 <- structure(list(Date = structure(c(3L, 2L, 1L), .Label = c("=\"01/08/2017\"",
"=\"02/08/2017\"", "=\"03/08/2017\""), class = "factor"), NAV = structure(c(3L,
2L, 1L), .Label = c("=\"€208.7373\"", "=\"€208.7831\"",
"=\"€209.0537\""
), class = "factor")), .Names = c("Date", "NAV"), row.names = c(NA,
-3L), class = "data.frame")
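As a quick check of the cleaning step on plain character vectors, here is a minimal sketch. The values are copied from the data above, with the euro sign written as \u20ac, and the variable names are only illustrative.

```r
raw_nav  <- c("=\"\u20ac209.0537\"", "=\"\u20ac208.7831\"")
raw_date <- c("=\"03/08/2017\"", "=\"02/08/2017\"")

# Keep only digits, "-", "/" and "."; everything else is removed.
clean_nav  <- gsub("[^-0-9/.]+", "", raw_nav)
clean_date <- gsub("[^-0-9/.]+", "", raw_date)

nav   <- as.numeric(clean_nav)
dates <- as.Date(clean_date, "%m/%d/%Y")

# A conversion that silently produced NA would point to leftover characters.
stopifnot(!anyNA(nav), !anyNA(dates))
```

The "-" is placed first inside the class so that it is read as a literal hyphen rather than as part of a range.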
Removing Special Characters in a Text File in R
gsub("[@#]([a-zA-Z]+)[@#]", "\\1", x)
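To show what this pattern does, here is a short sketch on a made-up string (x below is hypothetical). The pattern looks for a run of letters wrapped in @ or # on both sides and keeps only the letters via the \\1 backreference.

```r
# Made-up input: two delimited words and one ordinary e-mail address.
x <- "this is #my# text @with@ tags, plus an email a@b.com"

unwrapped <- gsub("[@#]([a-zA-Z]+)[@#]", "\\1", x)

print(unwrapped)
# [1] "this is my text with tags, plus an email a@b.com"
```

The address is left alone because the "@" there is not followed by letters and a closing delimiter. Note that the pattern does not require the two delimiters to match, so "#word@" would be unwrapped as well.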