R remove non-alphanumeric symbols from a string
Here is an example:
> str <- "This is a string. In addition, this is a string!"
> str
[1] "This is a string. In addition, this is a string!"
> strsplit(gsub("[^[:alnum:] ]", "", str), " +")[[1]]
[1] "This" "is" "a" "string" "In" "addition" "this" "is" "a"
[10] "string"
How can I remove non-numeric characters from strings using gsub in R?
Simply use
gsub("[^0-9.-]", "", x)
If the input can contain multiple - and . characters, use a second regex to deal with that.
If you struggle with it, open a new question.
(Make sure to replace . with , if needed.)
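A minimal sketch of what such a second regex pass could look like. The helper name clean_number and the policy it applies (keep only a leading minus sign, keep only the last dot as the decimal point) are assumptions for illustration, not part of the answer above:

```r
# Hypothetical helper: reduce messy strings to something as.numeric() can parse.
clean_number <- function(x) {
  # Pass 1 (from the answer): keep only digits, dots and minus signs.
  y <- gsub("[^0-9.-]", "", x)
  # Pass 2 (assumed policy): drop every minus sign that is not the first character.
  y <- gsub("(?!^)-", "", y, perl = TRUE)
  # Pass 3 (assumed policy): drop every dot that is followed by a later dot,
  # so only the last dot survives as the decimal point.
  gsub("\\.(?=.*\\.)", "", y, perl = TRUE)
}

clean_number(c("abc-12.5xyz", "1-2.3.4-"))
#> [1] "-12.5" "123.4"
```

If your data uses the dot as a thousands separator instead, the third pass would need a different rule.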
Remove all special characters from a string in R?
You need to use regular expressions to identify the unwanted characters. For the most easily readable code, you want the str_replace_all function from the stringr package, though gsub from base R works just as well.
The exact regular expression depends upon what you are trying to do. You could just remove those specific characters that you gave in the question, but it's much easier to remove all punctuation characters.
x <- "a1~!@#$%^&*(){}_+:\"<>?,./;'[]-=" #or whatever
str_replace_all(x, "[[:punct:]]", " ")
(The base R equivalent is gsub("[[:punct:]]", " ", x).)
An alternative is to swap out all non-alphanumeric characters.
str_replace_all(x, "[^[:alnum:]]", " ")
Note that the definition of what constitutes a letter, a number or a punctuation mark varies slightly depending upon your locale, so you may need to experiment a little to get exactly what you want.
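To make the difference between the two patterns concrete, here is a small base R sketch. The sample string is made up and ASCII-only, so the locale caveat should not matter for it:

```r
x <- "a b_c!d"  # made-up sample

# [[:punct:]] matches punctuation only, so the space survives.
gsub("[[:punct:]]", "", x)
#> [1] "a bcd"

# [^[:alnum:]] matches everything that is not a letter or digit, including the space.
gsub("[^[:alnum:]]", "", x)
#> [1] "abcd"
```

So if you want to keep whitespace, either use the punctuation class or add a space to the negated set.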
keep only alphanumeric characters and space in a string using gsub
You could use the classes [:alnum:] and [:space:] for this:
sample_string <- "�+ Sample 2 string here =�{�>E�BH�P<]�{�>"
gsub("[^[:alnum:][:space:]]","",sample_string)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"
Alternatively you can use PCRE codes to refer to specific character sets:
gsub("[^\\p{L}0-9\\s]","",sample_string, perl = TRUE)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"
Both cases show clearly that the characters still remaining are considered letters. The EBHP inside are also still letters, so the condition on which you're replacing is not correct. You don't want to keep all letters; you just want to keep A-Z, a-z and 0-9:
gsub("[^A-Za-z0-9 ]","",sample_string)
#> [1] " Sample 2 string here EBHP"
This still contains the EBHP. If you really just want to keep a section that contains only letters and numbers, you should use the reverse logic: select what you want and replace everything but that using backreferences:
gsub(".*?([A-Za-z0-9 ]+)\\s.*","\\1", sample_string)
#> [1] " Sample 2 string here "
Or, if you want to find a string, even one not bound by spaces, use the word boundary \\b instead:
gsub(".*?(\\b[A-Za-z0-9 ]+\\b).*","\\1", sample_string)
#> [1] "Sample 2 string here"
What happens here:
- .*? fits anything (.) at least 0 times (*) but ungreedy (?). This means that gsub will try to fit the smallest amount possible with this piece.
- everything between () will be stored and can be referred to in the replacement by \\1
- \\b indicates a word boundary
- This is followed at least once (+) by any character that's A-Z, a-z, 0-9 or a space. You have to do it that way, because a range such as A-z covers more than the plain letters: other characters sit between the upper and lowercase letters in the code table, and depending on the locale's collation the range can include the special letters as well (which are UTF-8 btw!)
- after that sequence, fit anything at least zero times to remove the rest of the string.
- the backreference \\1 in combination with .* in the regex will make sure only the required part remains in the output.
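The steps above can be traced on a string you can actually type. This is only a sketch: the sample string is a made-up ASCII stand-in for the mangled one in the question, and perl = TRUE is an assumption added here so that the lazy quantifier and the ranges follow PCRE's well-defined rules.

```r
s <- "## Sample 2 string here ##!!"  # made-up, ASCII-only stand-in

# Lazy prefix, captured run bounded by word boundaries, greedy suffix.
# perl = TRUE is an assumption here, chosen for PCRE's backtracking semantics.
kept <- gsub(".*?(\\b[A-Za-z0-9 ]+\\b).*", "\\1", s, perl = TRUE)
kept
#> [1] "Sample 2 string here"

# Why A-Za-z rather than A-z: with PCRE the range A-z follows code points,
# so it also covers the characters [ \ ] ^ _ ` that sit between Z and a.
grepl("[A-z]", "_", perl = TRUE)
#> [1] TRUE
grepl("[A-Za-z]", "_", perl = TRUE)
#> [1] FALSE
```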
How can I select all records with non-alphanumeric and remove them?
I suggest using REGEXP_REPLACE in the SELECT to remove the characters, and REGEXP_CONTAINS in the WHERE clause to get only the rows you want.
SELECT REGEXP_REPLACE(EMPLOYER, r'[^a-zA-Z\d\s]', '')
FROM fec.work
WHERE REGEXP_CONTAINS(EMPLOYER, r'[^a-zA-Z\d\s]')
You say you don't want to use replace because you don't know how many alphanumerical characters there are. But instead of listing every non-alphanumerical character, why not use ^ to match everything except the alphanumerical ones?
EDIT :
To complement what Mikhail answered, you have several choices for your regex:
'[^a-zA-Z\\d\\s]' // Basic regex
r'[^a-zA-Z\d\s]' // Uses r to avoid escaping
r'[^\w\s]' // \w = [a-zA-Z0-9_] (! underscore as alphanumerical !)
If you don't consider underscores to be alphanumerical, you should not use \w.
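For readers working in R rather than BigQuery, the same filter-then-clean idea can be sketched with grepl and gsub. The employer vector is made-up data, and perl = TRUE is assumed so that \d, \s and \w behave as in the patterns above:

```r
employer <- c("ACME, Inc.", "Self_Employed", "IBM")  # made-up data

pat <- "[^a-zA-Z\\d\\s]"

# Analogue of WHERE REGEXP_CONTAINS: keep only values with an unwanted character.
dirty <- employer[grepl(pat, employer, perl = TRUE)]

# Analogue of REGEXP_REPLACE: strip those characters.
gsub(pat, "", dirty, perl = TRUE)
#> [1] "ACME Inc"     "SelfEmployed"

# With \w the underscore counts as a word character and is kept.
gsub("[^\\w\\s]", "", dirty, perl = TRUE)
#> [1] "ACME Inc"      "Self_Employed"
```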
Removing non-alphanumeric characters from an ordered collection of objects (list) in R
Strongly recommend you simply use
gsub("[^a-zA-Z0-9]","",x)
where x is the name of the list.
You probably included the foreign characters at the end of the list because you want these obliterating too - well, the above command achieves this. To explain briefly, the square brackets in the command define a collection of symbols, and the ^ symbol means "not", so everything that is not in the specified set of 62 characters (lower case a to z, upper case A to Z, and digits 0 to 9) will be replaced by the empty string "" (i.e. destroyed).
And here's the output...
[1] "" "" ""
[4] "" "" ""
[7] "" "Home" ""
[10] "Expertise" "QuestionResearchDesign" ""
[13] "SurveyDevelopmentValidation" "" "DataProcessing"
[16] "" "StatisticalAnalysis" ""
[19] "PublicationsGrants" "" "Evaluation"
[22] "" "" "ConsultingAreas"
[25] "Business" "" "Education"
[28] "K12" "" ""
[31] "" ""
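If you want to try this without the original data, here is a small sketch. The list x below is a made-up stand-in, and dropping the empty entries afterwards is an optional extra step, not something the answer requires:

```r
x <- list("Home", "K-12", "***", "Data Processing")  # made-up stand-in

# gsub() coerces its input with as.character() and returns a character vector.
cleaned <- gsub("[^a-zA-Z0-9]", "", x)
cleaned
# elements: "Home", "K12", "", "DataProcessing"

# Optionally drop the entries that ended up empty.
cleaned[nzchar(cleaned)]
# elements: "Home", "K12", "DataProcessing"
```

Note that the result is a plain character vector rather than a list; wrap it in as.list() if you need a list back.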
How do I remove all non alphanumeric characters from a string except dash?
Replace [^a-zA-Z0-9 -]
with an empty string.
Regex rgx = new Regex("[^a-zA-Z0-9 -]");
str = rgx.Replace(str, "");
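The snippet above is C#. Since the rest of this page is about R, here is how the same pattern might be applied with gsub (the sample string is made up):

```r
s <- "Hello, World - it's 2024!"  # made-up sample

# The trailing "-" inside the brackets is literal, so dashes (and spaces) survive.
gsub("[^a-zA-Z0-9 -]", "", s)
#> [1] "Hello World - its 2024"
```

Keeping the dash as the last character in the brackets avoids having it read as a range.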