R Remove Non-Alphanumeric Symbols from a String

R remove non-alphanumeric symbols from a string

Here is an example that drops everything except letters, digits and spaces, then splits the result on runs of spaces:

> str <- "This is a string. In addition, this is a string!"
> str
[1] "This is a string. In addition, this is a string!"
> strsplit(gsub("[^[:alnum:] ]", "", str), " +")[[1]]
[1] "This" "is" "a" "string" "In" "addition" "this" "is" "a"
[10] "string"

How can I remove non-numeric characters from strings using gsub in R?

Simply use

gsub("[^0-9.-]", "", x)

If a value can contain more than one - or ., add a second regex to deal with that. If you struggle with it, open a new question.

(Replace . with , in the pattern if your data uses a decimal comma.)
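
As a minimal sketch of the pattern above, assuming the goal is to turn messy strings into numbers (the sample vector x and the follow-up as.numeric() call are illustrative and not part of the original answer):

```r
# Hypothetical input: amounts with currency symbols, units and thousands separators.
x <- c("$1,234.50", "-42 units", "n/a")

# Keep only digits, the decimal point and the minus sign.
# The "-" sits last inside the brackets so it is read as a literal, not a range.
cleaned <- gsub("[^0-9.-]", "", x)

# Strings that end up empty become NA when converted.
values <- as.numeric(cleaned)
```

A value such as "1-2-3" would pass through the cleaning step unchanged and still fail to convert, which is the case the second regex mentioned above has to cover.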

Remove all special characters from a string in R?

You need to use regular expressions to identify the unwanted characters. For the most readable code, use str_replace_all from the stringr package, though gsub from base R works just as well.

The exact regular expression depends upon what you are trying to do. You could just remove those specific characters that you gave in the question, but it's much easier to remove all punctuation characters.

library(stringr)

x <- "a1~!@#$%^&*(){}_+:\"<>?,./;'[]-=" # or whatever
str_replace_all(x, "[[:punct:]]", " ")

(The base R equivalent is gsub("[[:punct:]]", " ", x).)

An alternative is to swap out all non-alphanumeric characters.

str_replace_all(x, "[^[:alnum:]]", " ")

Note that the definition of what constitutes a letter, a number or a punctuation mark varies slightly depending upon your locale, so you may need to experiment a little to get exactly what you want.
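
To see the two approaches side by side, here is a small base R sketch. The sample string is made up and ASCII-only, so the locale caveat does not come into play; with accented or non-Latin text the results may differ between locales and between stringr and base R.

```r
# Hypothetical ASCII-only input.
x <- "Hello, World! (2024)"

# Approach 1: delete punctuation outright.
no_punct <- gsub("[[:punct:]]", "", x)

# Approach 2: turn every non-alphanumeric character (spaces included) into a space...
only_alnum <- gsub("[^[:alnum:]]", " ", x)

# ...then collapse the runs of spaces that this leaves behind.
tidy <- trimws(gsub(" +", " ", only_alnum))
```

Replacing with a space instead of the empty string keeps words from running together, at the cost of the extra clean-up step.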

Keep only alphanumeric characters and space in a string using gsub

You could use the classes [:alnum:] and [:space:] for this:

sample_string <- "ï¿½+ Sample 2 string here =ï¿½{ï¿½>Eï¿½BHï¿½P<]ï¿½{ï¿½>"
gsub("[^[:alnum:][:space:]]","",sample_string)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"

Alternatively you can use PCRE codes to refer to specific character sets:

gsub("[^\\p{L}0-9\\s]","",sample_string, perl = TRUE)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"

Both results show that the characters left behind are considered letters. The E, B, H and P inside the garbage are letters as well, so the condition you are replacing on is not the right one. You don't want to keep all letters, just A-Z, a-z and 0-9:

gsub("[^A-Za-z0-9 ]","",sample_string)
#> [1] " Sample 2 string here EBHP"

This still contains the EBHP. If you really just want to keep a section that contains only letters and numbers, you should use the reverse logic: select what you want and replace everything but that using backreferences:

gsub(".*?([A-Za-z0-9 ]+)\\s.*","\\1", sample_string)
#> [1] " Sample 2 string here "

Or, if the string you want is not necessarily delimited by spaces, use the word boundary \\b instead:

gsub(".*?(\\b[A-Za-z0-9 ]+\\b).*","\\1", sample_string)
#> [1] "Sample 2 string here"

What happens here:

  • .*? matches any character (.) zero or more times (*), but lazily (?). The engine therefore gives this part as little of the string as possible.
  • Everything between ( and ) is captured and can be referred to in the replacement as \\1.
  • \\b marks a word boundary.
  • The boundary is followed by one or more (+) characters that are A-Z, a-z, 0-9 or a space. Write the two letter ranges separately, because A-z also covers whatever sits between the uppercase and lowercase letters in the code table (in ASCII, the characters [ \ ] ^ _ `), and ranges can be locale-dependent. The special letters in the sample string are UTF-8, by the way.
  • After that sequence, .* matches whatever remains, so the rest of the string is removed.
  • Because the pattern matches the whole string and the replacement is only \\1, just the captured part remains in the output.
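
The steps above can be tried on a simpler input. This is a sketch, not the original answer's code: the string s is a made-up ASCII stand-in for the sample, sub is used instead of gsub because the pattern spans the whole string anyway, and perl = TRUE is chosen here so that the lazy quantifier follows PCRE's backtracking rules.

```r
# Hypothetical ASCII-only input, so the result does not depend on locale or encoding.
s <- "##+ Sample 2 string here =#{#>E#BH#P<]"

# .*?  : skip as little as possible before the capture
# (\b[A-Za-z0-9 ]+\b) : capture a run of letters, digits and spaces that
#                       starts and ends on a word boundary
# .*   : swallow the rest, so only \1 is left after replacement
kept <- sub(".*?(\\b[A-Za-z0-9 ]+\\b).*", "\\1", s, perl = TRUE)
```

The trailing \\b is what trims the final space: the capture first grabs "Sample 2 string here " and then backs off one character until it ends on a word boundary.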

How can I select all records with non-alphanumeric and remove them?

I suggest using REGEXP_REPLACE in the SELECT list to remove the characters, and REGEXP_CONTAINS in the WHERE clause to return only the rows that contain them.

SELECT REGEXP_REPLACE(EMPLOYER, r'[^a-zA-Z\d\s]', '') 
FROM fec.work
WHERE REGEXP_CONTAINS(EMPLOYER, r'[^a-zA-Z\d\s]')

You say you don't want to use a replace because you don't know how many non-alphanumeric characters there are. But instead of listing every non-alphanumeric character, why not use ^ to match everything that is not alphanumeric?

EDIT:

To add to Mikhail's answer, there are several ways to write the regex:

'[^a-zA-Z\\d\\s]'   -- ordinary string literal, backslashes escaped
r'[^a-zA-Z\d\s]'    -- raw string (r prefix), no escaping needed
r'[^\w\s]'          -- \w = [a-zA-Z0-9_], so underscores count as alphanumeric

If you don't consider underscores to be alphanumeric, you should not use \w.
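
The difference between the last two patterns is easy to check outside BigQuery. The sketch below uses R, the language of the other answers on this page, with perl = TRUE so that \\d, \\s and \\w are available inside brackets. The sample string is made up, and it is an assumption that BigQuery's regex engine treats these ASCII classes the same way.

```r
# Hypothetical input containing an underscore.
x <- "user_name: J. Doe (id #42)"

# \w counts the underscore as a word character, so it survives.
with_w <- gsub("[^\\w\\s]", "", x, perl = TRUE)

# Spelling out the class removes the underscore as well.
explicit <- gsub("[^a-zA-Z\\d\\s]", "", x, perl = TRUE)

# Counterpart of the WHERE clause: does the value contain anything to remove?
needs_cleaning <- grepl("[^a-zA-Z\\d\\s]", x, perl = TRUE)
```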

Removing non-alphanumeric characters from an ordered collection of objects (list) in R

I strongly recommend simply using

gsub("[^a-zA-Z0-9]","",x)

where x is the name of the list.

You probably included the foreign characters at the end of the list because you want those removed too; the command above does that. To explain briefly, the square brackets define a set of characters and the ^ at the start of the set means "not", so every character that is not in the specified set of 62 (lower case a to z, upper case A to Z, and digits 0 to 9) is replaced by the empty string "" (i.e. deleted).

And here's the output...

[1] "" "" ""
[4] "" "" ""
[7] "" "Home" ""
[10] "Expertise" "QuestionResearchDesign" ""
[13] "SurveyDevelopmentValidation" "" "DataProcessing"
[16] "" "StatisticalAnalysis" ""
[19] "PublicationsGrants" "" "Evaluation"
[22] "" "" "ConsultingAreas"
[25] "Business" "" "Education"
[28] "K12" "" ""
[31] "" ""
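
Many entries in that output are empty strings, because elements that held only whitespace or symbols are reduced to "". A minimal sketch of cleaning and then dropping those leftovers, using a made-up character vector in place of the original data:

```r
# Hypothetical stand-in for the scraped menu items.
x <- c("  Home ", "Survey Development & Validation", "K-12", "\t\n", "Publications/Grants")

# gsub is vectorised, so every element is cleaned in one call.
cleaned <- gsub("[^a-zA-Z0-9]", "", x)

# nzchar() is TRUE for non-empty strings; use it to discard the blanks.
non_empty <- cleaned[nzchar(cleaned)]
```

If x is a genuine list and not a character vector, flattening it first with unlist() is a reasonable way to keep the result predictable.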

How do I remove all non alphanumeric characters from a string except dash?

Replace [^a-zA-Z0-9 -] with an empty string.

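using System.Text.RegularExpressions;

// The dash comes last in the class, so it is a literal dash and not a range.
Regex rgx = new Regex("[^a-zA-Z0-9 -]");
str = rgx.Replace(str, "");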
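
The same character class works unchanged in R, the language used in the rest of this page. A small sketch with a made-up input string:

```r
# Hypothetical input with dashes that should survive.
s <- "well-known: 100% A+ (rev. 2)"

# Keep letters, digits, spaces and the dash; the dash is placed last so it is literal.
kept <- gsub("[^a-zA-Z0-9 -]", "", s)
```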

