Remove All Punctuation Except Apostrophes in R

Remove all punctuation except apostrophes in R

x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[^[:alnum:][:space:]']", "", x)

[1] "I like to chew gum but don't like bubble gum"

The regex above is straightforward. The leading caret negates the character class, so every character that is not alphanumeric, whitespace, or an apostrophe is replaced with an empty string.

R regex remove all punctuation except apostrophe

A "negative lookahead assertion" can be used to remove from consideration any apostrophes, before they are even tested for being punctuation characters.

gsub("(?!')[[:punct:]]", "", str2, perl=TRUE)
# [1] "this doesn't not have an apostrophe"
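Since str2 is not defined in the snippet above, here is a minimal self-contained sketch of the same lookahead idea. The input string sample_text is made up for illustration. Lookaheads need perl=TRUE, because R's default regex engine does not support them.

```r
# Hypothetical input; the original str2 is not shown in the answer.
sample_text <- "this, doesn't. not! have; an apostrophe?"

# (?!') fails at any position where the next character is an apostrophe,
# so [[:punct:]] only ever consumes the other punctuation characters.
no_punct <- gsub("(?!')[[:punct:]]", "", sample_text, perl = TRUE)
print(no_punct)
# [1] "this doesn't not have an apostrophe"
```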

Remove punctuation from text (except the symbol &)

What about doing the inverse, i.e. replacing everything that is not a letter, a digit, whitespace, or an & with an empty string:

gsub("[^[:alnum:][:space:]&]", "", data)
# [1] "Type the command AT&W enter in order to save the new protocol on modem"
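The same negated-class pattern works for any character you want to keep, so it can be wrapped in a small function. This is only a sketch: strip_except and its keep argument are names I made up, the input strings are invented, and it assumes keep contains only characters that are literal inside a bracket expression.

```r
# Hypothetical helper (not from the answers above): remove every character
# that is not alphanumeric, whitespace, or listed in `keep`.
# Assumes `keep` holds only characters that are literal inside a bracket
# expression (e.g. ' & .), not ], ^, - or \.
strip_except <- function(text, keep = "") {
  pattern <- paste0("[^[:alnum:][:space:]", keep, "]")
  gsub(pattern, "", text)
}

# Made-up inputs for illustration.
note <- "Type the command AT&W <enter>, in order to save the new protocol on modem."
kept_amp <- strip_except(note, keep = "&")
kept_apos <- strip_except("don't stop, believing!", keep = "'")

print(kept_amp)
# [1] "Type the command AT&W enter in order to save the new protocol on modem"
print(kept_apos)
# [1] "don't stop believing"
```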

In R, use gsub to remove all punctuation except period

You can put back some matches like this:

sub("([.-])|[[:punct:]]", "\\1", as.matrix(z))
     X..1. X..2.
[1,] "1"   "6"
[2,] "2"   "7.235"
[3,] "3"   "8"
[4,] "4"   "9"
[5,] "5"   "-10"

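Here I am keeping the . and -: the capture group puts them back through \\1, while any other punctuation is replaced with nothing. Note that sub() only replaces the first match in each string, so use gsub() if a value can contain more than one punctuation character.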

I guess the next step is to coerce your result to a numeric matrix, so here I combine the two steps:

matrix(as.numeric(sub("([.-])|[[:punct:]]", "\\1", as.matrix(z))),ncol=2)
     [,1]    [,2]
[1,]    1   6.000
[2,]    2   7.235
[3,]    3   8.000
[4,]    4   9.000
[5,]    5 -10.000
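To see how sub() and gsub() differ with this pattern, here is a small sketch on a made-up character vector (vals is hypothetical, since the original z is not shown). With sub() only the first match per string is handled, so trailing punctuation can survive.

```r
# Hypothetical values; the original data frame z is not shown in the answer.
vals <- c("1,", "7.235", "-10;", "(8)")

# sub() touches only the first match in each string.
first_only <- sub("([.-])|[[:punct:]]", "\\1", vals)
print(first_only)
# [1] "1"     "7.235" "-10;"  "8)"

# gsub() handles every match, so only digits, . and - remain.
all_matches <- gsub("([.-])|[[:punct:]]", "\\1", vals)
print(all_matches)
# [1] "1"     "7.235" "-10"   "8"

# The fully cleaned strings can then be coerced to numeric.
nums <- as.numeric(all_matches)
print(nums)
```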

Removing punctuation except for apostrophes AND intra-word dashes with gsub in R WITHOUT accidentally concatenating two words

A single gsub call gets you everything except the leading/trailing whitespace, which is left in place:

gsub("[[:punct:]]* *(\\w+[&'-]\\w+)|[[:punct:]]+ *| {2,}", " \\1", x)
# [1] "Good luck SPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventing concatenating is a new ballgame but why not "
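The leftover leading/trailing whitespace can be dropped with base R's trimws(). A rough sketch, assuming a short made-up input (tweet is hypothetical, not the original poster's x):

```r
# Hypothetical input, not the original poster's data.
tweet <- "I can't lie, brand-new stuff!"

# Pattern from the answer above: keeps word-internal &, ' and -.
pattern <- "[[:punct:]]* *(\\w+[&'-]\\w+)|[[:punct:]]+ *| {2,}"
spaced <- gsub(pattern, " \\1", tweet)

# trimws() removes the whitespace the " \\1" replacement leaves at the ends.
cleaned <- trimws(spaced)
print(cleaned)
```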

If you're able to use the qdapRegex package, you could do:

library(qdapRegex)
rm_default(x, pattern = "[^ a-zA-Z&'-]|[&'-]{2,}", replacement = " ")
# [1] "Good luck SPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventing concatenating is a new ballgame but why not"

How to remove punctuation excluding negations?

We can do it in two steps: first remove all punctuation except "'", then remove "'s" using a fixed match:

gsub("'s", "", gsub("[^[:alnum:][:space:]']", "", s), fixed = TRUE)
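Here is a sketch of the two steps on a made-up sentence (s is hypothetical). One thing to be aware of: a fixed match on "'s" cannot tell a possessive from a contraction, so "it's" loses its ending too.

```r
# Hypothetical input for illustration.
s <- "John's dog doesn't bark, it's quiet!"

# Step 1: drop all punctuation except the apostrophe.
step1 <- gsub("[^[:alnum:][:space:]']", "", s)
print(step1)
# [1] "John's dog doesn't bark it's quiet"

# Step 2: drop the literal "'s"; negations such as "doesn't" are untouched.
step2 <- gsub("'s", "", step1, fixed = TRUE)
print(step2)
# [1] "John dog doesn't bark it quiet"
```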

Regex; eliminate all punctuation except

It's not clear to me what you want the result to be, but you might be able to use negated POSIX classes like this:

R> strsplit(X, "[[:space:]]|(?=[^,'[:^punct:]])", perl=TRUE)[[1]]
[1] "I'm" "not" "that" "good" "at" "regex" "yet,"
[8] "but" "am" "getting" "better" "!"

Remove all punctuation except underline between characters in R with POSIX character class

You can use

gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", test, perl=TRUE)

Details:

  • [^_[:^punct:]] - any punctuation except _
  • | - or
  • _+\b - one or more _ at the end of a word
  • | - or
  • \b_+ - one or more _ at the start of a word
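The three alternatives above can be checked on a small example. This is just a sketch with a made-up input (the original test vector is not shown); perl=TRUE is needed for the [:^punct:] negated class.

```r
# Hypothetical input: leading, trailing and intra-word underscores.
test <- "_start snake_case, end_!"

# Punctuation other than _ is removed, and so are underscores at the
# start or end of a word; the underscore inside snake_case is kept.
cleaned <- gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", test, perl = TRUE)
print(cleaned)
# [1] "start snake_case end"
```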
