Remove all punctuation except apostrophes in R
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[^[:alnum:][:space:]']", "", x)
[1] "I like to chew gum but don't like bubble gum"
The above regex is much more straightforward. It replaces everything that is not alphanumeric, whitespace or an apostrophe (the leading caret negates the character class) with an empty string.
R regex remove all punctuation except apostrophe
A "negative lookahead assertion" can be used to remove from consideration any apostrophes, before they are even tested for being punctuation characters.
gsub("(?!')[[:punct:]]", "", str2, perl=TRUE)
# [1] "this doesn't not have an apostrophe"
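Since str2 is not defined in the snippet above, here is a minimal self-contained sketch of the same lookahead idea. The input string is made up for illustration and is not the one from the original answer.

```r
# Hypothetical input; not the str2 from the original answer.
txt <- "don't stop, it's fine!"

# (?!') fails at any position where the next character is an apostrophe,
# so [[:punct:]] never gets the chance to match one.
# perl = TRUE is required because lookaheads are a PCRE feature.
cleaned <- gsub("(?!')[[:punct:]]", "", txt, perl = TRUE)
print(cleaned)
# [1] "don't stop it's fine"
```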
Remove punctuation from text (except the symbol &)
What about doing the inverse, i.e. replacing everything that is not a letter, a digit, whitespace or an & with an empty string:
gsub("[^[:alnum:][:space:]&]", "", data)
# [1] "Type the command AT&W enter in order to save the new protocol on modem"
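The same negated-class idea can be wrapped in a small helper so the characters to keep become a parameter. This is only a sketch: strip_except() is a hypothetical name, and the sample strings are invented rather than taken from the question.

```r
# strip_except() is a hypothetical helper, not part of base R.
# `keep` is pasted verbatim into a bracket expression, so this sketch assumes
# it holds only simple literals such as & or ' (not ], ^, \ or a mid-string -).
strip_except <- function(x, keep = "") {
  pattern <- paste0("[^[:alnum:][:space:]", keep, "]")
  gsub(pattern, "", x)
}

kept_amp <- strip_except("Type AT&W, then press <enter>!", keep = "&")
kept_apos <- strip_except("don't!", keep = "'")
print(kept_amp)
# [1] "Type AT&W then press enter"
print(kept_apos)
# [1] "don't"
```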
In R, use gsub to remove all punctuation except period
You can put back some matches like this:
sub("([.-])|[[:punct:]]", "\\1", as.matrix(z))
     X..1. X..2.
[1,] "1"   "6"
[2,] "2"   "7.235"
[3,] "3"   "8"
[4,] "4"   "9"
[5,] "5"   "-10"
Here I am keeping the . and - characters.
I guess the next step is to coerce the result to a numeric matrix, so here I combine the two steps like this:
matrix(as.numeric(sub("([.-])|[[:punct:]]", "\\1", as.matrix(z))),ncol=2)
     [,1]    [,2]
[1,]    1   6.000
[2,]    2   7.235
[3,]    3   8.000
[4,]    4   9.000
[5,]    5 -10.000
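To see why the capture group "puts back" the period and minus sign, here is a small sketch on a made-up character vector (the original z is not shown, so the input below is an assumption). It uses gsub() rather than the sub() of the answer, because these invented strings contain more than one punctuation character each.

```r
# Hypothetical input; the original z is not shown in the answer.
raw <- c("1,234.50", "(-10)", "7.235")

# Group 1 captures a period or minus sign and "\\1" writes it back;
# any other punctuation leaves group 1 empty, so it is simply deleted.
cleaned <- gsub("([.-])|[[:punct:]]", "\\1", raw)
values <- as.numeric(cleaned)
print(cleaned)
# [1] "1234.50" "-10"     "7.235"
```

The numeric conversion in the last step works only because nothing but digits, a decimal point and a leading minus sign is left in each string.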
Removing punctuation except for apostrophes AND intra-word dashes with gsub in R WITHOUT accidentally concatenating two words
A single gsub call gets you most of the way, leaving only leading/trailing whitespace to trim:
gsub("[[:punct:]]* *(\\w+[&'-]\\w+)|[[:punct:]]+ *| {2,}", " \\1", x)
# [1] "Good luck SPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventing concatenating is a new ballgame but why not "
If you're able to use the qdapRegex package, you could do:
library(qdapRegex)
rm_default(x, pattern = "[^ a-zA-Z&'-]|[&'-]{2,}", replacement = " ")
# [1] "Good luck SPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventing concatenating is a new ballgame but why not"
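If a single dense pattern is hard to maintain, the same goal can be approached in several smaller passes. The following is an alternative sketch, not the method from either answer above; the input string is invented, and the choice of &, ' and - as the characters allowed inside words is an assumption carried over from the patterns shown.

```r
# Hypothetical input for illustration.
x <- "brand-new stuff...can't lie -- At&t!"

# 1. Replace every character that is not alphanumeric, whitespace, &, ' or -
#    with a space, so neighbouring words are never glued together.
step1 <- gsub("[^[:alnum:][:space:]&'-]", " ", x)

# 2. Replace runs of & ' - that are not flanked by alphanumerics on both
#    sides (i.e. not intra-word) with a space. Lookarounds need perl = TRUE.
step2 <- gsub("(?<![[:alnum:]])[&'-]+|[&'-]+(?![[:alnum:]])", " ", step1, perl = TRUE)

# 3. Collapse repeated whitespace and trim both ends.
result <- trimws(gsub("\\s+", " ", step2))
print(result)
# [1] "brand-new stuff can't lie At&t"
```

Replacing with a space instead of an empty string is what prevents "stuff...can't" from collapsing into one word.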
How to remove punctuation excluding negations?
We can do it in two steps: remove all punctuation except "'", then remove "'s" using a fixed match:
gsub("'s", "", gsub("[^[:alnum:][:space:]']", "", s), fixed = TRUE)
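Because s is not defined in the snippet, here is a minimal sketch of the two steps on an invented sentence. It shows that a negation such as "isn't" survives while the "'s" endings are dropped.

```r
# Hypothetical input; not the s from the original question.
s <- "It's the dog's bone, isn't it?"

# Step 1: drop everything except alphanumerics, whitespace and apostrophes.
no_punct <- gsub("[^[:alnum:][:space:]']", "", s)

# Step 2: remove the literal substring 's (fixed = TRUE disables regex parsing).
result <- gsub("'s", "", no_punct, fixed = TRUE)
print(result)
# [1] "It the dog bone isn't it"
```

Note that the fixed match also turns "It's" into "It", which may or may not be what you want.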
Regex; eliminate all punctuation except
It's not clear to me what you want the result to be, but you might be able to use a negated character class like this:
R> strsplit(X, "[[:space:]]|(?=[^,'[:^punct:]])", perl=TRUE)[[1]]
[1] "I'm" "not" "that" "good" "at" "regex" "yet,"
[8] "but" "am" "getting" "better" "!"
Remove all punctuation except underline between characters in R with POSIX character class
You can use
gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", test, perl=TRUE)
Details:
[^_[:^punct:]] - any punctuation except _
| - or
_+\b - one or more _ at the end of a word
| - or
\b_+ - one or more _ at the start of a word
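A short sketch of the pattern in action, on a made-up string since test is not defined above. Keep in mind that _ counts as a word character in PCRE, which is why \b only fires where an underscore touches a non-word character or the string boundary.

```r
# Hypothetical input; not the test vector from the original question.
test <- "_foo_bar_, baz!"

# Leading and trailing underscores are removed, the one between
# "foo" and "bar" is kept, and all other punctuation is dropped.
result <- gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", test, perl = TRUE)
print(result)
# [1] "foo_bar baz"
```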