I Want to Match All Punctuation in My Regexp Except Apostrophes. How to Do That in Ruby

I want to match all punctuation in my regexp except apostrophes. How do I do that in Ruby?

string = "jack. o'reilly? mike??!?"
puts string.gsub(/[\p{P}&&[^']]/, '')
# => jack o'reilly mike

Docs:

A character class may contain another character class. By itself this isn’t useful because [a-z[0-9]] describes the same set as [a-z0-9]. However, character classes also support the && operator which performs set intersection on its arguments.

So, [\p{P}&&[^']] is "any character that is punctuation and also not an apostrophe".
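To make the set semantics from the quoted docs concrete, here is a minimal sketch. The sample strings and variable names are illustrative assumptions, not taken from the docs.

```ruby
# Illustrative sample input (an assumption, not from the original answer).
sample = "abc123!?"

# A nested class is just a union: both patterns describe the same set.
nested = sample.scan(/[a-z[0-9]]/)
flat   = sample.scan(/[a-z0-9]/)
puts nested == flat

# && intersects the sets: lowercase letters that are also "not a vowel".
consonants = "regexp".scan(/[a-z&&[^aeiou]]/)
puts consonants.inspect
```

The same idea carries over to [\p{P}&&[^']]: the left side is the set you want, and the negated class on the right carves the exception out of it.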

How do I match `:punct:` except for some character?

From Ruby docs:

A character class may contain another character class. By itself this isn't useful because [a-z[0-9]] describes the same set as [a-z0-9]. However, character classes also support the && operator which performs set intersection on its arguments.

So, "punctuation but not apostrophe" is:

[[:punct:]&&[^']]
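As a hedged illustration of how this class might be used, the sketch below excludes more than one character by listing them all in the negated class. The sample sentence and variable names are assumptions made for the example.

```ruby
# Illustrative input (an assumption, not from the original answer).
text = "It's a well-known fact, isn't it?"

# Punctuation, minus the apostrophe and the hyphen.
# The hyphen is escaped so it is read as a literal character, not a range.
keep_apostrophe_and_hyphen = /[[:punct:]&&[^'\-]]/

cleaned = text.gsub(keep_apostrophe_and_hyphen, '')
puts cleaned
# => It's a well-known fact isn't it
```

Adding further exceptions is a matter of appending them to the [^...] part.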

EDIT: At revo's request in the question comments, I benchmarked the alternatives. On my machine, lookahead comes out ~10% slower than intersection, and lookbehind ~20% slower:

require 'benchmark'

N = 1_000_000
STR = "Mr. O'Brien! Please don't go, Mr. O'Brien!"

def test(bm, re)
  N.times {
    STR.scan(re).size
  }
end

Benchmark.bm do |bm|
  bm.report("intersection") { test(bm, /[[:punct:]&&[^']]/) }
  bm.report("lookahead") { test(bm, /(?!')[[:punct:]]/) }
  bm.report("lookbehind") { test(bm, /[[:punct:]](?<!')/) }
end
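Timings only mean something if the three patterns match the same characters. The following is a small sanity check, not part of the original benchmark. It reuses the benchmark's test sentence, and the variable names are assumptions.

```ruby
# Same sentence as the benchmark above, held in a local variable.
sample = "Mr. O'Brien! Please don't go, Mr. O'Brien!"

patterns = {
  intersection: /[[:punct:]&&[^']]/,
  lookahead:    /(?!')[[:punct:]]/,
  lookbehind:   /[[:punct:]](?<!')/
}

# Collect the matches for each pattern and check that they agree.
results = patterns.transform_values { |re| sample.scan(re) }
results.each { |name, matches| puts "#{name}: #{matches.inspect}" }

all_agree = results.values.uniq.size == 1
puts "all patterns agree: #{all_agree}"
```

On this input, each pattern should yield the same five matches (the periods, exclamation marks and comma) and skip the apostrophes.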

R regex remove all punctuation except apostrophe

A "negative lookahead assertion" can be used to remove from consideration any apostrophes, before they are even tested for being punctuation characters.

gsub("(?!')[[:punct:]]", "", str2, perl=TRUE)
# [1] "this doesn't not have an apostrophe"

Remove all special char except apostrophe

Maybe something like:

string.scan(/\b[\w']+\b/i).each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }

The regex uses word boundaries (\b).
scan returns an array of the words found. Each word is then added to a hash whose default value is zero, and its count is incremented.

It turns out that my solution, while it finds all items regardless of case, still leaves them in the case in which they were originally found.
It is then up to Nelly either to accept this as is, or to downcase the original string or each array item as it is added to the hash.

I'll leave that decision up to you :)
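For what it's worth, here is a minimal sketch of the "downcase the original string" option. The sample text and variable names are assumptions made for illustration.

```ruby
# Illustrative input (an assumption, not from the original question).
text = "Don't stop. don't Stop! It's Nelly's"

# Downcase first so that "Don't" and "don't" land in the same bucket,
# then count each word, keeping apostrophes inside words.
counts = text.downcase
             .scan(/\b[\w']+\b/)
             .each_with_object(Hash.new(0)) { |word, tally| tally[word] += 1 }

puts counts.inspect
```

Downcasing each item as it is added to the hash would give the same counts. Doing it once on the whole string just keeps the block simpler.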

How to fix this regex so that it returns only punctuation and words containing punctuation?

I realized that my requirements were a bit more involved than when I first posted.

I need to match partially hyphenated words (e.g. "-fast"), a bare 's, and even multi-hyphen words like "pay-as-you-go".

So I found the following regex to work.

regex = /\w*['-]\w*[-]*\w*[-]*\w*|[[:punct:]]+/

string = "The man, had a big-cat that his Sister's aunt gave him and was -fast 's very-very-big-cat.!!"

The sentence does not make much sense, but it includes good examples of both the punctuated words and the standalone punctuation I need to match.

string.scan(regex)

=> [",", "big-cat", "Sister's", "-fast", "'s", "very-very-big-cat", ".!!"]

There may be ways to improve the way the regex is written but it's the best I can do that gets the results I need.
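One possible simplification, offered as a sketch rather than a drop-in replacement, is to fold the repeated \w*[-]* segments into a single [\w-]* class. The name simplified is hypothetical. This variant no longer caps the number of hyphenated segments, so it may behave differently from the original on other inputs.

```ruby
string = "The man, had a big-cat that his Sister's aunt gave him and was -fast 's very-very-big-cat.!!"

# The regex from the answer above.
original = /\w*['-]\w*[-]*\w*[-]*\w*|[[:punct:]]+/

# Hypothetical simplification: after the first apostrophe or hyphen,
# accept any run of word characters and hyphens.
simplified = /\w*['-][\w-]*|[[:punct:]]+/

puts string.scan(simplified).inspect
puts string.scan(original) == string.scan(simplified)
```

On the sample sentence both patterns return the same list, but it is worth testing the simplified form against your own data before relying on it.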

Removing all punctuation apart from single apostrophes and hyphens within words

I think this does it:

gsub('( |^)-+|-+( |$)', '\\1', gsub("[^ [:alnum:]'-]", '', str1))
#[1] "I'm dash before word word dash in-between word two before word word just dashes between words word word"

