Why Does Ruby /[[:punct:]]/ Miss Some Punctuation Characters?

How do I match `:punct:` except for some character?

From Ruby docs:

A character class may contain another character class. By itself this isn't useful because [a-z[0-9]] describes the same set as [a-z0-9]. However, character classes also support the && operator which performs set intersection on its arguments.

So, "punctuation but not apostrophe" is:

[[:punct:]&&[^']]
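As a quick illustration, here is a minimal sketch of the intersection in use. The constant name and the sample string are made up for this example.

```ruby
# Intersect the POSIX punctuation class with "anything except an apostrophe".
PUNCT_EXCEPT_APOSTROPHE = /[[:punct:]&&[^']]/

sample = "Mr. O'Brien! Please don't go."   # illustrative input
matches = sample.scan(PUNCT_EXCEPT_APOSTROPHE)

# The apostrophes in "O'Brien" and "don't" are skipped.
p matches
```

To exclude more than one character, add them to the negated class, e.g. [[:punct:]&&[^'-]].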

EDIT: At revo's request in the question comments, I benchmarked the alternatives. On my machine, the lookahead version is ~10% slower and the lookbehind version ~20% slower than the intersection:

require 'benchmark'

N = 1_000_000
STR = "Mr. O'Brien! Please don't go, Mr. O'Brien!"

# Scan STR with the given regex N times.
def test(re)
  N.times { STR.scan(re).size }
end

Benchmark.bm(12) do |bm|
  bm.report("intersection") { test(/[[:punct:]&&[^']]/) }
  bm.report("lookahead")    { test(/(?!')[[:punct:]]/) }
  bm.report("lookbehind")   { test(/[[:punct:]](?<!')/) }
end
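Before comparing timings, it is worth checking that the three patterns really find the same characters. The following is a small sanity check, not part of the original benchmark; the variable names are illustrative.

```ruby
str = "Mr. O'Brien! Please don't go, Mr. O'Brien!"

patterns = {
  "intersection" => /[[:punct:]&&[^']]/,
  "lookahead"    => /(?!')[[:punct:]]/,
  "lookbehind"   => /[[:punct:]](?<!')/,
}

# Collect what each pattern finds in the same string.
results = patterns.map { |name, re| [name, str.scan(re)] }.to_h
results.each { |name, found| puts "#{name}: #{found.inspect}" }

# True when every pattern produced the same list of matches.
all_agree = results.values.uniq.size == 1
puts "all patterns agree: #{all_agree}"
```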

Regex punct character class matches different characters depending on Ruby version

Ruby 1.9.3 used US-ASCII as its default encoding, under which [[:punct:]] matched all punctuation properly. Ruby 2.0 switched the default encoding to UTF-8, which introduced the bug you discovered: some punctuation characters are no longer matched. Ruby 2.4 fixed this bug.

The correct behavior is to match all punctuation, as Ruby 1.9.3 and 2.4 do. This is consistent with the POSIX regex definition of punctuation.
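To see which characters a given interpreter misses, something like the following can be run under each Ruby version. This is only a sketch: it assumes "punctuation" means the 32 printable ASCII characters that are neither letters nor digits, and it builds them as UTF-8 strings because the behavior depends on the string encoding.

```ruby
# All printable ASCII characters that are neither letters nor digits,
# created as UTF-8 strings (Integer#chr without an argument would give US-ASCII).
ascii_punct = (33..126).map { |code| code.chr(Encoding::UTF_8) }
                       .reject { |c| c =~ /[A-Za-z0-9]/ }

# Characters that [[:punct:]] fails to match on this interpreter.
missed = ascii_punct.reject { |c| c =~ /[[:punct:]]/ }

puts "Ruby #{RUBY_VERSION}"
puts "not matched by [[:punct:]]: #{missed.inspect}"
```

On a version with the fix the list should be empty; on an affected version it shows exactly which characters need special handling.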

One choice for making your code consistent is to encode all strings as US_ASCII or an alternative which doesn't have the UTF-8 bug:

matched, unmatched = chars.partition { |c| c.encode(Encoding::US_ASCII) =~ /[[:punct:]]/ }

But that's probably not desirable, because it forces a restrictive encoding on your strings: encode raises Encoding::UndefinedConversionError for any character that has no US-ASCII equivalent.
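A short sketch of that limitation, using made-up sample characters: the conversion works for ASCII input but raises as soon as a character has no US-ASCII equivalent.

```ruby
# ASCII input converts fine and can then be tested against [[:punct:]].
ascii_ok = !("!".encode(Encoding::US_ASCII) =~ /[[:punct:]]/).nil?

# Non-ASCII input cannot be converted at all.
raised = false
begin
  "é".encode(Encoding::US_ASCII)
rescue Encoding::UndefinedConversionError => e
  raised = true
  puts "conversion failed: #{e.message}"
end

puts "ASCII punctuation matched: #{ascii_ok}"
puts "non-ASCII character raised: #{raised}"
```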

The other option is to manually define the punctuation:

/[!"\#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]/

It's somewhat inelegant, but you can throw it into a variable and add it to regexes that way:

punctuation = /[!"\#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]/
my_regex = /#{punctuation}/
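If getting the escaping right by hand feels error-prone, one alternative is to let Regexp.escape do it. This is a sketch rather than the original answer's approach, and the constant names are illustrative.

```ruby
# The 32 ASCII punctuation characters as a plain single-quoted string;
# only the apostrophe and the backslash need escaping here.
ASCII_PUNCT_CHARS = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

# Regexp.escape adds the backslashes needed inside the character class.
ASCII_PUNCT = Regexp.new("[#{Regexp.escape(ASCII_PUNCT_CHARS)}]")

p "Hi, there!".scan(ASCII_PUNCT)

# The resulting Regexp can be interpolated into larger patterns.
trailing_punct = /#{ASCII_PUNCT}+\z/
p "Really?!" =~ trailing_punct
```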

Regular expression in Ruby to capture Unicode punctuation marks?

\p{P}

This is the Unicode property for punctuation. It works in most Unicode-aware regex engines, not just in Ruby.
See http://www.regular-expressions.info/unicode.html
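A small sketch of what \p{P} does and does not cover, using a made-up sample string. Note that characters such as $ and + belong to the Unicode symbol categories, so they are matched by \p{S} rather than \p{P}.

```ruby
text = "¿Qué? «sí»… $5 + 3"   # illustrative input

punctuation = text.scan(/\p{P}/)   # Unicode punctuation only
symbols     = text.scan(/\p{S}/)   # Unicode symbols (currency, math, ...)
either      = text.scan(/[\p{P}\p{S}]/)

p punctuation
p symbols
p either
```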

Regex incorrectly matching punctuation (including spaces)

As far as I understand it, the regular expression you have does the following (I'm not familiar with the Ruby flavor and am still quite new to regex myself, so this should give you an idea but may not be 100% correct):

  1. Go to the beginning of the string
  2. Ensure the string matches any number of any characters followed by a lowercase letter, e.g. --a
  3. Ensure the string matches any number of any characters followed by an uppercase letter, e.g. --aA
  4. Ensure the string matches any number of any characters followed by a number, e.g. --aA0
  5. If that is all true, make sure the beginning of the string is followed by at least 6 arbitrary characters, e.g. --aA0-
  6. Ensure that is followed by a single non-punctuation character, e.g. --aA0-c (this is the part I'm least sure about, as I haven't used character classes before; as far as I can tell the correct form is [^[:punct:]], since [^:punct:] would only exclude the literal characters :, p, u, n, c and t)
  7. Ensure that is followed directly by the end of the string

Now, the lookaheads would also allow a different order of occurrences, e.g. 0---Aa, as long as the string contains any characters followed by what they are looking for.

What you probably want is ^[a-zA-Z0-9]{6,}$, i.e. at least six characters, with the characters being letters and numbers (though that would also allow aaaaaa, for example).

Maybe try ^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])[a-zA-Z0-9]{6,}$ to make sure each group is present, and to get alpha-numerical characters (at least six of them) only.
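One Ruby-specific caveat: in Ruby, ^ and $ match at the start and end of each line, not of the whole string, so validation patterns are usually anchored with \A and \z instead. The sketch below compares the two variants; the constant names and test strings are made up for illustration.

```ruby
# Anchored to the whole string.
STRICT     = /\A(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])[a-zA-Z0-9]{6,}\z/
# Anchored per line, as in the pattern suggested above.
LINE_BASED = /^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])[a-zA-Z0-9]{6,}$/

candidates = ["aA0aaa", "aaaaaa", "0---Aa", "aA0aa", "aA0aaa\n!!!"]

candidates.each do |s|
  puts "#{s.inspect}: strict=#{STRICT.match?(s)} line_based=#{LINE_BASED.match?(s)}"
end
```

The last candidate is the interesting one: its first line is valid on its own, so the line-based pattern accepts the whole string even though a second line of punctuation follows.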

I always use a tool such as http://www.regexpal.com/ to slowly build up my regex and to see where I go wrong, deconstructing a "bad" regex until I get to a "good" one, then slowly adding to it again.

Hope that helps. :)

P.S.: I'm still a bit unclear how many characters you want to match in total, i.e. if the string is fixed length or not...?


