Regex "Punct" Character Class Matches Different Characters Depending on Ruby Version

Ruby 1.9.3 used US_ASCII as its default encoding, which properly matched all punctuation. Ruby 2.0 switched its default encoding to UTF-8, introducing the bug you discovered, which causes punctuation to be improperly matched. Ruby 2.4 patched this bug.

The correct behavior is to match all punctuation, as Ruby 1.9.3 and 2.4 do. This is consistent with the POSIX regex definition of punctuation.

One choice for making your code consistent is to encode all strings as US_ASCII or an alternative which doesn't have the UTF-8 bug:

matched, unmatched = chars.partition { |c| c.encode(Encoding::US_ASCII) =~ /[[:punct:]]/ }

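But that's probably not desirable because it forces you to use a restrictive encoding for your strings, and encode raises Encoding::UndefinedConversionError for any character that US_ASCII cannot represent.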

The other option is to manually define the punctuation:

/[!"\#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]/

It's somewhat inelegant, but you can throw it into a variable and add it to regexes that way:

punctuation = /[!"\#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]/
my_regex = /#{punctuation}/
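
As a rough sketch of how that explicit class could stand in for [[:punct:]] in the partition example (the constant name ASCII_PUNCT and the sample characters are made up for illustration):

```ruby
# ASCII_PUNCT is a hypothetical name. The class spells out the 32 ASCII
# punctuation characters, so it does not rely on how [[:punct:]] is defined.
ASCII_PUNCT = /[!"\#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]/

# Sample input, chosen only for illustration.
chars = %w[a $ b + c ~ d . 1 <]

matched, unmatched = chars.partition { |c| c =~ ASCII_PUNCT }

p matched   # ["$", "+", "~", ".", "<"]
p unmatched # ["a", "b", "c", "d", "1"]
```

Because every character is listed literally, the result should not depend on the string encoding or the regex engine version.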

Regex slightly different in Ruby 2?

The regular expression engine has been changed to Onigmo (based on Oniguruma) and this might be causing issues.

As far as I can tell, you're declaring the regular expression incorrectly. The second set of brackets is not required:

/[^[:space:]\d\-,\.]/

The [:space:] declaration is only valid inside of a set, so you will see it appear as [[:space:]] if used in isolation. In your case you have several other additions to the set.

I'm not sure why \s would not have sufficed in this case.
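
Here is a small sketch of how the negated set behaves, plus one possible reason to prefer [[:space:]] over \s. The sample strings are made up, and the non-breaking space comparison assumes UTF-8 strings:

```ruby
pattern = /[^[:space:]\d\-,\.]/

# Only whitespace, digits, '-', ',' and '.' appear here, so nothing matches.
no_hit = "12,5 - 3.0" =~ pattern
# The 'x' at index 5 is outside the excluded set, so it is the first match.
hit = "12,5 x 3.0" =~ pattern

p no_hit # nil
p hit    # 5

# \s covers ASCII whitespace only, while [[:space:]] also covers
# Unicode whitespace such as the non-breaking space U+00A0.
nbsp = "\u00A0"
p nbsp =~ /\s/          # nil
p nbsp =~ /[[:space:]]/ # 0
```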

Regex incorrectly matching punctuation (including spaces)

As far as I understand, the regular expression you have there does the following (I'm not familiar with the Ruby variety and am still quite new to regex myself, so this will give you an idea but may not be 100% correct):

  1. Go to the beginning of the string
  2. Ensure the string matches any number of any characters followed by a lowercase letter, e.g. --a
  3. Ensure the string matches any number of any characters followed by an uppercase letter, e.g. --aA
  4. Ensure the string matches any number of any characters followed by a number, e.g. --aA0
  5. If that is all true, make sure the beginning of the string is followed by at least 6 random characters, e.g. --aA0-
  6. Ensure that is followed by a single non-punctuation character (although this is the part I'm not sure about, as I haven't used character classes before, and don't know if it's [^[:punct:]] or [^:punct:]), e.g. --aA0-c
  7. Ensure that is followed directly by the end of the string

Now, the lookaheads would also allow a different order of occurrences, e.g. 0---Aa, as long as the string contains any characters followed by what they are looking for.

What you probably want is ^[a-zA-Z0-9]{6,}$, i.e. at least six characters, with the characters being letters and numbers (though that would also allow aaaaaa, for example).

Maybe try ^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])[a-zA-Z0-9]{6,}$ to make sure each group is present, and to get alpha-numerical characters (at least six of them) only.
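
To see what that last pattern accepts, here is a quick sketch. The constant name STRICT and the sample strings are only for illustration:

```ruby
# Pattern copied from the suggestion above.
STRICT = /^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])[a-zA-Z0-9]{6,}$/

# Made-up samples: two valid, one missing groups, one with punctuation, one too short.
samples = ["abcDEF1", "0bcDEf", "aaaaaa", "aA0---", "aA0"]

results = samples.map { |s| [s, s.match?(STRICT)] }.to_h
results.each { |s, ok| puts "#{s.inspect}: #{ok}" }
# "abcDEF1": true
# "0bcDEf": true
# "aaaaaa": false
# "aA0---": false
# "aA0": false
```

One Ruby-specific caveat: ^ and $ match at line boundaries, so when validating a whole string \A and \z are usually the safer anchors.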

I always use a tool such as http://www.regexpal.com/ to slowly build up my regex and to see where I go wrong, deconstructing a "bad" regex until I get to a "good" one, then slowly adding to it again.

Hope that helps. :)

P.S.: I'm still a bit unclear how many characters you want to match in total, i.e. if the string is fixed length or not...?

Select a string in regex with ruby

You can try an alternative approach: matching everything you want to keep then joining the result.

You can use this regex to match everything you want to keep:

[A-Z\d+| ^]|<?=>

As you can see, this just uses | and [] to create a list of strings that you want to keep: uppercase letters, digits, +, |, space, ^, => and <=>.

Example:

"aA azee + B => C=".scan(/[A-Z\d+| ^]|<?=>/).join()

Output:

"A  + B => C"

Note that there are 2 consecutive spaces between "A" and "+". If you don't want that you can call String#squeeze.
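
A short sketch of that squeeze step, reusing the example string from above (the variable name kept is just for illustration):

```ruby
kept = "aA azee + B => C=".scan(/[A-Z\d+| ^]|<?=>/).join

p kept              # "A  + B => C"
# Passing " " limits the squeeze to runs of spaces only.
p kept.squeeze(" ") # "A + B => C"
```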

Testing for word characters in Ruby/Rails regular expressions for all languages

Yes, you're definitely on the right track with :alpha:. Here's a Unicode-aware example from https://stackoverflow.com/a/3879835/499581:

/\A[[:alpha:]]+\Z/

Also, for certain punctuation, consider using:

/[[:punct:]]/
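
A rough sketch of how the [[:alpha:]] pattern behaves on non-ASCII input. The sample words are made up, written with \u escapes so the snippet does not depend on the source file's encoding:

```ruby
word = /\A[[:alpha:]]+\Z/

accented = "h\u00e9llo" =~ word            # "héllo"
cjk      = "\u65e5\u672c\u8a9e" =~ word    # "日本語"
digit    = "abc1" =~ word                  # contains a digit
newline  = "abc\n" =~ word                 # \Z tolerates one trailing newline

p accented # 0
p cjk      # 0
p digit    # nil
p newline  # 0
```

If a trailing newline should be rejected as well, \z can be used in place of \Z.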


How to get the Ruby Regexp punctuation list from the `[[:punct:]]`?

[[:punct:]] refers to what is considered punctuation in Unicode. For example: https://www.fileformat.info/info/unicode/category/Po/list.htm

s = "foo\u1368bar" # => "foo፨bar"
s.split(/[[:punct:]]/) # => ["foo", "bar"]

Sorry, but my question is about how to get that list using Ruby.

For lack of a better idea, you can always loop from 1 to whatever the maximum code point in Unicode is now, treat each number as a character code, generate a one-character string, and match it against the [[:punct:]] regex. Here's a quick and dirty implementation:

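# Collect every code point up to 65535 that matches [[:punct:]].
punct = 1.upto(65535).map do |x|
  x.chr(Encoding::UTF_8)
rescue RangeError
  # Surrogate code points (0xD800..0xDFFF) cannot be encoded as UTF-8.
  nil
end.compact.select do |s|
  s =~ /[[:punct:]]/
end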
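
The snippet above stops at 65535, which covers only the Basic Multilingual Plane. Here is a sketch that walks the full range instead, assuming 0x10FFFF as the upper bound. The names MAX_CODE_POINT, SURROGATES and all_punct are made up for illustration, and the resulting list depends on the Unicode tables shipped with your Ruby version:

```ruby
MAX_CODE_POINT = 0x10FFFF     # highest Unicode code point
SURROGATES = (0xD800..0xDFFF) # not encodable as UTF-8 characters

all_punct = (1..MAX_CODE_POINT).each_with_object([]) do |cp, acc|
  next if SURROGATES.cover?(cp)
  begin
    ch = cp.chr(Encoding::UTF_8)
  rescue RangeError
    # Defensive: skip anything else Ruby refuses to encode.
    next
  end
  acc << ch if ch.match?(/[[:punct:]]/)
end

puts all_punct.size
puts all_punct.first(10).join(" ")
```

This checks a little over a million code points, so it takes noticeably longer than the 65535 version.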

Result (as displayed on my macOS machine):

[image: unicode punctuation]


