Differences Between Ruby 1.9 and JavaScript Regexp

Regular Expression performing poorly in Ruby compared to JavaScript

I don't know why the regex engine in Ruby 1.8.7 is so much slower than JavaScript's or than Oniguruma in 1.9.2, but maybe this particular regex can benefit from wrapping its prefix, up to and including the @ symbol, in an atomic group, like this:

EMAIL_REGEXP = /
^
(?>(( # atomic group start
([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+
(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*
)
|
(
(\x22)
(
(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?
(
([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
|
(\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))
)
)*
(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?
(\x22)
)
)
@) # atomic group end
(
(
([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
|
(
([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*
([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
)
)
\.
)+
(
([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
|
(
([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*
([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
)
)
$
/xi

puts "aaaaa.bbbbbb.ccccc.ddddd@gmail.com".match EMAIL_REGEXP # returns immediately
puts "aaaaa.bbbbbb.ccccc.ddddd@gmail".match EMAIL_REGEXP # takes a long time

In this case the atomic group should prevent the engine from backtracking into the first part of the string when matching the part after the @ symbol fails, and that gives a significant speed-up. I'm not 100% sure that it doesn't break the regexp's logic, though, so I'd appreciate any comments.
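To make the backtracking point concrete, here is a minimal sketch on a toy pattern (not the email regex above; the constant names are made up for illustration). It shows that an atomic group never gives back what it has matched, which is both where the speed-up comes from and why it can change what matches:

```ruby
# Plain greedy quantifier: the engine may give back characters it consumed.
BACKTRACKING = /a+ab/
# Atomic group: whatever (?>...) matched is kept as-is, with no backtracking into it.
ATOMIC = /(?>a+)ab/

# a+ first takes "aaa", then gives one "a" back so that "ab" can match.
p BACKTRACKING.match("aaab")
# (?>a+) takes "aaa" and refuses to give anything back, so the match fails.
p ATOMIC.match("aaab")
```

In the email regex the atomic group ends with a literal @, so nothing inside it competes with what follows in this way, but that is an observation about this pattern rather than a general guarantee.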

Another option is to use non-capturing groups, which should generally be faster when you don't need backreferences to the groups, but they don't give any noticeable improvement in this case.
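For reference, a small sketch of the difference between capturing and non-capturing groups (the patterns and constant names are illustrative, not taken from the email regex): (?:...) groups for alternation or repetition without storing a capture.

```ruby
CAPTURING     = /(\w+)@(\w+)\.(com|org)/
NON_CAPTURING = /(?:\w+)@(\w+)\.(?:com|org)/

# Every (...) group is stored and numbered.
p "user@example.com".match(CAPTURING).captures      # => ["user", "example", "com"]
# Only the one remaining (...) group is stored.
p "user@example.com".match(NON_CAPTURING).captures  # => ["example"]
```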

How do I validate a regular expression in JavaScript using Ruby's rules?

The differences between Ruby's regex syntax and JavaScript's can be found here: Differences between Ruby 1.9 and Javascript regexp

I suggest taking the input string, removing all Ruby features (e.g. (?#comments...)) and testing whether it's a valid regex in JavaScript:

var valid;
try {
  // The RegExp constructor throws a SyntaxError on an invalid pattern.
  new RegExp(input);
  valid = true;
} catch (e) {
  valid = false;
}

Removing all Ruby features from the input string will be the tricky bit.

I can't think of an easier way without having some super-long Ruby-regex-validating regexp to hand.
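If a round trip to the server is acceptable (an assumption; the question asks for a client-side check), Ruby itself can apply Ruby's rules. A minimal sketch, where valid_ruby_regexp? is a hypothetical helper name:

```ruby
# Returns true when Ruby can compile the pattern source, false otherwise.
def valid_ruby_regexp?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

p valid_ruby_regexp?("(?#a Ruby comment)a+")  # => true
p valid_ruby_regexp?("a(b")                   # => false
```

This sidesteps the stripping problem entirely, at the cost of a request per validation.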

Ruby 1.9.3 Regex utf8 \w accented characters

Try

'ein grüner Hund'.scan(/[[:word:]]+/u)

Documentation
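As a quick check of why the POSIX bracket class helps: in Ruby 1.9 and later, \w only covers ASCII word characters, while [[:word:]] is Unicode-aware. A small sketch (the TEXT constant is just for illustration, and a UTF-8 source file is assumed):

```ruby
TEXT = 'ein grüner Hund'

# \w stops at the non-ASCII letter and splits the word in two.
p TEXT.scan(/\w+/)          # => ["ein", "gr", "ner", "Hund"]
# [[:word:]] treats "ü" as a word character.
p TEXT.scan(/[[:word:]]+/)  # => ["ein", "grüner", "Hund"]
```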

Ruby 1.9 regular expression to match (un)?quoted key-value assignment

It's probably possible to do in one regex pattern, but I am a believer in keeping the patterns simple. Regex can be insidious and hide lots of little errors. Keep it simple to avoid that, then tweak afterwards.

text = <<EOT
RAILS_ENV=production
listen_address = 127.0.0.1 # localhost only by default
PATH="/usr/local/bin"
EOT

text.scan(/^([^=]+)=(.+)/)
# => [["RAILS_ENV", "production"], ["listen_address ", " 127.0.0.1 # localhost only by default"], ["PATH", "\"/usr/local/bin\""]]

Trimming off the trailing comment is easy in a subsequent map:

text.scan(/^([^=]+)=(.+)/).map{ |n,v| [ n, v.sub(/#.+/, '') ] }
# => [["RAILS_ENV", "production"], ["listen_address ", " 127.0.0.1 "], ["PATH", "\"/usr/local/bin\""]]

If you want to normalize all your name/values so they have no extraneous spaces you can do that in the map also:

text.scan(/^([^=]+)=(.+)/).map{ |n,v| [ n.strip, v.sub(/#.+/, '').strip ] }
# => [["RAILS_ENV", "production"], ["listen_address", "127.0.0.1"], ["PATH", "\"/usr/local/bin\""]]

What the regex "/^([^=]+)=(.+)/" is doing is:

  1. "^" is "At the beginning of a line", which is either the start of the string or the position right after a "\n". This is not the same as the start of the string only, which would be \A. The difference is important, so if you don't understand the two it is a good idea to learn when and why you'd want to use one over the other. That's one of those places a regex can be insidious.
  2. "([^=]+)" is "Capture everything that is not an equal-sign".
  3. "=" is obviously the equal-sign we were looking for in the previous step.
  4. "(.+)" is going to capture everything after the equal-sign.
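The ^ versus \A distinction from step 1 can be seen in a short sketch (the LINES constant and the simplified \w+ key pattern are assumptions made for the example):

```ruby
LINES = "# comment\nKEY=value"

# ^ matches at the start of any line, so the second line is found.
p LINES[/^(\w+)=/, 1]   # => "KEY"
# \A matches only at the very start of the string, where "#" is not a word character.
p LINES[/\A(\w+)=/, 1]  # => nil
```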

I purposely kept the above pattern simple. For production use I'd tighten up the pattern a little using a "non-greedy" quantifier, along with a trailing "$" anchor:

text.scan(/^([^=]+?)=(.+)$/).map{ |n,v| [ n.strip, v.sub(/#.+/, '').strip ] }
# => [["RAILS_ENV", "production"], ["listen_address", "127.0.0.1"], ["PATH", "\"/usr/local/bin\""]]

  1. +? means find the first matching '='. It's already implied by the use of [^=], but +? makes my intent even more obvious. I can get away without the ?, but it's more of a self-documentation thing for later maintenance. In your use-case it should be benign, but it is a worthy thing to keep in your Regex Bag o' Tricks.
  2. $ means the end of a line, i.e., the position immediately before a "\n", or the end of the string. (The end of the string only would be \z.) It's implied also, since . does not match a newline, but inserting it in the pattern makes it more obvious that's what I'm searching for.
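A short sketch of both points, using made-up inputs: the first part shows where a lazy quantifier actually changes the result (when the class does not already exclude '='), the second shows $ matching at each line end while \z matches only at the end of the string.

```ruby
LINE = "key=a=b"

# Greedy .+ runs to the last '=' that still allows a match.
p LINE.match(/^(.+)=(.+)$/).captures    # => ["key=a", "b"]
# Lazy .+? stops at the first '='.
p LINE.match(/^(.+?)=(.+)$/).captures   # => ["key", "a=b"]

TWO_LINES = "a=1\nb=2"

# $ matches before each newline and at the end of the string.
p TWO_LINES.scan(/=(.+)$/)   # => [["1"], ["2"]]
# \z matches only at the very end of the string.
p TWO_LINES.scan(/=(.+)\z/)  # => [["2"]]
```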

EDIT to track the OP's added test:

text = <<EOT
RAILS_ENV=production
listen_address = 127.0.0.1 # localhost only by default
PATH="/usr/local/bin"
HOSTNAME=`cat /etc/hostname`
EOT

text.scan( /^ ( [^=]+? ) = ( .+ ) $/x ).map{ |n,v| [ n.strip, v.sub(/#.+/, '').strip ] }
# => [["RAILS_ENV", "production"], ["listen_address", "127.0.0.1"], ["PATH", "\"/usr/local/bin\""], ["HOSTNAME", "`cat /etc/hostname`"]]

If I was writing this for myself I'd generate a hash for convenience:

Hash[ text.scan( /^ ( [^=]+? ) = ( .+ ) $/x ).map{ |n,v| [ n.strip, v.sub(/#.+/, '').strip ] } ]
# => {"RAILS_ENV"=>"production", "listen_address"=>"127.0.0.1", "PATH"=>"\"/usr/local/bin\"", "HOSTNAME"=>"`cat /etc/hostname`"}

Ruby 1.9 regex as a hash key

It will not work without some extra code. As it stands, you are comparing a Regexp object with either an Integer or a String object. They won't be equal by value, nor by identity. The regexp would match the string, but making a hash lookup use that requires changes to the Hash class code.

irb(main):001:0> /(\d+)/.class
=> Regexp
irb(main):002:0> 2222.class
=> Fixnum
irb(main):003:0> '2222'.class
=> String
irb(main):004:0> /(\d+)/==2222
=> false
irb(main):007:0> /(\d+)/=='2222'
=> false
irb(main):009:0> /(\d+)/.equal?'2222'
=> false
irb(main):010:0> /(\d+)/.equal?2222
=> false

You would have to iterate over the hash and use =~ in something like:

hash.each do |k, v|
  # k is a Regexp key; print the value when it matches the lookup string
  puts v if k =~ whatever.to_s
end

Or change the Hash class to try =~ in addition to its normal matching conditions. (I think that last option would be difficult; in MRI the Hash class seems to have a lot of C code.)
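A self-contained sketch of the iteration approach, assuming a hash whose keys are regexps (RULES and lookup_by_pattern are hypothetical names, not part of any library):

```ruby
RULES = {
  /\A\d+\z/    => :number,
  /\A[a-z]+\z/ => :word
}

# Returns the value of the first Regexp key that matches key.to_s, or nil.
def lookup_by_pattern(rules, key)
  pair = rules.find { |pattern, _value| pattern =~ key.to_s }
  pair && pair.last
end

p lookup_by_pattern(RULES, 2222)   # => :number
p lookup_by_pattern(RULES, "abc")  # => :word
p lookup_by_pattern(RULES, "!!")   # => nil
```

This is a linear scan over the keys, so it gives up the constant-time lookup a hash normally provides.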

What's the difference between /\p{Alpha}/i and /\p{L}/i in ruby?

They seem to be equivalent. (Edit: sometimes, see the end of this answer)

It seems like Ruby has supported \p{Alpha} since version 1.9. In POSIX, \p{Alpha} is equal to \p{L&} (for regular expressions with Unicode support; see here). This matches all characters that have an upper- and lower-case variant (see here). Unicase letters would not be matched (while they would be matched by \p{L}).

This does not seem to be true for Ruby (I picked a random Arabic character, since Arabic has a unicase alphabet):

  • \p{L} (any letter) matches.
  • Case-sensitive classes \p{Lu}, \p{Ll}, \p{Lt} don't match. As expected.
  • \p{L&} doesn't match. As expected.
  • \p{Alpha} matches.

Which seems to be a very good indication that \p{Alpha} is just an alias for \p{L} in Ruby. On Rubular you can also see that \p{Alpha} was not available in Ruby 1.8.7.

Note that the i modifier is irrelevant in any case, because both \p{Alpha} and \p{L} match both upper- and lower-case characters anyway.
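The checks above can be reproduced with a short sketch. The specific character (ARABIC LETTER BEH, U+0628) is an assumption, since the answer only says "a random Arabic character", and \p{L&} is left out here:

```ruby
BEH = "\u0628"  # ARABIC LETTER BEH, a letter with no case variants

p BEH =~ /\p{L}/      # => 0   (any letter)
p BEH =~ /\p{Alpha}/  # => 0   (matches as well)
p BEH =~ /\p{Lu}/     # => nil (not an uppercase letter)
p BEH =~ /\p{Ll}/     # => nil (not a lowercase letter)
p BEH =~ /\p{Lt}/     # => nil (not a titlecase letter)
```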

EDIT:

Aha, there is a difference! I just found this PDF about Ruby's new regex engine (in use as of Ruby 1.9, as stated above). \p{Alpha} is available regardless of encoding (and will probably just match [A-Za-z] if there is no Unicode support), while \p{L} is specifically a Unicode property. That means \p{Alpha} behaves exactly as in POSIX regexes, with the difference that here it corresponds to \p{L}, but in POSIX it corresponds to \p{L&}.

How to match unicode words with ruby 1.9?

# encoding=utf-8 
p "föö".match(/\p{Word}+/)[0] == "föö"

Ruby Regular expression too big / Multiple string match

You might consider other mechanisms for recognizing 10k words.

  • Trie: Sometimes called a prefix tree, it is often used by spell checkers for doing word lookups. See Trie on Wikipedia.
  • DFA (deterministic finite automaton): A DFA is often created by the lexer in a compiler for recognizing the tokens of the language. A DFA runs very quickly. Simple regexes are often compiled into DFAs. See DFA on Wikipedia.
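As a rough illustration of the trie option, here is a minimal sketch built on nested hashes. The Trie class, its method names, and the :end marker are all choices made for this example, not an existing library API:

```ruby
# Minimal trie: each node is a Hash from character to child node,
# and the :end key marks that a complete word stops at that node.
class Trie
  def initialize
    @root = {}
  end

  def insert(word)
    node = @root
    word.each_char { |ch| node = (node[ch] ||= {}) }
    node[:end] = true
  end

  def include?(word)
    node = @root
    word.each_char do |ch|
      node = node[ch]
      return false unless node
    end
    node.key?(:end)
  end
end

trie = Trie.new
%w[cat car dog].each { |w| trie.insert(w) }
p trie.include?("car")  # => true
p trie.include?("ca")   # => false (only a prefix)
p trie.include?("cow")  # => false
```

Lookup cost depends on the length of the word rather than on how many words are stored. If only whole-word membership is needed, a Set from the standard library would be a simpler alternative.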

