Ruby Regex VS Python Regex

The last time I checked, they differed substantially in their Unicode support. Ruby in 1.9 at least has some very limited Unicode support. I believe one or two Unicode properties might be supported by now. Probably the general categories and maybe the scripts were the two I'm thinking of.

Python has less and more Unicode support at the same time. Python does seem to make it possible to meet the requirements of RL1.2a "Compatibility Properties" from UTS#18 on Unicode Regular Expressions.

That said, there is a really rather nice Python library out there by Matthew Barnett (mrab) that finally adds a couple of Unicode properties to Python regexes. He supports the two most important ones: the general categories, and the script properties. It has some other intriguing features as well. It deserves some good publicity.

I don't think either Ruby or Python supports Unicode all that terribly well, although more and more gets done every day. In particular, however, neither meets even the barebones Level 1 requirement for Unicode Regular Expressions cited above. For example, RL1.2 requires that at least 11 properties be supported: General_Category, Script, Alphabetic, Uppercase, Lowercase, White_Space, Noncharacter_Code_Point, Default_Ignorable_Code_Point, ANY, ASCII, and ASSIGNED.

I think Python only lets you get to some of those, and only in a roundabout way. Of course, there are many, many other properties beyond these 11.
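To make the "roundabout way" concrete, here is a minimal sketch that reaches General_Category through the standard unicodedata module instead of a \p{...} escape, which the standard re module does not offer. The helper name chars_in_category and the sample string are made up for illustration.

```python
import unicodedata

def chars_in_category(text, prefix):
    """Return the characters of text whose General_Category starts with prefix."""
    return [ch for ch in text if unicodedata.category(ch).startswith(prefix)]

# Sample text (illustrative): U+00DC, U+00EF and U+03A9 are written as escapes.
sample = "\u00dcn\u00efcode \u03a9mega 42"

uppercase = chars_in_category(sample, "Lu")  # roughly what \p{Lu} would select
digits = chars_in_category(sample, "Nd")     # roughly what \p{Nd} would select

print(uppercase)
print(digits)
```

This filters characters one at a time rather than matching inside a pattern, which is why it only approximates real property support.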

When you're looking for Unicode support, there's more than just UTS#18 on Regular Expressions, of course, although that is the one that matters most to this question, and neither Ruby nor Python is Level 1 compliant. Other very important aspects of Unicode include UAX#15, UAX#14, UTS#10, UAX#11, UAX#29, and of course the crucial UAX#44. Python has libraries for at least a couple of those, I know. I don't know that they're standard.
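As one example of those other aspects, UAX#15 normalization is available in Python's standard unicodedata module. The sketch below, with a made-up sample word, shows why it matters for regexes: the same visible text can fail to match until both sides are in the same normalization form.

```python
import re
import unicodedata

composed = "caf\u00e9"      # 'é' as one precomposed code point
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

pattern = re.compile(composed)

# The decomposed spelling looks identical but has different code points.
match_raw = pattern.fullmatch(decomposed)

# Normalizing the subject to NFC first makes the two spellings agree.
match_nfc = pattern.fullmatch(unicodedata.normalize("NFC", decomposed))

print(match_raw is None, match_nfc is not None)
```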

But when it comes to regular expression support, um, there are richer alternatives than just those two, you know. :)

Convert ruby regular expression definition to python regex

To define a named group, you need to use (?P<name>...), and then (?P=name) for a named backreference inside the pattern.
If you can afford a 3rd party library, you may use the PyPI regex module and keep the approach you had in Ruby (as regex supports multiple identically named capturing groups):

s = """%q<Some-name1> "some-name2" 'some-name3'"""

GEM_NAME = r'[a-zA-Z0-9_.-]+'
QUOTED_GEM_NAME = r'(?:(?P<gq>["\'])(?<name>{0})(?P=gq)|%q<(?P<name>{0})>)'.format(GEM_NAME)
print(QUOTED_GEM_NAME)
# => (?:(?P<gq>["\'])(?<name>[a-zA-Z0-9_.-]+)(?P=gq)|%q<(?P<name>[a-zA-Z0-9_.-]+)>)

import regex
res = [x.group("name") for x in regex.finditer(QUOTED_GEM_NAME, s)]
print(res)
# => ['Some-name1', 'some-name2', 'some-name3']


If you decide to go with Python re, it can't handle identically named groups in one regex pattern.

You can discard the named groups altogether and use numbered ones, and use re.finditer to iterate over all the matches with comprehension to grab the right capture.

Example Python code:

import re
GEM_NAME = r'[a-zA-Z0-9_.-]+'
QUOTED_GEM_NAME = r"([\"'])({0})\1|%q<({0})>".format(GEM_NAME)
s = """%q<Some-name1> "some-name2" 'some-name3'"""
matches = [x.group(2) if x.group(1) else x.group(3) for x in re.finditer(QUOTED_GEM_NAME, s)]
print(matches)
# => ['Some-name1', 'some-name2', 'some-name3']

So, ([\"'])({0})\1|%q<({0})> has 3 capturing groups. If Group 1 matched, the first alternative matched, so Group 2 is taken; otherwise the second alternative matched, and the comprehension grabs the Group 3 value.

Pattern details

  • ([\"']) - Group 1: a " or '
  • ({0}) - Group 2: GEM_NAME pattern
  • \1 - inline backreference to the Group 1 captured value (note that r'...' raw string literal allows using a single backslash to define a backreference in the string literal)
  • | - or
  • %q< - a literal substring
  • ({0}) - Group 3: GEM_NAME pattern
  • > - a literal >.
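If you would rather keep names while staying with the standard re module, one possible variation is to give the two name groups different names and pick whichever one participated in the match. This is only a sketch: the group names quoted and percent are invented here and are not part of the original answer.

```python
import re

GEM_NAME = r'[a-zA-Z0-9_.-]+'

# Two differently named groups (hypothetical names), since re rejects
# a pattern that defines the same group name twice.
QUOTED_GEM_NAME = (
    r'''(?P<gq>["'])(?P<quoted>{0})(?P=gq)|%q<(?P<percent>{0})>'''.format(GEM_NAME)
)

s = """%q<Some-name1> "some-name2" 'some-name3'"""

# A group that did not participate returns None, so "or" falls through to the other.
names = [m.group('quoted') or m.group('percent')
         for m in re.finditer(QUOTED_GEM_NAME, s)]
print(names)
```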

Convert python search regex to ruby regex

You may use

resp.body[/window\.__APOLLO_STATE__ = JSON\.parse\("(.*?)"\);/, 1]

Here,

  • /.../ is a regex literal notation that is very convenient when defining regex patterns
  • Literal dots are escaped, else, they match any char but line break chars
  • The .+? is changed to .*? to be able to match empty values (otherwise you may overmatch, and it is easier to discard empty matches later than to fix overmatches)
  • 1 tells the engine to return the value of the capturing group with ID 1 of the first match. If you need multiple matches, use resp.body.scan(/regex/).
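For comparison, the Python side of this conversion would look roughly like the sketch below. The original Python code is not shown above, so this is an assumed equivalent, and the body string is a made-up stand-in for resp.body.

```python
import re

# Hypothetical response body; the real page content is not shown in the answer.
body = 'var x = 1; window.__APOLLO_STATE__ = JSON.parse("payload"); var y = 2;'

pattern = re.compile(r'window\.__APOLLO_STATE__ = JSON\.parse\("(.*?)"\);')

# group(1) plays the role of the trailing ", 1" in the Ruby string[regex, 1] form.
match = pattern.search(body)
state = match.group(1) if match else None

# .*? also accepts an empty payload, which .+? would not.
empty = pattern.search('window.__APOLLO_STATE__ = JSON.parse("");').group(1)

print(repr(state), repr(empty))
```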

Do Python regular expressions have an equivalent to Ruby's atomic grouping?

Before version 3.11, Python's re module did not directly support this feature (3.11 added (?>...) atomic groups and possessive quantifiers). On older versions you can emulate it by using a zero-width lookahead assertion ((?=RE)), which matches from the current point with the same semantics you want, putting a named group ((?P<name>RE)) inside the lookahead, and then using a named backreference ((?P=name)) to match exactly whatever the zero-width assertion matched. Combined together, this gives you the same semantics, at the cost of creating an additional matching group, and a lot of syntax.

For example, the link you provided gives the Ruby example of

/"(?>.*)"/.match('"Quote"') #=> nil

We can emulate that in Python as such:

re.search(r'"(?=(?P<tmp>.*))(?P=tmp)"', '"Quote"') # => None

We can show that I'm doing something useful and not just spewing line noise, because if we change it so that the inner group doesn't eat the final ", it still matches:

re.search(r'"(?=(?P<tmp>[A-Za-z]*))(?P=tmp)"', '"Quote"').groupdict()
# => {'tmp': 'Quote'}

You can also use anonymous groups and numeric backreferences, but this gets awfully full of line-noise:

re.search(r'"(?=(.*))\1"', '"Quote"') # => None

(Full disclosure: I learned this trick from perl's perlre documentation, which mentions it under the documentation for (?>...).)
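As a side-by-side check, the sketch below runs the emulation next to the native (?>...) syntax, which the standard re module accepts from Python 3.11 onward. The version guard is there so the snippet still runs on older interpreters, where only the emulation is exercised.

```python
import re
import sys

# Emulated atomic group: the lookahead captures, the backreference consumes.
emulated = re.compile(r'"(?=(?P<tmp>.*))(?P=tmp)"')
emulated_result = emulated.search('"Quote"')

# Native atomic group, only compiled where the syntax is supported.
if sys.version_info >= (3, 11):
    native_result = re.search(r'"(?>.*)"', '"Quote"')
else:
    native_result = None

# With an inner pattern that cannot swallow the closing quote, the match succeeds.
letters_only = re.search(r'"(?=(?P<tmp>[A-Za-z]*))(?P=tmp)"', '"Quote"')

print(emulated_result, native_result, letters_only.group('tmp'))
```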

In addition to having the right semantics, this also has the appropriate performance properties. If we port an example out of perlre:

[nelhage@anarchique:~/tmp]$ cat re.py
import re
import timeit

re_1 = re.compile(r'''\(
(
[^()]+ # x+
|
\( [^()]* \)
)+
\)
''', re.X)
re_2 = re.compile(r'''\(
(
(?=(?P<tmp>[^()]+ ))(?P=tmp) # Emulate (?> x+)
|
\( [^()]* \)
)+
\)''', re.X)

print(timeit.timeit("re_1.search('((()' + 'a' * 25)",
                    setup="from __main__ import re_1",
                    number=10))

print(timeit.timeit("re_2.search('((()' + 'a' * 25)",
                    setup="from __main__ import re_2",
                    number=10))

We see a dramatic improvement:

[nelhage@anarchique:~/tmp]$ python re.py
96.0800571442
7.41481781006e-05

Which only gets more dramatic as we extend the length of the search string.

Difference between \A \z and ^ $ in Ruby regular expressions

If you're depending on the regular expression for validation, you always want to use \A and \z. In Ruby, ^ and $ match at the start and end of any line rather than of the whole string, which means an attacker could submit an email like me@example.com\n<script>dangerous_stuff();</script> and still have it validate, since the regex only has to match the line before the \n.

My recommendation would just be completely stripping newlines from a username or email beforehand, since there's pretty much no legitimate reason for one. Then you can safely use EITHER \A \z or ^ $.
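The same pitfall can be reproduced in Python, with two differences worth stating: Python's ^ and $ only behave like Ruby's when re.MULTILINE is set, and Python spells the strict end-of-string anchor \Z. The email pattern below is a deliberately simplistic placeholder, not a real validator.

```python
import re

payload = "me@example.com\n<script>dangerous_stuff();</script>"

# Line anchors (Ruby's default behaviour, re.MULTILINE in Python).
line_anchored = re.compile(r"^[\w.]+@[\w.]+$", re.MULTILINE)

# String anchors: \A is start of string, \Z is end of string in Python.
string_anchored = re.compile(r"\A[\w.]+@[\w.]+\Z")

line_result = line_anchored.search(payload)      # matches the first line only
string_result = string_anchored.search(payload)  # None: the whole string must match

print(line_result.group(0) if line_result else None, string_result)
```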

Is there a difference between `[^\b]` and `.`?

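Inside a character class, \b means backspace, so [\b] matches a backspace and [^\b] matches any character that is not a backspace. Unlike ., that includes newlines.

Outside a character class, \b is a word boundary. That is a zero-width assertion, not a character, so it can't be included in a character class with that meaning.

The negation of a word boundary is \B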
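The question is about Ruby, but Python's re module treats \b inside a character class the same way, so the distinction can be checked with a short sketch like this one (the sample strings are arbitrary):

```python
import re

# Inside a class, \b is the backspace character (U+0008).
backspace_hits = re.findall(r'[\b]', 'a\bb')

# [^\b] accepts a newline, because a newline is simply "not a backspace".
not_backspace = re.fullmatch(r'[^\b]', '\n')

# '.' rejects a newline unless DOTALL is set.
dot_newline = re.fullmatch(r'.', '\n')

# Outside a class, \b is the zero-width word boundary.
boundary_hits = re.findall(r'\bcat\b', 'cat concat cat')

print(backspace_hits, not_backspace is not None, dot_newline, boundary_hits)
```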

Standard Regex vs python regex discrepancy

Thanks for the answers. I feel each answer had part of the answer. Here is what I was looking for.

  1. ? symbol is just a shorthand for (something|ε). Thus (a|ε) can be rewritten as a?. So the example becomes:

    b*(abb*)*a?

    In python we would write:

    p = re.compile(r'^b*(abb*)*a?$')
  2. The reason a straight translation of textbook regular expression syntax into Python (i.e. copy and paste) does not work is that Python's match() succeeds as soon as the pattern matches some prefix of the string (if the $ symbol is absent), while a theoretical regular expression has to match the entire string.

    So for example if we had a string:

    s = 'aa'

    Our textbook regex b*(abb*)*a? would not match it because it has two consecutive a's. However if we copy it straight to python:

    >> p = re.compile(r'b*(abb*)*a?')
    >> bool(p.match(s))
    True

    This is because our regex matches only the substring 'a' of our string 'aa'.

    In order to tell python to do a match on the whole string we have to tell it where the beginning and the end of the string is, with the ^ and $ symbols respectively:

    >> p = re.compile(r'^b*(abb*)*a?$')
    >> bool(p.match(s))
    False

    Note that python regex match() matches at the beginning of the string, so it automatically assumes the ^ at the start. However the search() function does not, and thus we keep the ^.

    So for example:

    >> s = 'aa'
    >> p = re.compile(r'b*(abb*)*a?$')
    >> bool(p.match(s))
    False # Correct
    >> bool(p.search(s))
    True # Incorrect - search ignored the first 'a'
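An alternative to adding ^ and $ by hand is re.fullmatch (available since Python 3.4), which only succeeds when the whole string matches and so lines up with the textbook semantics. A small sketch, with test strings chosen here for illustration:

```python
import re

# The textbook pattern, copied without anchors.
p = re.compile(r'b*(abb*)*a?')

# fullmatch() requires the pattern to cover the entire string.
candidates = ['', 'a', 'ab', 'aa', 'bab', 'baab']
results = {s: bool(p.fullmatch(s)) for s in candidates}

print(results)
```

Strings containing two consecutive a's are rejected, while the others are accepted.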

Pythonic way to do Ruby-like regular expression replace while evaluating the matched string

re.sub accepts a function as replacement. It gets the match object as sole parameter and returns the replacement string.

If you want to keep it a one-liner, a lambda will do the job: re.sub(r'#+', lambda m: "%0"+str(len(m.group(0))), string). I'd just use a small three-line def to avoid having all those parens in one place, but that's just my opinion.
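The def variant mentioned above might look like this. The function name pad_width and the sample string are invented for the example; the replacement logic is the same as in the lambda.

```python
import re

def pad_width(match):
    """Build the replacement from the length of the matched run of '#'."""
    return "%0" + str(len(match.group(0)))

# re.sub calls pad_width once per match and splices in its return value.
result = re.sub(r'#+', pad_width, "file_###.png")

print(result)
```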


