Regex Negative Lookbehind in Ruby Doesn't Seem to Work

Regex negative lookbehind in Ruby doesn't seem to work

Ruby 1.8's regex engine doesn't support lookbehind.

You'd need to switch to Ruby 1.9 or later, or use the Oniguruma library.


If that's not an option, you can search for "|," and replace it with some sort of marker. After all is said and done, put the "|," back.

You can also try a regex like:

/(?:[^|]), /

But obviously the (?:[^|]) is not zero-width, which means you'll need to do some extra work afterwards.
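
The marker workaround can be sketched as follows. The sample string and the NUL placeholder are assumptions made purely for illustration, and the lookbehind line at the end only applies to Ruby 1.9 or later; it is there to check that both approaches agree.

```ruby
text = "a, b|, c, d"   # sample input (assumed): split on ", " unless "|" comes first

# Marker workaround for an engine without lookbehind:
# hide every "|," behind a placeholder, split, then put "|," back.
marker = "\u0000"      # hypothetical placeholder; it must never occur in the input
with_marker = text.gsub("|,", marker)
                  .split(", ")
                  .map { |part| part.gsub(marker, "|,") }

# On Ruby 1.9+ a negative lookbehind does the same in one step.
with_lookbehind = text.split(/(?<!\|), /)

p with_marker      # ["a", "b|, c", "d"]
p with_lookbehind  # ["a", "b|, c", "d"]
```

The placeholder has to be something that cannot appear in real input, otherwise the restore step would corrupt the data.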

Unable to get my Ruby negative look ahead regex to work properly

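But wait, negative lookaheads can be variable length! So reverse the string and turn the variable-length negative lookbehind into a negative lookahead on the reversed text: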

R = /
  \b                  # match word break
  #{'apples'.reverse} # match 'selppa'
  \b                  # match word break
  (?!                 # begin a negative lookahead
    \s+               # match one or more whitespaces
    #{'bad'.reverse}  # match 'dab'
    \b                # match word break
  )                   # close negative lookahead
/ix # case-insensitive and free-spacing regex definition modes
#=> /
#     \b                  # match word break
#     selppa              # match 'selppa'
#     \b                  # match word break
#     (?!                 # begin a negative lookahead
#       \s+               # match one or more whitespaces
#       dab               # match 'dab'
#       \b                # match word break
#     )                   # close negative lookahead
#   /ix

# The string is reversed so that the lookahead acts like a lookbehind.
def avoid_bad_apples(str)
  str.reverse.match? R
end

avoid_bad_apples("good apples") #=> true
avoid_bad_apples("Simbad apples") #=> true
avoid_bad_apples("bad pears") #=> false
avoid_bad_apples("bad apples") #=> false
avoid_bad_apples("bad   apples") #=> false
avoid_bad_apples("good applesauce") #=> false
avoid_bad_apples("Very bad apples. BAD!") #=> false

How to do a negative lookbehind within a %r<…>-delimited regexp in Ruby?

As others have mentioned, this seems like an oversight, given how this character differs from the other paired delimiters.

As far as "Is there really no way to escape the < here?" there is a way... but you're not going to like it:

%r<(?#{'<'}!foo)> == %r((?<!foo))

Using interpolation to insert the < character seems to work. But given that there are much better options, I would avoid it unless you were planning on splitting the regex into sections anyway...
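
For comparison, here is a minimal sketch of the "better options": any delimiter other than < and > lets the lookbehind be written directly. The pattern (?<!foo)bar and the sample string are made up for illustration.

```ruby
slashes     = /(?<!foo)bar/
braces      = %r{(?<!foo)bar}
parens      = %r((?<!foo)bar)
from_string = Regexp.new('(?<!foo)bar')

# All four spellings carry the same pattern source.
p [slashes, braces, parens, from_string].map(&:source).uniq  # ["(?<!foo)bar"]

# Only the "bar" that is not preceded by "foo" is found.
p "foobar bazbar".scan(braces)  # ["bar"]
```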

regex negative look-ahead in ruby 1.9.3 vs 2.0.0

It's better to use this form, which anchors first and then applies the lookahead:

matched = array.grep(/^(?!foo\s).*\.bar$/)

It matches entries that do NOT start with foo followed by whitespace.

This works in both 2.1.1 and 1.9.3.

In case you want to see what I did:

# ruby-1.9.3-p362
array = ["foo a.bar", "b.bar"]
# => ["foo a.bar", "b.bar"]
matched = array.grep(/(?!^foo\s).*\.bar$/)
# => ["foo a.bar", "b.bar"]
matched = array.grep(/^(?!foo\s).*\.bar$/)
# => ["b.bar"]
matched = array.grep(/(?!^foo\s).*\.bar$/)
# => ["foo a.bar", "b.bar"]

# ruby-2.1.1
array = ["foo a.bar", "b.bar"]
# => ["foo a.bar", "b.bar"]
matched = array.grep(/(?!^foo\s).*\.bar$/)
# => ["b.bar"]
matched = array.grep(/^(?!foo\s).*\.bar$/)
# => ["b.bar"]
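
As a small self-contained sketch of the anchored form: with ^ outside the lookahead, the check runs exactly once, at the start of the line. The two extra array entries are assumptions added to show the edge cases.

```ruby
# "foobar.bar" and "foo c.baz" are extra sample entries added for illustration.
array = ["foo a.bar", "b.bar", "foobar.bar", "foo c.baz"]

# Keep entries that end in ".bar" and do not begin with "foo" plus whitespace.
matched = array.grep(/^(?!foo\s).*\.bar$/)

p matched  # ["b.bar", "foobar.bar"]
```

"foobar.bar" is kept because the lookahead only rejects foo when it is followed by whitespace.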

Regular Expression Lookbehind doesn't work with quantifiers ('+' or '*')

Many regular expression libraries allow only restricted expressions in lookbehind assertions, for example:

  • only match strings of the same fixed length: (?<=foo|bar|\s,\s) (three characters each)
  • only match strings of fixed lengths: (?<=foobar|\r\n) (each branch with fixed length)
  • only match strings with an upper bound length: (?<=\s{,4}) (up to four repetitions)

The main reason for these limitations is that those libraries can't process regular expressions backwards at all, or can do so only for a limited subset.

Another reason could be to keep authors from building overly complex regular expressions that are expensive to process because of so-called pathological behavior (see also ReDoS).

See also section about limitations of look-behind assertions on Regular-Expressions.info.

Alternative to negative lookbehind?

Let's first consider how it would be done with a lookbehind: check that what precedes the captured text is either the start of the line or a whitespace character:

(?<=^|\s)(\.\d{5,})

We could simply change that lookbehind to a normal capture group. The preceding whitespace then gets captured too, but in a replacement we can choose whether or not to put capture group 1 back:

(^|\s)(\.\d{5,})
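
A minimal Ruby sketch of that replacement, assuming a sample string and angle brackets as the (made-up) replacement format: group 1 is written back so the leading whitespace survives.

```ruby
text    = ".12345 .123456 NOT.1234567"   # sample input (assumed)
pattern = /(^|\s)(\.\d{5,})/

# \1 restores the captured start-of-line/whitespace, \2 is the number itself.
wrapped = text.gsub(pattern, '\1<\2>')

p wrapped  # "<.12345> <.123456> NOT.1234567"
```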

In the PCRE regex engine we have \K

\K : resets the starting point of the reported match. Any previously
consumed characters are no longer included in the final match

So by using that \K in the regex, the preceding space isn't included in the match

(?:^|\s)\K(\.\d{5,})


However, what happens if you use Ruby's scan with a regex that has capture groups?

It then returns only the contents of the capture groups (...), not the non-capturing groups (?:...) or anything else outside a capture group.

For example:

m = '.12345 .123456 NOT.1234567'.scan(/(?:^|\s)(\.\d{5,})/)
=> [[".12345"], [".123456"]]

m = 'ab123cd'.scan(/[a-z]+(\d+)(?:[a-z]+)/)
=> [["123"]]

So when you use scan, lookarounds don't need to be used.
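
The two approaches can be put side by side in a short sketch. It assumes Ruby 2.0 or later, where the regex engine also accepts \K; the sample string is the one used above.

```ruby
text = ".12345 .123456 NOT.1234567"

# With a capture group, scan returns one sub-array per match.
grouped = text.scan(/(?:^|\s)(\.\d{5,})/)
flat    = grouped.flatten

# With \K and no capture group, scan returns the trimmed matches directly.
with_k = text.scan(/(?:^|\s)\K\.\d{5,}/)

p grouped  # [[".12345"], [".123456"]]
p flat     # [".12345", ".123456"]
p with_k   # [".12345", ".123456"]
```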

Regular Expressions with lookahead in Ruby

Code:

test_string = 'this is, a , sentence33 Here, is another.'
result = test_string.gsub(/,(?=.*\d)/, 'comma')
puts result

Output:

this iscomma a comma sentence33 Here, is another.

Test:

http://ideone.com/9nt1b

Problem with quantifiers and look-behind

The issue is that Ruby doesn't support variable-length lookbehinds. Quantifiers aren't out per se, but they can't cause the length of the lookbehind to be nondeterministic.

Perl has the same restriction, as does just about every major language featuring regexes.

Try using the straightforward match (\w*)\W*?o instead of the lookbehind.
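
One way to see where a given Ruby build draws the line is to try compiling a few lookbehinds. This is only a sketch: the compiles lambda and the sample patterns are hypothetical helpers, and the answer for the unbounded quantifier depends on the engine version, so no output is claimed for it here.

```ruby
# Returns true when Ruby accepts the pattern, false when it raises RegexpError.
compiles = lambda do |source|
  begin
    Regexp.new(source)
    true
  rescue RegexpError
    false
  end
end

# Sample patterns (assumed for illustration), keyed by pattern source.
candidates = {
  "(?<=abc)d"  => "fixed length",
  "(?<=a{3})d" => "quantifier with a fixed count",
  "(?<=a|bc)d" => "top-level alternatives of different lengths",
  "(?<=a+)d"   => "unbounded quantifier"
}

results = candidates.map { |source, label| [label, source, compiles.call(source)] }
results.each { |label, source, ok| puts "#{label}: #{source} -> #{ok}" }
```

When the unbounded form is rejected, the capture-group rewrite suggested above is the usual way out.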

Understanding negative look aheads in regular expressions

In both of your cases, ^ is just the start of the line (since it's not used inside a character class). Since both ^ and the lookahead are zero-width assertions, we can switch them around in the first case - I think that makes it a bit easier to explain:

^(?!.*localhost).*$ 

The ^ anchors the expression to the beginning of the string. The lookahead then starts from that position and tries to find localhost anywhere in the string (the "anywhere" is taken care of by the .* in front of localhost). If localhost can be found, the subexpression of the lookahead matches, and therefore the negative lookahead causes the pattern to fail. Since the adjacent ^ forces the lookahead to start at the beginning of the string, the pattern overall cannot match. If, however, the .*localhost does not match (and hence localhost does not occur in the string), the lookahead succeeds, and the .*$ simply takes care of matching the rest of the string.

Now the other one

^((?!localhost).)*$

This time the lookahead only checks at the current position (there is no .* inside it). But the lookahead is repeated for every single character. This way it does check every single position again. Here is roughly what happens: the ^ makes sure that we're starting at the beginning of the string again. The lookahead checks whether the word localhost is found at that position. If not, all is well, and . consumes one character. The * then repeats both of those steps. We are now one character further in the string, and the lookahead checks whether the second character starts the word localhost - again, if not, all is well, and . consumes another character. This is done for every single character in the string, until we reach the end.

In this particular case both methods are equivalent, and you could select one based on performance (if it matters) or readability (if not; probably the first one). However, in other cases the second variant is preferable, because it allows you to do this repetition for a fixed part of the string, whereas the first variant will always check the entire string.
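
Both variants can be compared in a short Ruby sketch. The sample strings and the host= pattern are assumptions made for illustration; the last pattern shows the per-character form restricted to one part of the string.

```ruby
whole_string = /^(?!.*localhost).*$/   # one lookahead that scans the whole line
per_char     = /^((?!localhost).)*$/   # lookahead repeated before every character

samples = ["http://example.com/", "http://localhost:3000/", "localhost"]
results = samples.map { |s| [s, whole_string.match?(s), per_char.match?(s)] }
results.each { |s, a, b| puts "#{s.inspect}: #{a} #{b}" }
# "http://example.com/": true true
# "http://localhost:3000/": false false
# "localhost": false false

# The per-character form can guard just one segment, here the text
# between "host=" and the next ";" (a hypothetical format).
host_value = /host=((?!localhost)[^;])*;/
p "host=example.com;"[host_value]   # "host=example.com;"
p "host=my.localhost;"[host_value]  # nil
```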


