How the Look-Ahead and Look-Behind Concept Supports Such Zero-Width Assertions Concept in Regex of Ruby

How the Look-ahead and Look-behind concept supports such Zero-Width Assertions concept in Regex of Ruby?

Regular expressions match from left to right, and move a sort of "cursor" along the string as they go. If your regex contains a regular character like a, this means: "if there's a letter a in front of the cursor, move the cursor ahead one character, and keep going. Otherwise, something's wrong; back up and try something else." So you might say that a has a "width" of one character.

A "zero-width assertion" is just that: it asserts something about the string (i.e., doesn't match if some condition doesn't hold), but it doesn't move the cursor forwards, because its "width" is zero.

You're probably already familiar with some simpler zero-width assertions, like ^ and $. These match the start and end of a string. If the cursor isn't at the start or end when it sees those symbols, the regex engine will fail, back up, and try something else. But they don't actually move the cursor forwards, because they don't match characters; they only check where the cursor is.

Lookahead and lookbehind work the same way. When the regex engine tries to match them, it checks around the cursor to see if the right pattern is ahead of or behind it, but in case of a match, it doesn't move the cursor.

Consider:

/(?=foo)foo/.match 'foo'

This will match! The regex engine goes like this:

  1. Start at the beginning of the string: |foo.
  2. The first part of the regex is (?=foo). This means: only match if foo appears after the cursor. Does it? Well, yes, so we can proceed. But the cursor doesn't move, because this is zero-width. We still have |foo.
  3. Next is f. Is there an f in front of the cursor? Yes, so proceed, and move the cursor past the f: f|oo.
  4. Next is o. Is there an o in front of the cursor? Yes, so proceed, and move the cursor past the o: fo|o.
  5. Same thing again, bringing us to foo|.
  6. We reached the end of the regex, and nothing failed, so the pattern matches.

On your four assertions in particular:

  • (?=...) is "lookahead"; it asserts that ... does appear after the cursor.

    1.9.3p125 :002 > 'jump june'.gsub(/ju(?=m)/, 'slu')
    => "slump june"

    The "ju" in "jump" matches because an "m" comes next. But the "ju" in "june" doesn't have an "m" next, so it's left alone.

    Since it doesn't move the cursor, you have to be careful when putting anything after it. (?=a)b will never match anything, because it checks that the next character is a, then also checks that the same character is b, which is impossible.

  • (?<=...) is "lookbehind"; it asserts that ... does appear before the cursor.

    1.9.3p125 :002 > 'four flour'.gsub(/(?<=f)our/, 'ive')
    => "five flour"

    The "our" in "four" matches because there's an "f" immediately before it, but the "our" in "flour" has an "l" immediately before it so it doesn't match.

    Like above, you have to be careful with what you put before it. a(?<=b) will never match, because it checks that the next character is a, moves the cursor, then checks that the previous character was b.

  • (?!...) is "negative lookahead"; it asserts that ... does not appear after the cursor.

    1.9.3p125 :003 > 'child children'.gsub(/child(?!ren)/, 'kid')
    => "kid children"

    "child" matches, because what comes next is a space, not "ren". "children" doesn't.

    This is probably the one I get the most use out of; finely controlling what can't come next comes in handy.

  • (?<!...) is "negative lookbehind"; it asserts that ... does not appear before the cursor.

    1.9.3p125 :004 > 'foot root'.gsub(/(?<!r)oot/, 'eet')
    => "feet root"

    The "oot" in "foot" is fine, since there's no "r" before it. The "oot" in "root" clearly has an "r".

    As an additional restriction, most regex engines require that ... has a fixed length in this case. So you can't use ?, +, *, or {n,m}.

You can also nest these and otherwise do all kinds of crazy things. I use them mainly for one-offs I know I'll never have to maintain, so I don't have any great examples of real-world applications handy; honestly, they're weird enough that you should try to do what you want some other way first. :)


Afterthought: The syntax comes from Perl regular expressions, which used (? followed by various symbols for a lot of extended syntax because ? on its own is invalid. So <= doesn't mean anything by itself; (?<= is one entire token, meaning "this is the start of a lookbehind". It's like how += and ++ are separate operators, even though they both start with +.

They're easy to remember, though: = indicates looking forwards (or, really, "here"), < indicates looking backwards, and ! has its traditional meaning of "not".


Regarding your later examples:

irb(main):002:0> "foresight".sub(/(?!s)ight/, 'ee')
=> "foresee"

irb(main):003:0> "foresight".sub(/ight/, 'ee')
=> "foresee"

Yes, these produce the same output. This is that tricky bit with using lookahead:

  1. The regex engine has tried some things, but they haven't worked, and now it's at fores|ight.
  2. It checks (?!s). Is the character after the cursor s? No, it's i! So that part matches and the matching continues, but the cursor doesn't move, and we still have fores|ight.
  3. It checks ight. Does ight come after the cursor? Well, yes, it does, so move the cursor: foresight|.
  4. We're done!

The cursor moved over the substring ight, so that's the full match, and that's what gets replaced.

Doing (?!a)b is useless, since you're saying: the next character must not be a, and it must be b. But that's the same as just matching b!

This can be useful sometimes, but you need a more complex pattern: for example, (?!3)\d will match any digit that isn't a 3.

This is what you want:

1.9.3p125 :001 > "foresight".sub(/(?<!s)ight/, 'ee')
=> "foresight"

This asserts that s doesn't come before ight.

Regex lookahead, lookbehind and atomic groups

Examples

Given the string foobarbarfoo:

bar(?=bar)     finds the 1st bar ("bar" which has "bar" after it)
bar(?!bar) finds the 2nd bar ("bar" which does not have "bar" after it)
(?<=foo)bar finds the 1st bar ("bar" which has "foo" before it)
(?<!foo)bar finds the 2nd bar ("bar" which does not have "foo" before it)

You can also combine them:

(?<=foo)bar(?=bar)    finds the 1st bar ("bar" with "foo" before it and "bar" after it)

Definitions

Look ahead positive (?=)

Find expression A where expression B follows:

A(?=B)

Look ahead negative (?!)

Find expression A where expression B does not follow:

A(?!B)

Look behind positive (?<=)

Find expression A where expression B precedes:

(?<=B)A

Look behind negative (?<!)

Find expression A where expression B does not precede:

(?<!B)A

Atomic groups (?>)

An atomic group exits a group and throws away alternative patterns after the first matched pattern inside the group (backtracking is disabled).

  • (?>foo|foot)s applied to foots will match its 1st alternative foo, then fail as s does not immediately follow, and stop as backtracking is disabled

A non-atomic group will allow backtracking; if subsequent matching ahead fails, it will backtrack and use alternative patterns until a match for the entire expression is found or all possibilities are exhausted.

  • (foo|foot)s applied to foots will:

    1. match its 1st alternative foo, then fail as s does not immediately follow in foots, and backtrack to its 2nd alternative;
    2. match its 2nd alternative foot, then succeed as s immediately follows in foots, and stop.

Some resources

  • http://www.regular-expressions.info/lookaround.html
  • http://www.rexegg.com/regex-lookarounds.html

Online testers

  • https://regex101.com

Regex to match string containing two names in any order

You can do checks using positive lookaheads. Here is a summary from the indispensable regular-expressions.info:

Lookahead and lookbehind, collectively called “lookaround”, are
zero-length assertions...lookaround actually matches characters, but
then gives up the match, returning only the result: match or no match.
That is why they are called “assertions”. They do not consume
characters in the string, but only assert whether a match is possible
or not.

It then goes on to explain that positive lookaheads are used to assert that what follows matches a certain expression without taking up characters in that matching expression.

So here is an expression using two subsequent postive lookaheads to assert that the phrase matches jack and james in either order:

^(?=.*\bjack\b)(?=.*\bjames\b).*$

Test it.

The expressions in parentheses starting with ?= are the positive lookaheads. I'll break down the pattern:

  1. ^ asserts the start of the expression to be matched.
  2. (?=.*\bjack\b) is the first positive lookahead saying that what follows must match .*\bjack\b.
  3. .* means any character zero or more times.
  4. \b means any word boundary (white space, start of expression, end of expression, etc.).
  5. jack is literally those four characters in a row (the same for james in the next positive lookahead).
  6. $ asserts the end of the expression to me matched.

So the first lookahead says "what follows (and is not itself a lookahead or lookbehind) must be an expression that starts with zero or more of any characters followed by a word boundary and then jack and another word boundary," and the second look ahead says "what follows must be an expression that starts with zero or more of any characters followed by a word boundary and then james and another word boundary." After the two lookaheads is .* which simply matches any characters zero or more times and $ which matches the end of the expression.

"start with anything then jack or james then end with anything" satisfies the first lookahead because there are a number of characters then the word jack, and it satisfies the second lookahead because there are a number of characters (which just so happens to include jack, but that is not necessary to satisfy the second lookahead) then the word james. Neither lookahead asserts the end of the expression, so the .* that follows can go beyond what satisfies the lookaheads, such as "then end with anything".

I think you get the idea, but just to be absolutely clear, here is with jack and james reversed, i.e. "start with anything then james or jack then end with anything"; it satisfies the first lookahead because there are a number of characters then the word james, and it satisfies the second lookahead because there are a number of characters (which just so happens to include james, but that is not necessary to satisfy the second lookahead) then the word jack. As before, neither lookahead asserts the end of the expression, so the .* that follows can go beyond what satisfies the lookaheads, such as "then end with anything".

This approach has the advantage that you can easily specify multiple conditions.

^(?=.*\bjack\b)(?=.*\bjames\b)(?=.*\bjason\b)(?=.*\bjules\b).*$

Regular Expression for partial match length - String similarity

The best approach, by far, would be to build a suffix tree on the input string. Building suffix trees takes only O(n) time where n is the length of the string. A suffix tree consists logically of a tree in which all the suffixes of the string can be found by walking from the root to each leaf. You can read Wikipedia for more details on how these trees work.

Essentially, a suffix tree will allow you to simply recast your current problem as one of "finding" the original string in the suffix tree. As you walk down the tree, you count the number of suffixes in each subtree, and multiply by your current match length to determine your score. This "search" takes O(n) time too.

So the final result is that you can solve the problem in guaranteed O(n) time and O(n) space, with O(n) preprocessing time. This is pretty efficient! And, there are no "worst cases" that produce quadratic behaviour. With that you can probably handle strings up to 10^7 in length easily.

The only difficulty in implementation will be building the suffix tree, but you can find freely available code for that.

Regular expression to match a line that doesn't contain a word

The notion that regex doesn't support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:

^((?!hede).)*$

Non-capturing variant:

^(?:(?!:hede).)*$

The regex above will match any string, or line without a line break, not containing the (sub)string 'hede'. As mentioned, this is not something regex is "good" at (or should do), but still, it is possible.

And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):

/^((?!hede).)*$/s

or use it inline:

/(?s)^((?!hede).)*$/

(where the /.../ are the regex delimiters, i.e., not part of the pattern)

If the DOT-ALL modifier is not available, you can mimic the same behavior with the character class [\s\S]:

/^((?!hede)[\s\S])*$/

Explanation

A string is just a list of n characters. Before, and after each character, there's an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":

    ┌──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┐
S = │e1│ A │e2│ B │e3│ h │e4│ e │e5│ d │e6│ e │e7│ C │e8│ D │e9│
└──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┘

index 0 1 2 3 4 5 6 7

where the e's are the empty strings. The regex (?!hede). looks ahead to see if there's no substring "hede" to be seen, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width-assertions because they don't consume any characters. They only assert/validate something.

So, in my example, every empty string is first validated to see if there's no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group, and repeated zero or more times: ((?!hede).)*. Finally, the start- and end-of-input are anchored to make sure the entire input is consumed: ^((?!hede).)*$

As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).

A Regular Expression with a Conditional in Lookbehind

I think I understand the issue now. The important difference between a conditional inside of a lookbehind and a conditional not inside a lookbehind is the timing of when the conditional is executed (or where the search cursor is at that point.

The example without a lookbehind:

(?(?=go)good|bad)\smorning

good morning

The conditional is run at the beginning of the search. So the search cursor is before the 'g' in 'good'.

 good morning
^

So at this point the look ahead evaluates to TRUE since it sees matches on the 'go'

In the example with the lookbehind, the cursor is at a different location.

(?<=(?(?=go)good|bad)\s)morning

The search cursor finds the first required item in the text: 'morning'

good morning
^

The actual search cursor stays in place to consume 'morning' if it the lookbehind matches. The lookbehind will use its own cursor to verify what is before 'morning' to determine if this 'morning' is a match. The lookbehind states that there is a '\s' directly before 'morning' and indeed there is. The temporary lookbehind cursor moves to the space:

good morning
^^

Now it gets to the conditional and runs the lookahead in the conditional statement. At this point the lookahead is looking for 'go' but it sees ' morning'. So the conditional fails. The expression says to try to match on 'bad' (or dab backwards from the lookbehind cursor) but it sees good (or doog backwards from the lookbehind cursor). So no match for 'morning'.

Solution

Since the conditional is run at the end of the word of interest when it is in a lookbehind (instead of at the beginning of that word outside of a lookbehind), the secret is to reverse the conditional to a lookbehind instead of a lookahead:

(?<=(?(?<=good)good|bad)\s)morning

This actually matches on morning. It look nonsensical in this example to look for a word and then match it - but it illustrates the concept. In the real world case stated in the question, the solution looks like this:

(?<=(?(?<=\s\S{1,2})[A-Z0-9]+|(?![A-Z]+\s)[0-9-A-Z/"']+))\s+matching\stext

This looks for a word before matching text. The conditional checks to see if it is a one or two character word. If so, it should consist of only letters and numbers. If not, it must not only consist of letters. If this is met, it can consist of letters, numbers, dashes, slashes, quotes and apostrophes.

Two things changed:

  1. Modified the conditional to be a lookbehind instead of a lookahead
  2. I had to move the '\s' from the beginning of the lookbehind into the conditional, because the conditional is being processed before that space and it results in matching on the last one or two characters of the word instead of looking for a one or two character word. That is a tricky issue, because when the expression is not in a lookbehind (for instance if you want the text in the lookbehind to be included in the match), this change messes up the match.

What follows is some more analysis of the original problem. Enjoy!

@zx81 had an example that is actually better at illustrating what is going on. This example has more cursor movement so it does help illustrate what is happening:

    (?<=(?(?=go)good|bad))\w+pher

badphilosopher <-- 'philosopher' matches
goodgopher <-- 'gopher' matches
badgopher <-- no match

There is a big difference in this example because the \w+ is used. So the regex engine immediately matches all the text in each example since the phrase has no white space and ends in 'pher'.

So for 'badphilosopher':

Lookbehind is run and the conditional is immediately run looking for 'go' but finds 'ba'

badphilosopher
^

The condition failed so it tries to match bad to the left of the cursor, but we are at the beginning of the phrase, so no match.

It checks again at these two cursor points because the '\w+pher' matches each time:

badphilosopher
^

But the lookbehind sees 'b' then 'ba'

When the cursor gets to:

badphilosopher
^

The conditonal again fails to find a 'go' (it sees 'ph') so it attempts to match 'bad' to the left of the cursor and finds it! So there fore the \w+pher matches the philosopher text.

goodgopher
^

goodgopher matches in a similar way except that the conditional is successful.

badgopher
^

badgopher does not match because the conditonal is successful but 'good' is not found to the left of the cursor.

Putting a space in there really changes things up, because the /w+pher no longer matches the entire string.

    (?<=(?(?=go)good|bad)\s+)\w+pher

bad philosopher <-- matches philosopher
good gopher <-- no match
bad gopher <-- matches gopher

In this case the cursor moves through the string until it can match \w+pher:

bad philosopher
^

At this point it starts to process the lookbehind -- and sees that a '\s+' is required to the left of the search cursor -- it finds this and moves the temporary lookbehind cursor over.

bad philosopher
^^

The conditional is run now and sees looking for 'go' at the temp lookbehind cursor but finds ' p'. The failure means trying to match bad to the left of the temp lookbehind cursor and indeed it finds it there.

good gopher
^^

The 'good gopher' example gets to the conditional and sees ' g' so it fails and then looks for 'bad' to the left of the cursor and doesn't find it. So this fails.

bad philosopher
^^

Similarly, 'bad philosopher' gets to the conditonal and finds ' p' and looks for 'bad' to the left of the cursor and finds it. So it matches.

When run without the lookbehind, all of these examples match. This can be rather counterintuitive - but you have to take the location of the cursors into account in the lookbehind.

Variable-length lookbehind-assertion alternatives for regular expressions

Most of the time, you can avoid variable length lookbehinds by using \K.

s/(?<=foo.*)bar/moo/s;

would be

s/foo.*\Kbar/moo/s;

Anything up to the last \K encountered is not considered part of the match (e.g. for the purposes of replacement, $&, etc)

Negative lookbehinds are a little trickier.

s/(?<!foo.*)bar/moo/s;

would be

s/^(?:(?!foo).)*\Kbar/moo/s;

because (?:(?!STRING).)* is to STRING as [^CHAR]* is to CHAR.


If you're just matching, you might not even need the \K.

/foo.*bar/s

/^(?:(?!foo).)*bar/s

What does \@= and \@= mean in Vim command?

From vim documentation for patterns

\@= Matches the preceding atom with zero width. {not in Vi}
Like "(?=pattern)" in Perl.
Example matches
foo\(bar\)\@= "foo" in "foobar"
foo\(bar\)\@=foo nothing

*/zero-width*
When using "\@=" (or "^", "$", "\<", "\>") no characters are included
in the match. These items are only used to check if a match can be
made. This can be tricky, because a match with following items will
be done in the same position. The last example above will not match
"foobarfoo", because it tries match "foo" in the same position where
"bar" matched.

Note that using "\&" works the same as using "\@=": "foo\&.." is the
same as "\(foo\)\@=..". But using "\&" is easier, you don't need the
braces.

\@<= Matches with zero width if the preceding atom matches just before what
follows. |/zero-width| {not in Vi}
Like '(?<=pattern)" in Perl, but Vim allows non-fixed-width patterns.
Example matches
\(an\_s\+\)\@<=file "file" after "an" and white space or an
end-of-line
For speed it's often much better to avoid this multi. Try using "\zs"
instead |/\zs|. To match the same as the above example:
an\_s\+\zsfile

"\@<=" and "\@<!" check for matches just before what follows.
Theoretically these matches could start anywhere before this position.
But to limit the time needed, only the line where what follows matches
is searched, and one line before that (if there is one). This should
be sufficient to match most things and not be too slow.
The part of the pattern after "\@<=" and "\@<!" are checked for a
match first, thus things like "\1" don't work to reference \(\) inside
the preceding atom. It does work the other way around:
Example matches
\1\@<=,\([a-z]\+\) ",abc" in "abc,abc"

Couldn't understand the actual use of -Fpat switch in ruby

Try this:

@ubuntu:~$ ruby -a -n -Fp -e 'puts $_;puts $F[3]'
apf drvoph hlpvg tlpbbpnn
apf drvoph hlpvg tlpbbpnn
vg tl

@ubuntu:~$ ruby -a -n -e 'puts $_;puts $F[3]'
apf drvoph hlpvg tlpbbpnn
apf drvoph hlpvg tlpbbpnn
tlpbbpnn

The pattern after -F ('p' in this case) is used as the separator instead of a blank space.



Related Topics



Leave a reply



Submit