"Vertical" Regex Matching in an Ascii "Image"

vertical regex matching in an ASCII image


Answer to question 1

To answer the first question one could use:

(?xm)                    # ignore comments and whitespace, ^ matches beginning of line
^ # beginning of line
(?:
. # any character except \n
(?= # lookahead
.*+\n # go to next line
( \1?+ . ) # add a character to the 1st capturing group
.*+\n # next line
( \2?+ . ) # add a character to the 2nd capturing group
)
)*? # repeat as few times as needed
X .*+\n # X on the first line and advance to next line
\1?+ # if 1st capturing group is defined, use it, consuming exactly the same number of characters as on the first line
X .*+\n # X on the 2nd line and advance to next line
\2?+ # if 2st capturing group is defined, use it, consuming exactly the same number of characters as on the first line
X # X on the 3rd line

Online demo

This expression works in Perl, PCRE, Java and should work in .NET.

The expression uses lookaheads with self referencing capturing groups to add a character for every repetition of the lookahead (this is used to "count").

\1?+ means if \1 matches (or is defined) consume it, and don't give it back (don't backtrack). In this case it's equivalent to (?(1) \1 ). Which means match \1 if \1 is defined.

polygenelubricants explains this kinds of lookaheads with backreferences very nicely in his answer for How can we match a^n b^n with Java regex?. (He has also written about other impressive tricks for Java regex involving backreferences and lookarounds.)

Answer to question 2

Plain matching

When just using matching and requiring the answer (count) in the number of matches, then the question 2 answer would be:

It can not be directly solved in regex flavors that have a limited lookbehind. While other flavors like Java and .NET could (as for example in m.buettner's .NET solution).

Thus plain regex matches in Perl and PCRE (PHP, etc) cannot directly answer this question in this case.

(Semi?)proof

Assume that no variable length lookbehinds are available.

You have to in some way count the number of characters on a line before an X.

Only way to do that is to match them, and since no variable length lookbehinds are available you have to start the match (at least) at the beginning of the line.

If you start the match at the beginning of a line you can only get at most one match per line.

Since there can be multiple occurrences per line, this would not count them all and would not give a correct answer.

Length/indirect solution

On the other hand if we accept the answer as the length of a match or substitution result, then the 2nd question can be answered in PCRE and Perl (and other flavors).

This solution is based on/inspired by m.buettner's nice "partial PCRE solution".

One could simply replace all matches of the following expression with $3, getting the answer to question two (the number of patterns of interests) as the length of the resulting string.

^
(?:
(?: # match .+? characters
.
(?= # counting the same number on the following two lines
.*+\n
( \1?+ . )
.*+\n
( \2?+ . )
)
)+?
(?<= X ) # till the above consumes an X
(?= # that matches the following conditions
.*+\n
\1?+
(?<= X )
.*+\n
\2?+
(?<= X )
)
(?= # count the number of matches
.*+\n
( \3?+ . ) # the number of matches = length of $3
)
)* # repeat as long as there are matches on this line
.*\n? # remove the rest of the line

Which in Perl could be written as:

$in =~ s/regex/$3/gmx;
$count = length $in;

Online demo

This expression is similar to the solution to question 1 above, with some modifications to include X in the characters matched in the first lookahead, wrapped with a quantifier and counting number of matches of the quantifier.

Except for direct matches this is as close as it gets (extra code wise besides regex), and could be an acceptable answer to question 2.

Test cases

Some test cases and results for the above solution. Result showing the numerical answer (length of the resulting string) and in parenthesis the resulting string after the substitution(s).

Test #0:
--------------------
X
X
X

result: 1 (X)


Test #1:
--------------------
..X....
..X....
..X....

result: 1 (.)


Test #2:
--------------------
..X.X..
..X.X..
....X..

result: 1 (.)


Test #3:
--------------------
..X....
..X....
...X...

result: 0 ()


Test #4:
--------------------
..X....
...X...
..X....

result: 0 ()


Test #5:
--------------------
....X..
.X..X..
.X.....

result: 0 ()


Test #6:
--------------------
.X..X..
.X.X...
.X.X...

result: 1 (.)


Test #7:
--------------------
.X..X..
.X..X..
.X..X..

result: 2 (.X)


Test #8:
--------------------
XXX
XXX
XXX

result: 3 (XXX)


Test #9:
--------------------
X.X.X
XXXXX
XXXXX
.X.X.

result: 5 (XXXXX)


Test #10:
--------------------
1....X.......
2..X..X...X....
3X.X...X..X.....
4X....XXXXXX.....
5X..XXX...........
6.....X..........
7.........X....X
8..X......X....X....
9..X......X....X....X...
A....X.....
B.X..X..
C.....
XXX
XXX
XXX
.

result: 8 (3458.XXX)

php vertical regular expression search

You can do something like this (without regular expressions) :

$map = explode("\n", $inputmap);

for ($vcount = 0; $vcount < sizeof($map); ++$vcount) { // Loop through the map vertically
for ($hcount = 0; $hcount < strlen($map[$vcount]); ++$hcount) { // Loop through each character of each line
if ($map[$vcount][$hcount] == "F") {
if ($map[$vcount + 1][$hcount - 1] == "F" && $map[$vcount + 1][$hcount] == "F" &&
$map[$vcount + 1][$hcount + 1] == "F" && $map[$vcount + 2][$hcount] == "F")
echo "Pattern found, starting at : (v)$vcount x (h)$hcount";
}
}
}

$> php test.php
php test.php
Pattern found, starting at : (v)18 x (h)8
Pattern found, starting at : (v)22 x (h)30
$>

But I must admit that it could take a while for extra-big maps.

/!\ add some code to verify that the line $vcount + 1/2 actually exists, and the char $hcount + 1 too.

What is a regular expression that will match a character at certain indexes?

You can use this regex:

([a-zA-Z])..\1..\1

RegEx Demo

  • ([a-zA-Z]) will match and capture any letter in group #`
  • Then another 2 instances of same letter at n+3 and n+6 positions are matched using back-reference \1, where n is the first position of letter.

Matching a^n b^n c^n for n 0 with PCRE


Qtax trick

(The mighty self-referencing capturing group)

^(?:a(?=a*(\1?+b)b*(\2?+c)))+\1\2$

This solution is also named "The Qtax trick" because it uses the same technique as from "vertical" regex matching in an ASCII "image" by Qtax.


The problem in question burns down to a need to assert that three groups are matched of the same length. As a simplified version, to match:

xyz

Where x, y and z are really just subpatterns with a variable with matching length n of a, b and c. With an expression that uses lookaheads with self-referencing capturing groups, a character we specify is added to each repetition of the lookahead, which can effectively be used to "count":

aaabbbccc
^ ^ ^

This is achieved by the following:

  • (?:a…)+ A character of subpattern a is matched. With (?=a*, we skip directly to the "counter".
  • (\1?+b) Capturing group (\1) effectively consumes whatever has previously been matched, if it is there, and uses a possessive match which does not permit backtracking, and the match fails if the counter goes out of sync - That is, there has been more of subpatterns b than subpattern a. On the first iteration, this is absent, and nothing is matched. Then, a character of subpattern b is matched. It is added to the capturing group, effectively "counting" one of b in the group. With b*, we skip directly to the next "counter".
  • (\2?+c) Capturing group (\2) effectively consumes whatever has previously been matched just like the above. Because this additional character capture works just like the previous group, characters are allowed to sync up in length within these character groups. Assuming continuous sequences of a..b..c..:

(Excuse my art.)

First iteration:

| The first 'a' is matched by the 'a' in '^(?:a…)'.
| The pointer is stuck after it as we begin the lookahead.
v,- Matcher pointer
aaaa......cccc...
^^^ |^^^ ^
skipped| skipped Matched by c in (\2?+c);
by a* | by b* \2 was "nothing",
| now it is "c".
Matched by b
in (\1?+b).
\1 was "nothing", now it is "b".

Second iteration:

 | The second 'a' is matched by the 'a' in '^(?:a…)'.
| The pointer is stuck after it as we begin the lookahead.
v,- Matcher pointer
aaaa......cccc...
/|^^^ |^
eaten by| skipped |Matched by c in (\2?+c);
\1?+ | by b* | '\2' was "nothing",
^^ | \2?+ now it is "cc".
skipped|
by a* \ Matched by b
in (\1?+b).
'\1' was "nothing", now it is "bb".

As the three groups discussed above "consumes" one of each of a, b, c respectively, they are matched in round-robin style and "counted" by the (?:a…)+, (\1?+b) and (\2?+c) groups respectively. With the additional anchoring and capturing what we started, we can assert that we match xyz (Representing each group above) where x, y and z are an, bn and cn respectively.


As a bonus, to "count" more, one can do this:

Pattern: ^(?:a(?=a*(\1?+b)b*(\2?+c)))+\1{3}\2$
Matches: abbbc
aabbbbbbcc
aaabbbbbbbbbccc
Pattern: ^(?:a(?=a*(\1?+bbb)b*(\2?+c)))+\1\2$
Matches: abbbc
aabbbbbbcc
aaabbbbbbbbbccc

Capturing Quantifiers and Quantifier Arithmetic

I don't know a regex engine that can capture a quantifier. However, it is possible with PCRE or Perl to use some tricks to check if you have the same number of characters. With your example:

@@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"
you can check if @ = - / are balanced with this pattern that uses the famous Qtax trick, (are you ready?): the "possessive-optional self-referencing group"

~(?<!@)((?:@(?=[^=]*(\2?+=)[^-]*(\3?+-)[^/]*(\4?+/)))+)(?!@)(?=[^=]*\2(?!=)[^-]*\3(?!-)[^/]*\4(?!/))~

pattern details:

~                          # pattern delimiter
(?<!@) # negative lookbehind used as an @ boundary
( # first capturing group for the @
(?:
@ # one @
(?= # checks that each @ is followed by the same number
# of = - /
[^=]* # all that is not an =
(\2?+=) # The possessive optional self-referencing group:
# capture group 2: backreference to itself + one =
[^-]*(\3?+-) # the same for -
[^/]*(\4?+/) # the same for /
) # close the lookahead
)+ # close the non-capturing group and repeat
) # close the first capturing group
(?!@) # negative lookahead used as an @ boundary too.

# this checks the boundaries for all groups
(?=[^=]*\2(?!=)[^-]*\3(?!-)[^/]*\4(?!/))
~

The main idea

The non-capturing group contains only one @. Each time this group is repeated a new character is added in capture groups 2, 3 and 4.

the possessive-optional self-referencing group

How does it work?

( (?: @ (?= [^=]* (\2?+ = ) .....) )+ )

At the first occurence of the @ character the capture group 2 is not yet defined, so you can not write something like that (\2 =) that will make the pattern fail. To avoid the problem, the way is to make the backreference optional: \2?

The second aspect of this group is that the number of character = matched is incremented at each repetition of the non capturing group, since an = is added each time. To ensure that this number always increases (or the pattern fails), the possessive quantifier forces the backreference to be matched first before adding a new = character.

Note that this group can be seen like that: if group 2 exists then match it with the next =

( (?(2)\2) = )

The recursive way

~(?<!@)(?=(@(?>[^@=]+|(?-1))*=)(?!=))(?=(@(?>[^@-]+|(?-1))*-)(?!-))(?=(@(?>[^@/]+|(?-1))*/)(?!/))~

You need to use overlapped matches, since you will use the @ part several times, it is the reason why all the pattern is inside lookarounds.

pattern details:

(?<!@)                # left @ boundary
(?= # open a lookahead (to allow overlapped matches)
( # open a capturing group
@
(?> # open an atomic group
[^@=]+ # all that is not an @ or an =, one or more times
| # OR
(?-1) # recursion: the last defined capturing group (the current here)
)* # repeat zero or more the atomic group
= #
) # close the capture group
(?!=) # checks the = boundary
) # close the lookahead
(?=(@(?>[^@-]+|(?-1))*-)(?!-)) # the same for -
(?=(@(?>[^@/]+|(?-1))*/)(?!/)) # the same for /

The main difference with the precedent pattern is that this one doesn't care about the order of = - and / groups. (However you can easily make some changes to the first pattern to deal with that, with character classes and negative lookaheads.)

Note: For the example string, to be more specific, you can replace the negative lookbehind with an anchor (^ or \A). And if you want to obtain the whole string as match result you must add .* at the end (otherwise the match result will be empty as playful notices it.)

Pre-compiled regex with special characters matching

You do need to escape the *. One way to do it is by the quotemeta \Q operator:

use warnings;
use strict;

my $qr = qr/\Q*FOO/;

while (<DATA>) { print if /$qr/ }

__DATA__
//FOO programming using stuff and things
hey *FOO.BAR

Note that this escapes all ASCII non-"word" characters through the rest of the pattern. If you need to limit its action to only a part of the pattern then stop it using \E. Please see linked docs.

The above determines whether *FOO is in the line, regardless of whether it is a word or a part of one. It is not clear to me which is needed. Once that is specified the pattern can be adjusted.

Note that /\*FOO/ works, too. What you tried failed probably because of all the rest that you are trying to match, which purpose I do not understand. If you only need to detect whether the pattern is present the above does it. if there is a more specific requirement please clarify.


As for the examples: for me that string //FOO... is not matched by the main (first) $pattern you show. The second one won't interpolate $word -- but is firstly much too convoluted. The regex can really tie one in nasty knots when pushed; I suggest to keep it simple as much as possible.

On Which Line Number Was the Regex Match Found?

Possible... with Regex Trickery!

Disclaimer: This is not meant to be a practical solution, but an illustration of a way to use an extension of a terrific regex hack. Moreover, it only works on regex engines that allow capture groups to refer to themselves. For instance, you could use it in Notepad++, as it uses the PCRE engine—but not in Java.

Let's say your file is:

some code
more code
hey, hello!
more code

At the bottom of the file, paste :1:2:3:4:5:6:7, where : is a delimiter not found in the rest of the code, and where the numbers go at least as high as the number of lines.

Then, to get the line of the first hello, you can use:

(?m)(?:(?:^(?:(?!hello).)*(?:\r?\n))(?=[^:]+((?(1)\1):\d+)))*.*hello(?=[^:]+((?(1)\1)+:(\d+)))

The line number of the first line containing hello will be captured by Group 2.

  • In the demo, see Group 2 capture in the right pane.
  • The hack relies on a group referring to itself. In the classic @Qtax trick, this is done with (?>\1?). For diversity, I used a conditional instead.

Explanation

  • The first part of the regex is a line skipper, which captures an increasing amount of the the line counter at the bottom to Group 1
  • The second part of the regex matches hello and captures the line number to Group 2
  • Inside the line skipper, (?:^(?:(?!hello).)*(?:\r?\n)) matches a line that doesn't contain hello.
  • Still inside the line skipper, the (?=[^:]+((?(1)\1):\d+)) lookahead gets us to the first : with [^:]+ then the outer parentheses in ((?(1)\1):\d+)) capture to Group 1... if Group 1 is set (?(1)\1) then Group 1, then, regardless, a colon and some digits. This ensures that each time the line skipper matches a line, Group 1 expands to a longer portion of :1:2:3:4:5:6:7
  • The * mataches the line skipper zero or more times
  • .*hello matches the line with hello
  • The lookahead (?=[^:]+((?(1)\1)+:(\d+))) is identical to the one in the line skipper, except that this time the digits are captured to Group 2: (\d+)
  • -

Reference

  • Qtax trick (recently awarded an additional bounty by @AmalMurali)
  • Replace a word with the number of the line on which it is found


Related Topics



Leave a reply



Submit