How the Anchors \Z and \G Work in Ruby

How the Anchor \z and \G works in Ruby?

\z matches only at the very end of the input. You are trying to find a match where 4 occurs at the end of the input. The problem is that there is a newline at the end of the input, so \z finds no match. \Z matches either at the end of the input or just before a newline that ends the input.

So:

/\d\z/

matches the "4" in:

"24"

and:

/\d\Z/

matches the "4" in the above example and the "4" in:

"24\n"
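
A minimal sketch of the difference, using the two strings above (the variable names are only for illustration):

```ruby
plain = "24"
trailing_newline = "24\n"

# \z anchors at the absolute end of the string.
z_plain   = plain =~ /\d\z/
z_newline = trailing_newline =~ /\d\z/

# \Z anchors at the end, or just before a final newline.
big_z_plain   = plain =~ /\d\Z/
big_z_newline = trailing_newline =~ /\d\Z/

p z_plain        # => 1
p z_newline      # => nil
p big_z_plain    # => 1
p big_z_newline  # => 1
```

The index 1 is the position of the "4"; nil means no match.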

Check out this question for examples of using \G:

Examples of regex matcher \G (The end of the previous match) in Java would be nice


UPDATE: Real-World uses for \G

I came up with a more real-world example. Say you have a list of words that are separated by arbitrary characters that cannot be well predicted (or there are too many possibilities to list). You'd like to match these words, with each word as its own match, up until a particular word, after which you don't want to match any more words. For example:

foo,bar.baz:buz'fuzz*hoo-har/haz|fil^bil!bak

You want to match each word until 'har'. You don't want to match 'har' or any of the words that follow. You can do this relatively easily using the following pattern:

/(?<=^|\G\W)\w+\b(?<!har)/


The first attempt matches at the beginning of the input (the ^ alternative of the lookbehind, so no non-word character is involved), followed by 3 word characters ('foo') and a word boundary. Finally, a negative lookbehind ensures that the word which has just been matched does not end in 'har'.

On the second attempt, matching picks back up at the end of the last match. 1 non-word character is matched (',', though it is not part of the match because the lookbehind is a zero-width assertion), followed by 3 word characters ('bar').

This continues until 'har' is matched, at which point the negative lookbehind is triggered and the match fails. Because all matches are supposed to be "attached" to the last successful match, no additional words will be matched.

The result is:

foo
bar
baz
buz
fuzz
hoo
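
As a hedged alternative, the same chain can be sketched without the lookbehind by letting \G consume the separator and capturing the word. This is a different formulation from the pattern above, and words_before is a hypothetical helper name:

```ruby
# Hypothetical helper: collect words until the stop word is reached.
def words_before(input, stop_word)
  # \G pins each attempt to where the previous match ended, so the first
  # failure (at the stop word) ends the whole chain.
  pattern = /\G\W?(?!#{Regexp.escape(stop_word)}\b)(\w+)/
  input.scan(pattern).flatten
end

list = "foo,bar.baz:buz'fuzz*hoo-har/haz|fil^bil!bak"
p words_before(list, "har")
# => ["foo", "bar", "baz", "buz", "fuzz", "hoo"]
```

Because the separator is consumed rather than asserted, the word is read from capture group 1 instead of the whole match.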

If you want to reverse it and have all words after 'har' (but, again, not including 'har'), you can use an expression like this:

/(?!^)(?<=har\W|\G\W)\w+\b/


This will match a word which is immediately preceded either by 'har' and a non-word character, or by the end of the last match and a non-word character (except we have to make sure not to match at the beginning of the input). The list of matches is:

haz
fil
bil
bak

If you do want to match 'har' and all following words, you could use this:

/\bhar\b|(?!^)(?<=\G\W)\w+\b/


This produces the following matches:

har
haz
fil
bil
bak

What is the use of '\G' anchor in regex?

UPDATE

\G forces the pattern to return only matches that are part of a continuous chain of matches. After the first match, each subsequent match must begin exactly where the previous match ended. If you break the chain, the matches end.

<?php
$pattern = '#(match),#';
$subject = "match,match,match,match,not-match,match";

preg_match_all( $pattern, $subject, $matches );

// Outputs "match" 5 times: the search skips over "not-" and also
// matches the "match," inside "not-match,"
foreach ( $matches[1] as $match ) {
    echo $match . '<br />';
}

echo '<br />';

$pattern = '#(\Gmatch),#';
$subject = "match,match,match,match,not-match,match";

preg_match_all( $pattern, $subject, $matches );

// Outputs "match" only 4 times because the chain is broken at "not-match"
foreach ( $matches[1] as $match ) {
    echo $match . '<br />';
}
?>
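
For comparison, a rough Ruby equivalent of the PHP snippet, assuming String#scan as the counterpart of preg_match_all (the variable names are illustrative):

```ruby
subject = "match,match,match,match,not-match,match"

# Without \G the search may skip ahead, so the "match," inside
# "not-match," is found as well.
unanchored = subject.scan(/match,/)

# With \G every match must start where the previous one ended,
# so the chain stops at "not-match".
chained = subject.scan(/\Gmatch,/)

puts unanchored.length  # => 5
puts chained.length     # => 4
```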

This is straight from the docs

The fourth use of backslash is for certain simple assertions. An
assertion specifies a condition that has to be met at a particular
point in a match, without consuming any characters from the subject
string. The use of subpatterns for more complicated assertions is
described below. The backslashed assertions are

 \G    first matching position in subject

The \G assertion is true only when the current matching position is at
the start point of the match, as specified by the offset argument of
preg_match(). It differs from \A when the value of offset is non-zero.

http://www.php.net/manual/en/regexp.reference.escape.php

You will have to scroll down that page a bit but there it is.

There is a really good example in Ruby, and \G behaves the same way in PHP.

How the Anchor \z and \G works in Ruby?

default value are not properly reflected for the same keys from 2 different but closely related ruby code

  • Hash#default= sets the value to be returned in case there is no such key. You set that to a proc in Part-I, and that is what you see being returned.
  • Hash#default_proc= sets the proc to be called with the hash itself and the key in case there is no such key. If the proc does hash[key] = key + key, then hash[2] will return 4 (2 + 2) and hash["cat"] will return "catcat" ("cat" + "cat").
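
A small sketch of the two setters side by side. The proc body (key + key) is an assumption inferred from the 2 + 2 and "cat" + "cat" examples, not taken from the original question:

```ruby
# Assumed proc: stores and returns key + key for a missing key.
doubler = proc { |hash, key| hash[key] = key + key }

with_default = {}
with_default.default = doubler            # the proc is stored as a plain value
p with_default[2].equal?(doubler)         # => true

with_default_proc = {}
with_default_proc.default_proc = doubler  # the proc is called on a missing key
p with_default_proc[2]                    # => 4
p with_default_proc["cat"]                # => "catcat"
```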

When is \G useful application in a regex?

\G is an anchor; it indicates where the match is forced to start. When \G is present, the regex can't start matching at some arbitrary later point in the string; when \G is absent, it can.

It is most useful in parsing a string into discrete parts, where you don't want to skip past other stuff. For instance:

my $string = " a 1 # ";
while () {
    if ( $string =~ /\G\s+/gc ) {
        print "whitespace\n";
    }
    elsif ( $string =~ /\G[0-9]+/gc ) {
        print "integer\n";
    }
    elsif ( $string =~ /\G\w+/gc ) {
        print "word\n";
    }
    else {
        print "done\n";
        last;
    }
}

Output with \G's:

whitespace
word
whitespace
integer
whitespace
done

without:

whitespace
whitespace
whitespace
whitespace
done

Note that I am demonstrating using scalar-context /g matching, but \G applies equally to list context /g matching and in fact the above code is trivially modifiable to use that:

my $string = " a 1 # ";
my @matches = $string =~ /\G(?:(\s+)|([0-9]+)|(\w+))/g;
while ( my ($whitespace, $integer, $word) = splice @matches, 0, 3 ) {
    if ( defined $whitespace ) {
        print "whitespace\n";
    }
    elsif ( defined $integer ) {
        print "integer\n";
    }
    elsif ( defined $word ) {
        print "word\n";
    }
}
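
The list-context version translates fairly directly to Ruby's String#scan. This is a sketch only; token_types is a hypothetical helper name:

```ruby
# Hypothetical helper: label each token of the input in order.
def token_types(input)
  # Each scan attempt must start where the previous match ended (\G),
  # so an unrecognized character such as "#" stops the scan.
  input.scan(/\G(?:(\s+)|([0-9]+)|(\w+))/).map do |whitespace, integer, _word|
    if whitespace then :whitespace
    elsif integer then :integer
    else :word
    end
  end
end

p token_types(" a 1 # ")
# => [:whitespace, :word, :whitespace, :integer, :whitespace]
```

As in the Perl code, the integer alternative comes before \w+ because \w also matches digits.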

How to validate the format of a string in Ruby, while extracting the matches?

Yes, you may use

s.scan(/(?:\G(?!\A)|\A(?=(?:#\d\s*)*\z))\s*\K#\d/)

See the regex demo

Details

  • (?:\G(?!\A)|\A(?=(?:#\d\s*)*\z)) - two alternatives:

    • \G(?!\A) - the end of the previous successful match
    • | - or
    • \A(?=(?:#\d\s*)*\z) - start of string (\A) that is followed with 0 or more repetitions of # + digit + 0+ whitespaces and then followed with the end of string
  • \s* - 0+ whitespace chars
  • \K - match reset operator discarding the text matched so far
  • #\d - a # char and then a digit

In short: the start-of-string position is matched first, but only if the string to the right (i.e. the whole string) matches the format you want. Since that check is performed with a lookahead, the regex index stays where it was. After that, each match can begin ONLY where the previous successful match ended, thanks to the \G operator (it matches the start of the string or the end of the previous match, so (?!\A) is used to exclude the start-of-string position).

Ruby demo:

rx = /(?:\G(?!\A)|\A(?=(?:#\d\s*)*\z))\s*\K#\d/
p "#1 #2".scan(rx)
# => ["#1", "#2"]
p "#1 NO #2".scan(rx)
# => []

How to split a long regular expression into multiple lines in JavaScript?

[Edit 2022/08] Created a small GitHub repository to create regular expressions with spaces, comments and templating.


You could convert it to a string and create the expression by calling new RegExp():

// Every backslash of the regex literal is doubled, including those in \] and \-
var myRE = new RegExp([
  '^(([^<>()[\\]\\\\.,;:\\s@\\"]+(\\.[^<>()[\\]\\\\.,;:\\s@\\"]+)*)',
  '|(\\".+\\"))@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.',
  '[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+',
  '[a-zA-Z]{2,}))$'
].join(''));

Notes:

  1. when converting the expression literal to a string you need to escape all backslashes as backslashes are consumed when evaluating a string literal. (See Kayo's comment for more detail.)

  2. RegExp accepts modifiers as a second parameter

    /regex/g => new RegExp('regex', 'g')

[Addition ES2015+ (tagged template)]

In ES2015 and later you can use tagged templates. See the snippet.

Note:

  • The disadvantage here is that you can't use plain whitespace in the regular expression string, because the helper strips all whitespace (always use \s, \s+, \s{1,x}, \t, \n etc).

(() => {
  const createRegExp = (str, opts) =>
    new RegExp(str.raw[0].replace(/\s/gm, ""), opts || "");
  const yourRE = createRegExp`
    ^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|
    (\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|
    (([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$`;
  console.log(yourRE);
  const anotherLongRE = createRegExp`
    (\byyyy\b)|(\bm\b)|(\bd\b)|(\bh\b)|(\bmi\b)|(\bs\b)|(\bms\b)|
    (\bwd\b)|(\bmm\b)|(\bdd\b)|(\bhh\b)|(\bMI\b)|(\bS\b)|(\bMS\b)|
    (\bM\b)|(\bMM\b)|(\bdow\b)|(\bDOW\b)
    ${"gi"}`;
  console.log(anotherLongRE);
})();

