What Is the Use of '\G' Anchor in Regex

UPDATE

\G forces the pattern to return only matches that form a continuous chain. The first match must begin at the starting position of the subject, and each subsequent match must begin exactly where the previous one ended. As soon as the chain is broken, matching stops.

<?php
$pattern = '#(match),#';
$subject = "match,match,match,match,not-match,match";

preg_match_all( $pattern, $subject, $matches );

// Outputs "match" 5 times: the four leading "match," items plus the "match," found inside "not-match,"
foreach ( $matches[1] as $match ) {
    echo $match . '<br />';
}

echo '<br />';

$pattern = '#(\Gmatch),#';
$subject = "match,match,match,match,not-match,match";

preg_match_all( $pattern, $subject, $matches );

// Outputs "match" only 4 times, because the chain is broken at "not-match"
foreach ( $matches[1] as $match ) {
    echo $match . '<br />';
}
?>
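For comparison, here is a rough Python sketch of the same experiment. Python's built-in re module has no \G, so the chain is emulated by calling Pattern.match at the position where the previous match ended. The helper name chained_findall is made up for this illustration.

```python
import re

subject = "match,match,match,match,not-match,match"
pattern = re.compile(r"(match),")

# Unanchored scan, comparable to preg_match_all() without \G.
unanchored = pattern.findall(subject)

def chained_findall(pattern, text):
    """Collect group 1 of matches that each start where the previous one ended."""
    results, pos = [], 0
    while True:
        m = pattern.match(text, pos)  # must match exactly at pos
        if m is None or m.end() == pos:
            break
        results.append(m.group(1))
        pos = m.end()
    return results

chained = chained_findall(pattern, subject)
print(len(unanchored), len(chained))
```

The unanchored scan also picks up the "match," hidden inside "not-match,", while the chained scan stops as soon as a match cannot start at the current position.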

This is straight from the docs

The fourth use of backslash is for certain simple assertions. An
assertion specifies a condition that has to be met at a particular
point in a match, without consuming any characters from the subject
string. The use of subpatterns for more complicated assertions is
described below. The backslashed assertions are

 \G
first matching position in subject

The \G assertion is true only when the current matching position is at
the start point of the match, as specified by the offset argument of
preg_match(). It differs from \A when the value of offset is non-zero.

http://www.php.net/manual/en/regexp.reference.escape.php

You will have to scroll down that page a bit but there it is.
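The difference between \A and an offset-relative anchor can be illustrated in Python, with the caveat that Python's re module has no \G. In this sketch, the pos argument of Pattern.match plays the role of the offset argument: the match has to begin exactly at pos, while \A keeps referring to the real start of the string. The sample text is invented.

```python
import re

text = "xxxabc"

# \A always refers to the real start of the string, even when pos is given.
anchored_at_string_start = re.compile(r"\Aabc").match(text, 3)

# Pattern.match with pos succeeds only if the match begins exactly at pos,
# which is comparable to \G combined with a non-zero offset.
anchored_at_offset = re.compile(r"abc").match(text, 3)

print(anchored_at_string_start)
print(anchored_at_offset.span())
```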

There is a really good example in Ruby, and \G works the same way in PHP.

How the Anchor \z and \G works in Ruby?

Purpose of the \G anchor in regular expressions

\G is an anchor that matches at the position where the previous match ended.

On the first pass, \G is equivalent to \A, which is the start of the string anchor. Since \d\d\A will never match anything (because how can you have two digits before the start of the string?), \d\d\G will also never match anything.
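The \A half of that argument is easy to check in Python (which has no \G of its own). This is only a small sketch with an invented sample string:

```python
import re

# Two digits can never sit *before* the start of the string.
never_matches = re.search(r"\d\d\A", "1234")

# With the anchor in front, the same tokens match normally.
normal = re.search(r"\A\d\d", "1234")

print(never_matches)
print(normal.group())
```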

When is \G useful application in a regex?

\G is an anchor; it indicates where the match is forced to start. When \G is present, it can't start matching at some arbitrary later point in the string; when \G is absent, it can.

It is most useful in parsing a string into discrete parts, where you don't want to skip past other stuff. For instance:

my $string = " a 1 # ";
while () {
    if ( $string =~ /\G\s+/gc ) {
        print "whitespace\n";
    }
    elsif ( $string =~ /\G[0-9]+/gc ) {
        print "integer\n";
    }
    elsif ( $string =~ /\G\w+/gc ) {
        print "word\n";
    }
    else {
        print "done\n";
        last;
    }
}

Output with \G's:

whitespace
word
whitespace
integer
whitespace
done

without:

whitespace
whitespace
whitespace
whitespace
done

Note that I am demonstrating using scalar-context /g matching, but \G applies equally to list context /g matching and in fact the above code is trivially modifiable to use that:

my $string = " a 1 # ";
my @matches = $string =~ /\G(?:(\s+)|([0-9]+)|(\w+))/g;
while ( my ($whitespace, $integer, $word) = splice @matches, 0, 3 ) {
    if ( defined $whitespace ) {
        print "whitespace\n";
    }
    elsif ( defined $integer ) {
        print "integer\n";
    }
    elsif ( defined $word ) {
        print "word\n";
    }
}
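The same tokenizing idea can be sketched in Python. Since Python's re module lacks \G, the sketch below assumes that Pattern.match(text, pos) is an acceptable stand-in, because it only matches starting exactly at pos. The names TOKEN_SPECS and tokenize are hypothetical.

```python
import re

# Order matters: integers are tried before generic words, as in the Perl code.
TOKEN_SPECS = [
    ("whitespace", re.compile(r"\s+")),
    ("integer", re.compile(r"[0-9]+")),
    ("word", re.compile(r"\w+")),
]

def tokenize(text):
    """Label consecutive tokens; stop at the first position nothing matches."""
    pos = 0
    labels = []
    while pos < len(text):
        for label, pattern in TOKEN_SPECS:
            m = pattern.match(text, pos)  # anchored at pos, like \G
            if m:
                labels.append(label)
                pos = m.end()
                break
        else:
            break  # no token type matched at this position
    labels.append("done")
    return labels

print(tokenize(" a 1 # "))
```

For the input " a 1 # " this yields the same sequence as the \G version of the Perl loop: the scan stops at "#" instead of skipping over it.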

Anchor to End of Last Match

You can use the following regex with re.search:

,?\s*([^',]*(?:'[^']*'[^',]*)*)

See regex demo (I changed it to ,?[ ]*([^',\n]*(?:'[^'\n]*'[^',\n]*)*) since it is a multiline demo)

Here, the regex matches (in a regex meaning of the word)...

  • ,? - 1 or 0 comma
  • \s* - 0 or more whitespace
  • ([^',]*(?:'[^']*'[^',]*)*) - Group 1 storing a captured text that consists of...

    • [^',]* - 0 or more characters other than , and '
    • (?:'[^']*'[^',]*)* - 0 or more sequences of ...

      • '[^']*' - a 'string'-like substring containing no apostrophes
      • [^',]* - 0 or more characters other than , and '.
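As a quick check of the breakdown above, the pattern can be run with re.findall, which returns the Group 1 text of every match. This is only a sketch: the sample string is the one used later in this answer, and empty captures are filtered out because the pattern can also match the empty string.

```python
import re

pattern = re.compile(r",?\s*([^',]*(?:'[^']*'[^',]*)*)")
text = "21, 2, '23.5R25 ETADT', 'description, with a comma'"

# findall returns the Group 1 capture for each match; drop empty captures.
fields = [field for field in pattern.findall(text) if field]

for field in fields:
    print(field)
```

The comma inside the quoted description stays part of one field because the '[^']*' branch consumes the whole quoted substring.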

If you want to use re.match and store the captured texts inside capturing groups, that is not possible, since the Python regex engine does not store all the captures in a stack as the .NET regex engine does with CaptureCollection.

Also, Python's re module does not support the \G anchor, so you cannot anchor a subpattern at the end of the previous successful match here.

As an alternative/workaround, you can use the following Python code to return successive matches and then the rest of the string:

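import re

def successive_matches(pattern, text, pos=0):
    """Yield matches that each start where the previous one ended, then the unmatched rest."""
    ptrn = re.compile(pattern)
    match = ptrn.match(text, pos)
    while match:
        yield match.group()
        if match.end() == pos:
            break  # empty match: stop to avoid looping forever
        pos = match.end()
        match = ptrn.match(text, pos)
    # Yield whatever is left once the chain of matches is broken
    if pos < len(text):
        yield text[pos:]

for matched_text in successive_matches(r"('[^']*'|[^',]*),\s*", "21, 2, '23.5R25 ETADT', 'description, with a comma'"):
    print(matched_text)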

See IDEONE demo, the output is

21, 
2,
'23.5R25 ETADT',
'description, with a comma'

Anchors in .NET regular expressions

The word boundary \b matches between non-word and word characters, and also at the start of the string if the first character is a word character, and at the end if the last character is a word character.

Thus, \A\b[0-9a-fA-F]+\b\Z is equivalent to \A[0-9a-fA-F]+\Z, because every character in the string must be a word character ([0-9] digits or [a-fA-F] letters) for the pattern to match.

It would be a different story in this case: \A\b[0-9a-fA-F-]+\b\Z that would only match strings with word characters at the beginning and end.

Use \z to match a whole string, with no \n allowed at the end.
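The redundancy of \b in the first pattern can be checked with a small Python sketch. Note the assumption: this uses Python's re module rather than .NET, and in Python \Z matches only at the very end of the string, so it is stricter than .NET's \Z. That does not affect the comparison below. The sample strings are invented.

```python
import re

with_b = re.compile(r"\A\b[0-9a-fA-F]+\b\Z")
without_b = re.compile(r"\A[0-9a-fA-F]+\Z")
samples = ["1aF", "", "xyz", "12 34", "-ab", "ab-"]

# Both patterns accept and reject exactly the same samples.
same_verdicts = all(
    bool(with_b.search(s)) == bool(without_b.search(s)) for s in samples
)

# Once "-" is allowed in the class, \b starts to matter at both ends.
hyphen = re.compile(r"\A\b[0-9a-fA-F-]+\b\Z")
verdicts = {s: bool(hyphen.search(s)) for s in ["a-b", "-ab", "ab-"]}

print(same_verdicts, verdicts)
```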

Regular expression to match a line that doesn't contain a word

The notion that regex doesn't support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:

^((?!hede).)*$

Non-capturing variant:

^(?:(?!hede).)*$

The regex above will match any string, or line without a line break, not containing the (sub)string 'hede'. As mentioned, this is not something regex is "good" at (or should do), but still, it is possible.

And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):

/^((?!hede).)*$/s

or use it inline:

/(?s)^((?!hede).)*$/

(where the /.../ are the regex delimiters, i.e., not part of the pattern)

If the DOT-ALL modifier is not available, you can mimic the same behavior with the character class [\s\S]:

/^((?!hede)[\s\S])*$/

Explanation

A string is just a list of n characters. Before, and after each character, there's an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":

    ┌──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┐
S = │e1│ A │e2│ B │e3│ h │e4│ e │e5│ d │e6│ e │e7│ C │e8│ D │e9│
    └──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┘

index    0      1      2      3      4      5      6      7

where the e's are the empty strings. The regex (?!hede). looks ahead to see if there's no substring "hede" to be seen, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width-assertions because they don't consume any characters. They only assert/validate something.

So, in my example, every empty string is first validated to see if there's no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group, and repeated zero or more times: ((?!hede).)*. Finally, the start- and end-of-input are anchored to make sure the entire input is consumed: ^((?!hede).)*$

As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).
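A short Python sketch of the patterns discussed above, using invented sample strings. re.DOTALL is Python's name for the DOT-ALL modifier.

```python
import re

no_hede = re.compile(r"^((?!hede).)*$")
no_hede_dotall = re.compile(r"^((?!hede).)*$", re.DOTALL)

# The lookahead fails at the position just before "hede", so there is no match.
rejected = no_hede.search("ABhedeCD")

# Nothing to reject here, so the whole string is consumed.
accepted = no_hede.search("ABCD")

# Without DOT-ALL the dot cannot cross the line break in the middle.
single_line_only = no_hede.search("AB\nCD")
multi_line = no_hede_dotall.search("AB\nCD")
multi_line_rejected = no_hede_dotall.search("AB\nhede")

print(rejected, accepted.group(), single_line_only, multi_line.group() == "AB\nCD")
```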

line anchor behavior with perl regex

You see this behavior because, in the example file, the second line ends with a \n character, and \n is whitespace that \s matches.


perlretut

no modifiers: Default behavior. ... '$' matches only at the end or before a newline at the end.

In your regex, \s matches a whitespace character, i.e. the set [\ \t\v\r\n\f]. In other words, it matches both spaces and the \n character. Then $ matches the end of the line (no characters, just the position itself), just as the word anchor \b matches a word boundary and ^ matches the beginning of the line rather than its first character.

You could rewrite your regex like this:

/[\t ]+$/
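The effect is easy to reproduce in Python, whose default $ behaves like Perl's here (it matches at the end of the string or just before a newline at the end). The two sample lines are assumed from the example file described in this answer.

```python
import re

with_space = "I have space after me \n"
without_space = "I do not\n"

# \s also matches the trailing newline, so this "finds" whitespace on both lines.
loose = [bool(re.search(r"\s+$", line)) for line in (with_space, without_space)]

# Restricting the class to tabs and spaces only reports real trailing blanks.
strict = [bool(re.search(r"[\t ]+$", line)) for line in (with_space, without_space)]

print(loose, strict)
```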

The content of example would look like this if second line didn't end with a \n character:

£ cat example
I have space after me
I do not£

NOTICE that shell prompt £ is not on next line


The results are different because grep abstracts out line endings like Perl's -l flag. (grep -P '\n' will return no results on a text file where grep -Pz '\n' will.)

How the Anchor \z and \G works in Ruby?

\z matches the end of the input. You are trying to find a match where 4 occurs at the end of the input. Problem is, there is a newline at the end of the input, so you don't find a match. \Z matches either the end of the input or a newline at the end of the input.

So:

/\d\z/

matches the "4" in:

"24"

and:

/\d\Z/

matches the "4" in the above example and the "4" in:

"24\n"

Check out this question for an example of using \G:

Examples of regex matcher \G (The end of the previous match) in Java would be nice
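Returning to the \z and \Z behaviour described above: it can be reproduced in Python, but the names differ, so treat this as a sketch rather than a direct translation. In Python's re module, \Z matches only at the very end of the string (like Ruby's \z), while a plain $ without the MULTILINE flag also matches just before a final newline (like Ruby's \Z).

```python
import re

# Strict end of input: Python's \Z, comparable to Ruby's \z.
strict_plain = re.search(r"\d\Z", "24")
strict_newline = re.search(r"\d\Z", "24\n")

# End of input or before a final newline: Python's $, comparable to Ruby's \Z.
lenient_plain = re.search(r"\d$", "24")
lenient_newline = re.search(r"\d$", "24\n")

print(strict_plain.group(), strict_newline, lenient_plain.group(), lenient_newline.group())
```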


UPDATE: Real-World uses for \G

I came up with a more real-world example. Say you have a list of words that are separated by arbitrary characters that cannot be well predicted (or there are too many possibilities to list). You'd like to match these words, with each word as its own match, up until a particular word, after which you don't want to match any more words. For example:

foo,bar.baz:buz'fuzz*hoo-har/haz|fil^bil!bak

You want to match each word until 'har'. You don't want to match 'har' or any of the words that follow. You can do this relatively easily using the following pattern:

/(?<=^|\G\W)\w+\b(?<!har)/

The first attempt will match the beginning of the input followed by zero non-word characters, then 3 word characters ('foo'), then a word boundary. Finally, a negative lookbehind ensures that the word which has just been matched is not 'har'.

On the second attempt, matching picks back up at the end of the last match. 1 non-word character is matched (',' - though it is not captured due to the lookbehind, which is a zero-width assertion), followed by 3 characters ('bar').

This continues until 'har' is matched, at which point the negative lookbehind is triggered and the match fails. Because all matches are supposed to be "attached" to the last successful match, no additional words will be matched.

The result is:

foo
bar
baz
buz
fuzz
hoo
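In an engine without \G, such as Python's re module, the same chain can be approximated with explicit positions. The sketch below is an emulation, not a translation of the Ruby pattern: the helper name words_until is hypothetical, and it compares the whole word with the stop word, whereas the lookbehind (?<!har) rejects any word ending in 'har'.

```python
import re

def words_until(text, stop_word):
    """Collect words separated by single non-word characters, up to stop_word."""
    word = re.compile(r"\w+")
    separator = re.compile(r"\W")
    pos, found = 0, []
    while True:
        m = word.match(text, pos)  # a word must start exactly at pos
        if m is None or m.group() == stop_word:
            break
        found.append(m.group())
        sep = separator.match(text, m.end())
        if sep is None:
            break  # end of input or nothing left to separate
        pos = sep.end()
    return found

text = "foo,bar.baz:buz'fuzz*hoo-har/haz|fil^bil!bak"
print(words_until(text, "har"))
```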

If you want to reverse it and have all words after 'har' (but, again, not including 'har'), you can use an expression like this:

/(?!^)(?<=har\W|\G\W)\w+\b/

This will match a word that is immediately preceded by either 'har' or the end of the last match (except we have to make sure not to match the beginning of the input). The list of matches is:

haz
fil
bil
bak

If you do want to match 'har' and all following words, you could use this:

/\bhar\b|(?!^)(?<=\G\W)\w+\b/

This produces the following matches:

har
haz
fil
bil
bak
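Both "after 'har'" variants can be approximated in Python in a similar way. Again this is a hedged emulation using explicit positions instead of \G: words_from and its include_start flag are invented names, and the start word is located with \bhar\b, which is slightly stricter than the lookbehind har\W used in the Ruby pattern.

```python
import re

def words_from(text, start_word, include_start=True):
    """Collect the words that follow start_word, each attached to the previous one."""
    start = re.search(r"\b" + re.escape(start_word) + r"\b", text)
    if start is None:
        return []
    found = [start.group()] if include_start else []
    step = re.compile(r"\W(\w+)")  # one separator, then the next word
    pos = start.end()
    while True:
        m = step.match(text, pos)  # must continue exactly at pos
        if m is None:
            break
        found.append(m.group(1))
        pos = m.end()
    return found

text = "foo,bar.baz:buz'fuzz*hoo-har/haz|fil^bil!bak"
print(words_from(text, "har", include_start=False))
print(words_from(text, "har"))
```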

VBA RegEx Continuous Matching

You are using VBScript_RegExp_55.RegExp, not the .NET regex engine. It does not support the \G anchor, because the VBA regex engine is ECMA-5 standard compliant.

To only match 2 or more names separated with whitespace, use

\d{1,2}-[A-Z]\.[A-Z][a-z]*(?:\s+\d{1,2}-[A-Z]\.[A-Z][a-z]*)+

Basically, it is <SINGLE_NAME_PATTERN>(?:\s+<SINGLE_NAME_PATTERN>)+, where (?:...) is a non-capturing group that is used to group subpatterns, \s+ (one or more whitespaces) and the name subpattern, and the whole group is matched 1 or more times (thanks to the + quantifier at the end).

See the regex demo. Perhaps, it is also a good idea to add word boundaries \b at the start and end of the regex pattern.

If you need to get them as separate entities, just split the match.
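The "match the whole run, then split it" approach can be sketched as follows. Python is used here only for illustration (the question itself is about VBA), and the sample text is invented to fit the name pattern.

```python
import re

NAME = r"\d{1,2}-[A-Z]\.[A-Z][a-z]*"
# <SINGLE_NAME_PATTERN>(?:\s+<SINGLE_NAME_PATTERN>)+
RUN_OF_NAMES = re.compile(NAME + r"(?:\s+" + NAME + r")+")

text = "Team: 1-A.Smith 12-B.Jones  3-C.Lee, done"  # hypothetical input

run = RUN_OF_NAMES.search(text)
# Splitting on whitespace turns the single match into separate names.
names = run.group().split() if run else []

# A lone name is not matched, because the group must repeat at least once.
lone = RUN_OF_NAMES.search("Only 1-A.Smith here")

print(names, lone)
```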

regular expression for finding 'href' value of a <a> link

I'd recommend using an HTML parser over a regex, but still, here's a regex that creates a capturing group around the value of the href attribute of each link. It will match whether double or single quotes are used.

<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1

You can view a full explanation of this regex here.
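To compare the two approaches mentioned above, here is a small Python sketch that extracts href values once with the regex and once with the standard library's html.parser. The class name HrefCollector and the sample markup are made up for this illustration.

```python
import re
from html.parser import HTMLParser

LINK_RX = re.compile(r"""<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1""")

class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href")

html = """<a href="google.com">one</a> <a class="c" href='example.org'>two</a>"""

# Group 1 is the quote character, group 2 is the href value.
regex_hrefs = [m.group(2) for m in LINK_RX.finditer(html)]

collector = HrefCollector()
collector.feed(html)

print(regex_hrefs, collector.hrefs)
```

On this simple input both approaches agree; the parser is the safer choice once the markup gets less regular.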

Snippet playground:

const linkRx = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1/;
const textToMatchInput = document.querySelector('[name=textToMatch]');

document.querySelector('button').addEventListener('click', () => {
  console.log(textToMatchInput.value.match(linkRx));
});

<label>
  Text to match:
  <input type="text" name="textToMatch" value='<a href="google.com"'>
  <button>Match</button>
</label>

