Why Isn't the Regular Expression's "Non-Capturing" Group Working

Why isn't the regular expression's non-capturing group working?

group() and group(0) will return the entire match. Subsequent groups are actual capture groups.

>>> import re
>>> string1 = "aaa_bbb"  # assumed input, consistent with the output below
>>> print(re.match(r"(?:aaa)(_bbb)", string1).group(0))
aaa_bbb
>>> print(re.match(r"(?:aaa)(_bbb)", string1).group(1))
_bbb
>>> print(re.match(r"(?:aaa)(_bbb)", string1).group(2))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group

If you want a single string, as group() gives you, but built only from the capturing groups:

" ".join(re.match(r"(?:aaa)(_bbb)", string1).groups())

Non-capture group still showing in match

The entire match is always group 0. You need to access the specific group you want (group 1 in this case, since the first group is non-capturing). You can do it like this:

var str = "<p model='cat'></p>";
var regex = /(?:model=')(.*)(?:')/g;
var match = regex.exec(str);
alert(match[1]); // cat

When do we need non-capturing groups?

Non-capturing groups help keep unwanted data out of your capturing groups.

For instance, suppose your input looks like this:

abc and bcd
def or cef

Here you want to capture the first and third columns, which are separated by the word "and" or the word "or". So you write the regex as follows:

(\w+)\s+(and|or)\s+(\w+) 

Here $1 contains the first column:

abc
def

and $3 contains the third column:

bcd
cef

The unnecessary data, "and" or "or", is stored in $2. Since you don't want to store it, use a non-capturing group:

(\w+)\s+(?:and|or)\s+(\w+) 

Here $1 contains

abc
def

and $2 contains

bcd
cef

You can also capture just the data you need from inside a non-capturing group.

For example

(?:don't (want))

Now $1 contains the text "want".

A non-capturing group also lets you use alternation (|) inside a group without creating an extra capture. For example

(?:don't(want)|some(what))

In the above example, $1 contains "want" when the first alternative matches, and $2 contains "what" when the second alternative matches.
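
As a rough illustration in Python, here is a minimal sketch of the two patterns above run against the sample lines. The variable names are made up for this example, and the inputs "don'twant" and "somewhat" are assumed test strings.

```python
import re

# Sample lines taken from the example above.
lines = ["abc and bcd", "def or cef"]

capturing = re.compile(r"(\w+)\s+(and|or)\s+(\w+)")
non_capturing = re.compile(r"(\w+)\s+(?:and|or)\s+(\w+)")

# With a capturing group, the connective lands in group 2.
with_connective = [capturing.match(line).groups() for line in lines]

# With (?:...), only the two columns are captured.
columns_only = [non_capturing.match(line).groups() for line in lines]

print(with_connective)  # [('abc', 'and', 'bcd'), ('def', 'or', 'cef')]
print(columns_only)     # [('abc', 'bcd'), ('def', 'cef')]

# Alternation inside a non-capturing group: only the branch that
# actually matched fills its capturing group; the other stays None.
alt = re.compile(r"(?:don't(want)|some(what))")
first_branch = alt.search("don'twant").groups()
second_branch = alt.search("somewhat").groups()

print(first_branch)   # ('want', None)
print(second_branch)  # (None, 'what')
```

Python exposes the groups as a tuple rather than as $1 and $2, but the numbering works the same way: non-capturing groups are skipped when groups are counted.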

Regex/Python - why is non capturing group captured in this case?

You need to use a look-behind instead of a non-capturing group if you want to check a substring for presence/absence, but exclude it from the match:

import re
s = "Monday, Tuesday, Wednesday, Thursday, Friday, Saturday:"
print(re.sub(r"[\r\n\t]|(?<!\d):",'',s))
#                       ^^^^^^^
# Result: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday

Here, (?<!\d) only checks that the character immediately before the colon is not a digit.

Also, alternation involves additional overhead, so a character class such as [\r\n\t] is preferable to alternating single characters. You do not need any capturing groups (round brackets) either, since you are not using them at all.

Also, please note that the regex is initialized with a raw string literal to avoid overescaping.

Some more details from Python Regular Expression Syntax regarding non-capturing groups and negative look-behinds:

(?<!...)

- Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length and shouldn’t contain group references. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.

(?:...)

- A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

As look-behinds are zero-width assertions (=expressions returning true or false without moving the index any further in the string), they are exactly what you need in this case where you want to check but not match. A non-capturing group will consume part of the string and thus will be part of the match.
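
To make the difference concrete, here is a small sketch comparing the two approaches. The sample string is an assumption chosen so that it contains one colon after a digit and one colon after a letter.

```python
import re

s = "Time 12:30, Saturday:"  # assumed sample input

# Non-capturing group: the non-digit before the colon is consumed,
# so it is removed together with the colon.
consumed = re.sub(r"(?:\D):", "", s)

# Negative lookbehind: only the colon is matched; the preceding
# character is checked but stays in the string.
checked = re.sub(r"(?<!\d):", "", s)

print(consumed)  # Time 12:30, Saturda
print(checked)   # Time 12:30, Saturday
```

In both cases the colon in 12:30 survives, but only the lookbehind version leaves the "y" of "Saturday" intact.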

What does a non-capturing group inside a lookahead do?

It does the same thing it does outside of a lookahead.

Consider the following regex:

(\d+)(?=(b|c))

And searching the string '123c'

For example, in Python:

import re

m = re.search(r'(\d+)(?=(b|c))', '123c')
print(m.group(1), m.group(2))

Prints:

123 c

But with ...

(\d+)(?=(?:b|c))

... there is only capture group 1.
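
A quick way to check this in Python is to compare the number of groups each compiled pattern reports. This is only a sketch; the variable names are illustrative.

```python
import re

with_capture = re.compile(r"(\d+)(?=(b|c))")
without_capture = re.compile(r"(\d+)(?=(?:b|c))")

# Pattern.groups reports how many capturing groups a pattern defines.
print(with_capture.groups)     # 2
print(without_capture.groups)  # 1

m1 = with_capture.search("123c")
m2 = without_capture.search("123c")

print(m1.groups())  # ('123', 'c')
print(m2.groups())  # ('123',)

# In both cases the lookahead consumes nothing, so the overall
# match (group 0) is just the digits.
print(m1.group(0), m2.group(0))  # 123 123
```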

Regex including what is supposed to be non-capturing group in result

A (?:...) is a non-capturing group that matches and still consumes the text. It means the part of text this group matches is still added to the overall match value.

In general, if you want to match something but not consume it, you need to use lookarounds. So, if you need to match something that is followed by a specific string, use a positive lookahead, the (?=...) construct:

some_pattern(?=specific string) // if specific string comes immediately after pattern
some_pattern(?=.*specific string) // if specific string comes anywhere after pattern

If you need to match but "exclude from match" some specific text before, use a positive lookbehind:

(?<=specific string)some_pattern // if specific string comes immediately before pattern
(?<=specific string.*?)some_pattern // if specific string comes anywhere before pattern

Note that .*? or .* (that is, patterns with *, +, ?, {2,} or even {1,3} quantifiers) inside lookbehind patterns are not supported by every regex engine. Luckily, the C# .NET regex engine supports them. They are also supported by the PyPI regex module for Python, Vim, JGSoft software, and ECMAScript 2018 compliant JavaScript environments.
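
For comparison, here is a hedged sketch of the same lookaround ideas using Python's standard-library re module, which only accepts fixed-width lookbehinds. The sample text and variable names are assumptions for illustration.

```python
import re

text = "price: 42 USD, weight: 7 kg"  # assumed sample input

# Positive lookahead: digits immediately followed by " USD".
# The unit is checked but is not part of the match.
before_usd = re.findall(r"\d+(?= USD)", text)
print(before_usd)  # ['42']

# Positive lookbehind: digits immediately preceded by "weight: ".
after_weight = re.findall(r"(?<=weight: )\d+", text)
print(after_weight)  # ['7']

# The standard-library re module rejects variable-width lookbehinds
# at compile time, so this raises re.error.
try:
    re.compile(r"(?<=weight:.*?)\d+")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(variable_width_ok)  # False
```

When a variable-width lookbehind is not available, the usual workaround is the one described next: match the context normally and capture only the part you need.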

In this case, you may capture what you need to get and just match the context without capturing:

var testEcl = "\"D:\\src\\repos\\myprj\\bin\\Debug\\MyApp.exe\" /?";
var asmName = string.Empty;
var m = Regex.Match(testEcl, @"([^\\]+)\.exe", RegexOptions.IgnoreCase);
if (m.Success)
{
    asmName = m.Groups[1].Value;
}
Console.WriteLine(asmName);

Details

  • ([^\\]+) - Capturing group 1: one or more chars other than \
  • \. - a literal dot
  • exe - a literal exe substring.

Since we are only interested in capturing group #1 contents, we grab m.Groups[1].Value, and not the whole m.Value (that contains .exe).
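
The same idea carries over to other engines. As a sketch, here is a Python version of the C# snippet above; the snake_case variable names are adaptations, and the lookahead variant at the end is an added alternative rather than part of the original answer.

```python
import re

# Same input as the C# example: a quoted path followed by " /?".
test_ecl = '"D:\\src\\repos\\myprj\\bin\\Debug\\MyApp.exe" /?'

m = re.search(r"([^\\]+)\.exe", test_ecl, re.IGNORECASE)
asm_name = m.group(1) if m else ""
whole_match = m.group(0) if m else ""

print(asm_name)     # MyApp
print(whole_match)  # MyApp.exe

# Alternative: a lookahead keeps ".exe" out of the overall match,
# so no capturing group is needed.
m2 = re.search(r"[^\\]+(?=\.exe)", test_ecl, re.IGNORECASE)
lookahead_name = m2.group(0) if m2 else ""
print(lookahead_name)  # MyApp
```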

Non-capturing group gets displayed in C#

The code is ignoring the capturing groups.

string line = @"DCS120170517220207-FIC-023.FLW  07-FIC-023    00060Y000000011.266525G";
string patDate = @"(?:^.{4})([2-9][0-9]{3}[0-1][0-9][0-3][0-9])";

Match m = Regex.Match(line, patDate);

foreach (Group g in m.Groups)
{
    Console.WriteLine($"{g.Index}: {g.Value}");
}

m.Value is group zero -- the entire match, irrespective of groupings. Since you wisely marked the first group as non-capturing, group 1 is the date.

I suggest naming your capturing groups, for ease of maintenance:

string line = @"DCS120170517220207-FIC-023.FLW  07-FIC-023    00060Y000000011.266525G";
string patDate = @"(?:^.{4})(?<date>[2-9][0-9]{3}[0-1][0-9][0-3][0-9])";

Match m = Regex.Match(line, patDate);

var date = m.Groups["date"].Value;

Update

Wiktor Stribiżew observes that the non-capturing group is otiose. The following pattern will behave identically to your original pattern. The first capturing group is still m.Groups[1], however, because m.Groups[0] is always the entire match, irrespective of groups.

string patDate = @"^.{4}(?<date>[2-9][0-9]{3}[0-1][0-9][0-3][0-9])";
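
For readers working in Python, a comparable sketch with a named group looks like this. Note that Python's re module spells named groups (?P<name>...) rather than the .NET (?<name>...) form; the variable names are adapted for this example.

```python
import re

line = "DCS120170517220207-FIC-023.FLW  07-FIC-023    00060Y000000011.266525G"

# Four leading characters are matched but not captured;
# the date is captured under the name "date".
pat_date = re.compile(r"^.{4}(?P<date>[2-9][0-9]{3}[0-1][0-9][0-3][0-9])")

m = pat_date.match(line)
whole = m.group(0)          # entire match, including the 4-character prefix
by_name = m.group("date")   # named group
by_index = m.group(1)       # the same group, by number

print(whole)     # DCS120170517
print(by_name)   # 20170517
print(by_index)  # 20170517
```

As in C#, group 0 is always the entire match, and the named group is also reachable by its number.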

How to use regex non-capturing groups format in Python

It isn't included in the inner group, but it's still included as part of the outer group. A non-capturing group doesn't necessarily mean the text isn't captured at all, just that the group itself is not saved in the output. Its text is still captured as part of any enclosing capturing group.
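
A small sketch of that nesting behavior, using an assumed sample string:

```python
import re

sample = "- 428u"  # assumed sample input

# (?:u) sits inside the capturing parentheses, so the "u" is still
# part of what the outer group captures.
nested = re.search(r"(\d+(?:u))", sample).groups()
print(nested)  # ('428u',)

# With (?:u) outside the capturing parentheses, the "u" is matched
# but only the digits are captured.
m = re.search(r"(\d+)(?:u)", sample)
outside = m.groups()
print(outside)     # ('428',)
print(m.group(0))  # 428u
```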

Just do not put them into the () that define the capturing:

import pandas as pd

df = pd.DataFrame(
{'a' : [1,2,3,4],
'b' : ['41u -428u', '31u - 68u', '11u - 58u', '21u - 318u']
})

df['b'].str.extract(r'- ?(\d+)u', expand=True)

     0
0  428
1   68
2   58
3  318

That way you match anything that has a '-' in front (maybe followed by a space), a 'u' behind, and digits between the two.

Where,

-      # literal hyphen
 ?     # optional space (or you could go with \s* if you expect more than one)
(\d+)  # capture one or more digits
u      # literal "u"

Why aren't these non-capturing regex groups working right?

The capture group overrides each previous match. Capture group #1 first matches "1px", then capture group #1 matches "solid" overwriting "1px", then it matches "rgb(255, 255, 255)" overwriting "solid", etc.
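
Here is a minimal Python sketch of that overwriting behavior. The input string and the patterns are simplified assumptions, not the original poster's regex.

```python
import re

value = "1px solid red"  # assumed, simplified input

# A quantified capturing group is matched three times here, but it
# keeps only the text from its final repetition.
m = re.match(r"(?:(\w+)\s*)+", value)
last_only = m.group(1)
print(last_only)  # red

# To collect every token, match one token at a time instead of
# repeating a single capturing group.
tokens = re.findall(r"\w+", value)
print(tokens)  # ['1px', 'solid', 'red']
```

Most engines behave this way; only a few, such as .NET with its Captures collection, keep the earlier repetitions.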

Regex - Non capturing group not working

Try using a lookbehind assertion, in which case the value you want is the whole match, $Matches[0]:

$regex = "(?<=\[').*?(?=')"

or:

$regex = "(?:\[\[')(.*?)(?=')"

$yourstring -match $regex

$Matches[1]
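
To show where the result ends up with each style of pattern, here is a Python sketch. The input string is hypothetical, and the second pattern uses a single \[ so that both patterns apply to the same input.

```python
import re

text = "config['name']"  # hypothetical input

# Lookbehind + lookahead: the delimiters are only checked, so the
# whole match (group 0) is already the value.
m1 = re.search(r"(?<=\[').*?(?=')", text)
lookaround_value = m1.group(0)
print(lookaround_value)  # name

# Non-capturing prefix + capturing group: the whole match includes
# the prefix, and the value is in group 1.
m2 = re.search(r"(?:\[')(.*?)(?=')", text)
print(m2.group(0))  # ['name
captured_value = m2.group(1)
print(captured_value)  # name
```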
