Are Non-Capturing Groups Redundant

Are non-capturing groups redundant?

The patterns (?:wo)?men and (wo)?men match the same strings, but they are not identical: the first uses a non-capturing group and the second a capturing one. So the question is, why use non-capturing groups when we have capturing ones?

Non-capturing groups do help in some cases:

  1. To avoid an excessive number of backreferences (remember that it is sometimes difficult to use backreferences higher than 9)
  2. To stay under the limit of 99 numbered backreferences by reducing the number of numbered capturing groups (source: Regular-expressions.info: "Most regex flavors support up to 99 capturing groups and double-digit backreferences.")
    NOTE: this limit does not apply to the Java regex engine, nor to the PHP or .NET regex engines.
  3. To lessen the overhead caused by storing the captures on the stack
  4. To add more groupings to an existing regex without changing the numbering of its capturing groups.

Also, it simply makes our matches cleaner:

You can use a non-capturing group to retain the organisational or grouping benefits but without the overhead of capturing.
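The difference can be seen in a minimal Python sketch. It reuses the two patterns from the question; the date pattern in the second half is a made-up illustration of point 4, not something from the original question.

```python
import re

# Same text, same overall match; only the recorded groups differ.
non_capturing = re.search(r"(?:wo)?men", "women")
capturing = re.search(r"(wo)?men", "women")

print(non_capturing.group(0), non_capturing.groups())  # women ()
print(capturing.group(0), capturing.groups())          # women ('wo',)

# Point 4: wrapping an extra part in (?:...) does not shift group numbers.
# Hypothetical date pattern: year is group 1, month is group 2 in both cases.
plain_date = re.search(r"(\d{4})-(\d{2})", "2020-05")
prefixed_date = re.search(r"(?:on |since )?(\d{4})-(\d{2})", "since 2020-05")

print(plain_date.groups())     # ('2020', '05')
print(prefixed_date.groups())  # ('2020', '05')
```

Had the prefix been written as a capturing group, the year would have moved to group 2 and every piece of code reading group 1 would need to change.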

It does not seem a good idea to refactor existing regular expressions just to convert capturing groups to non-capturing ones, since it may break code that relies on the group numbers or require too much effort.

Regex optional non-capturing groups

You really need to use optional non-capturing groups (like (?:...)?), but you also need anchors (^ to match the start of the string and $ to match the end of the string) and lazy dot matching patterns (.*?, to match as few chars as possible).

You may use

/^[nN]euer [Ff]ilm\s*(.*?)(?:\s*[vV]on\s+(\d{4}))?(?:\s+[Mm]it\s*(.*?)(?:\s*[uU]nd\s*(.*))?)?$/

See the regex demo. In the demo, /gm modifiers are necessary since the input is a multiline string.

Pattern details:

  • ^ - start of a string anchor
  • [nN]euer [Ff]ilm - Neuer Film / Neuer film / neuer Film / neuer film
  • \s* - zero or more whitespaces
  • (.*?) - Group 1: any 0+ chars other than line break chars, as few as possible (that is, up to the leftmost occurrence of the subsequent subpatterns)
  • (?:\s*[vV]on\s+(\d{4}))? - 1 or 0 occurrences of:

    • \s* - 0+ whitespaces
    • [vV]on - von or Von
    • \s+ - 1+ whitespaces
    • (\d{4}) - Group 2: 4 digits
  • (?:\s+[Mm]it\s*(.*?)(?:\s*[uU]nd\s*(.*))?)? - an optional non-capturing group matching 1 or 0 occurrences of:

    • \s+ - 1+ whitespaces
    • [Mm]it - Mit or mit
    • \s* - 0+ whitespaces
    • (.*?) - Group 3 matching any 0+ chars other than line break chars, as few as possible
    • (?:\s*[uU]nd\s*(.*))? - an optional non-capturing group matching

      • \s*[uU]nd\s* - und or Und enclosed with 0+ whitespaces
      • (.*) - Group 4 matching any 0+ chars other than line break chars, as many as possible
  • $ - end of string.

var strs = [
  'Neuer Film a von 1000',
  'Neuer Film a von 1000 mit b',
  'Neuer Film a von 1000 mit b und c',
  'Neuer Film a von 1000 mit b und c und d',
  'Neuer Film a mit b',
  'Neuer Film a mit b und c',
  'Neuer Film a mit b und c und d'
];
var rx = /^[nN]euer [Ff]ilm\s*(.*?)(?:\s*[vV]on\s+(\d{4}))?(?:\s+[Mm]it\s*(.*?)(?:\s*[uU]nd\s*(.*))?)?$/;
for (var s of strs) {
  var m = rx.exec(s);
  if (m) {
    console.log('-- ' + s + ' ---');
    console.log('Group 1: ' + m[1]);
    if (m[2]) console.log('Group 2: ' + m[2]);
    if (m[3]) console.log('Group 3: ' + m[3]);
    if (m[4]) console.log('Group 4: ' + m[4]);
  }
}
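The remark about the /gm modifiers can be sketched in Python as well. This is only an illustration under assumptions: the two-line input is made up, re.MULTILINE stands in for /m and finditer() for /g.

```python
import re

# Same pattern as above. re.MULTILINE makes ^ and $ match at line
# boundaries; finditer() returns every match, like the /g flag.
rx = re.compile(
    r"^[nN]euer [Ff]ilm\s*(.*?)(?:\s*[vV]on\s+(\d{4}))?"
    r"(?:\s+[Mm]it\s*(.*?)(?:\s*[uU]nd\s*(.*))?)?$",
    re.MULTILINE,
)

# Hypothetical multiline input with one entry per line.
text = "Neuer Film a von 1000\nNeuer Film a mit b und c"

results = [m.groups() for m in rx.finditer(text)]
for groups in results:
    print(groups)
# ('a', '1000', None, None)
# ('a', None, 'b', 'c')
```

Optional groups that did not take part in a match come back as None, which is why the JavaScript snippet checks m[2], m[3] and m[4] before printing them.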

Why is regex search slower with capturing groups in Python?

Your patterns differ only in the capturing groups. When you define a capturing group in the regex pattern and use the pattern with re.search, the result is a MatchObject instance (a re.Match object in Python 3). Each match object contains as many groups as there are capturing groups in the pattern, even if they did not take part in the match. That is the overhead for the re internals: the groups have to be allocated and filled in. Mind that each group also records details such as the start and end index of the text it matched (refer to the MatchObject reference).
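What the match object has to carry can be inspected directly. The patterns from the question are not shown here, so this sketch assumes a simple pair, (?:wo)?men and (wo)?men, purely for illustration.

```python
import re

plain = re.compile(r"(?:wo)?men")   # no capturing groups
grouped = re.compile(r"(wo)?men")   # one capturing group

# Pattern.groups is the number of capturing groups in the pattern.
print(plain.groups, grouped.groups)  # 0 1

hit = grouped.search("women")
print(hit.groups())  # ('wo',)
print(hit.span(1))   # (0, 2)

# The group slot exists even when the group did not participate.
miss = grouped.search("men")
print(miss.groups())  # (None,)
print(miss.span(1))   # (-1, -1)
```

Every capturing group adds one such slot (value plus start and end index) to each match object, while the non-capturing version has nothing extra to fill in.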

python regex return non-capturing group

First of all, a . in a pattern is a metacharacter that matches any char excluding line break chars. You need to escape the . in the regex pattern to match a literal dot.

Also, the {1} limiting quantifier is always redundant; you may safely remove it from any regex you have.

Next, if you need to get a mmylastn string as a result, you cannot use match.group() because .group() fetches the overall match value, not the concatenated capturing group values.

So, in your case,

  • Check if there is a match first, since trying to access None.groups() will throw an exception
  • Then join the values in match.groups()

You can use


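import re

def getUsername(email):
    # Drop the hyphens, then capture the first letter and the first
    # 7 letters after the (escaped) dot.
    m = re.match(r"(.)[a-z]+\.([a-z]{7})", email.replace('-', ''))
    if m:
        return "".join(m.groups())
    # No match: fall back to the original value.
    return email

print(getUsername("my-firstname.my-lastname@email.com"))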

See the Python demo.
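The difference between .group() and .groups() can be checked with the same pattern. This is a small sketch; the second input, "12345", is a made-up string that the pattern cannot match.

```python
import re

pattern = r"(.)[a-z]+\.([a-z]{7})"

m = re.match(pattern, "myfirstname.mylastname@email.com")
whole = m.group()            # overall match, not the groups
parts = m.groups()           # only the capturing group values
joined = "".join(parts)

print(whole)   # myfirstname.mylastn
print(parts)   # ('m', 'mylastn')
print(joined)  # mmylastn

# re.match returns None when nothing matches, so calling
# .groups() on the result without a check would raise AttributeError.
no_match = re.match(pattern, "12345")
print(no_match)  # None
```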

REGEX condition with non-capture groups

It doesn't look like you really need that regex condition.

Why not simply use an optional non-capture group:

([A-Za-z0-9_]+?)[ ]?(?:\(([A-Za-z0-9=\-\/°% ]*)\))?_([A-Za-z0-9]*)$
(the added parts are the (?: before \( and the )? after \), which make the parenthesised part optional)

regex101 demo

[Note: you have 2 = signs in the character class, I removed one of them since it's redundant to use two in a character class]
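A quick way to see the optional group at work is to run the pattern in Python. The two input strings below are invented for this sketch, since the original inputs are not shown.

```python
import re

rx = re.compile(r"([A-Za-z0-9_]+?)[ ]?(?:\(([A-Za-z0-9=\-\/°% ]*)\))?_([A-Za-z0-9]*)$")

# Hypothetical inputs: one with the parenthesised part, one without.
with_parens = rx.search("Temp (°C)_A1")
without_parens = rx.search("Temp_A1")

print(with_parens.groups())     # ('Temp', '°C', 'A1')
print(without_parens.groups())  # ('Temp', None, 'A1')
```

Group 2 is None when the parenthesised part is absent, while groups 1 and 3 keep their numbers in both cases.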

Javascript regex with non-capturing group as two alternatives

You need to match lower- and uppercase letters separately. Currently, your À-ž range for European letters includes all lower- and uppercase letters, and even some non-letters.
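The claim about the À-ž range can be verified with a short Python check. It is only a sketch: it walks the code points of the range and relies on str.isalpha() to decide what counts as a letter.

```python
import re

wide = re.compile("[À-ž]")

# Both cases of the same letter fall inside the range.
both_cases = bool(wide.fullmatch("É")) and bool(wide.fullmatch("é"))
print(both_cases)  # True

# Code points in À-ž that are not letters at all.
non_letters = [
    chr(cp)
    for cp in range(ord("À"), ord("ž") + 1)
    if not chr(cp).isalpha()
]
print(non_letters)  # ['×', '÷']

# The range matches them too.
matches_non_letters = all(wide.fullmatch(ch) for ch in non_letters)
print(matches_non_letters)  # True
```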

Here are the ranges you need:

Uppercase (basic European)

  • Basic Latin — Uppercase Latin alphabet: [A-Z]
  • Latin 1 Supplement — Letter items - Uppercase: [À-ÖØ-Þ]
  • Latin Extended A — European Latin - Uppercase letters: [ĀĂĄĆĈĊČĎĐĒĔĖĘĚĜĞĠĢĤĦĨĪĬĮİĲĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŸŹŻŽ]

Lowercase (basic European)

  • Basic Latin — Lowercase Latin alphabet: [a-z]
  • Latin 1 Supplement — Letter items - Lowercase: [ß-öø-ÿ]
  • Latin Extended A — European Latin - Lowercase letters: [žſāăąćĉċčďđēĕėęěĝğġģĥħĩīĭįıĳĵķĸĺļľŀłńņňŋōŏőœŕŗřśŝşšţťŧũūŭůűųŵŷźż]

The pattern you need is

/^[UPPER][lower]+(?:[\s'-][UPPER][lower]+)*$/

where UPPER and lower are uppercase and lowercase letter ranges/sets.

So, let's build the pattern.

var upper = '[A-ZÀ-ÖØ-ÞĀĂĄĆĈĊČĎĐĒĔĖĘĚĜĞĠĢĤĦĨĪĬĮİĲĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŸŹŻŽ]';
var lower = '[a-zß-öø-ÿžſāăąćĉċčďđēĕėęěĝğġģĥħĩīĭįıĳĵķĸĺļľŀłńņňŋōŏőœŕŗřśŝşšţťŧũūŭůűųŵŷźż]';
var rx = new RegExp("^" + upper + lower + "+(?:[\\s'-]" + upper + lower + "+)*$");
// Let's test
var tests = ['Test ', 'Test - ', 'Test-', ' test', 'Test-test', 'TTest', 'Test\'test', 'Test', 'Test-Test', 'Test Test', 'Test\'Test', 'Łóźćż\'żłóźćęą'];
for (var s of tests) {
  console.log(s, '=>', rx.test(s));
}

Non capturing group is not available in the replacement output

You may use this regex with a capture group and a lambda function in re.sub:

>>> import re
>>> s = r'Sample|text||new line|||cFFFFFF00|HEX|colorText in color|this will be inner new line|cFFFFFFFF|HEX|colorReset color. The following goes into the next line too:||hello world'
>>> print(re.sub(r'(\|\|\w{9}\|HEX\|color.*?|([\|])?\|\w{9}\|HEX\|color)|\|', lambda m: m.group(1) if m.group(1) else '\n', s))
Sample
text

new line
||cFFFFFF00|HEX|colorText in color
this will be inner new line|cFFFFFFFF|HEX|colorReset color. The following goes into the next line too:

hello world
  • In the regex, we use a capture group for the text that we want to keep in the replacement string.

  • The lambda function checks whether the 1st capture group matched: if it did, the group is put back unchanged; otherwise, the | is replaced with \n.
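The same keep-or-replace idea can be shown on a smaller case. This is a simplified sketch with made-up input, not the pattern from the question: commas are turned into semicolons unless they sit inside a double-quoted string.

```python
import re

# Group 1 captures what must be kept (a quoted string);
# the second alternative matches what must be replaced (a bare comma).
keep_or_replace = re.compile(r'("[^"]*")|,')

# Hypothetical input for the illustration.
line = 'a,"b,c",d'

result = keep_or_replace.sub(lambda m: m.group(1) or ";", line)
print(result)  # a;"b,c";d
```

Since the text to keep sits in a capturing group, the callback can hand it back as is; a non-capturing group would match the same text but leave nothing to put back.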


