Are non-capturing groups redundant?
Your (?:wo)?men and (wo)?men are semantically equivalent but technically different: the first uses a non-capturing group and the second a capturing group. So the question is, why use non-capturing groups when we have capturing ones?
Non-capturing groups help in several ways:
- To avoid an excessive number of backreferences (remember that it is sometimes difficult to use backreferences higher than 9)
- To avoid the problem with the limit of 99 numbered backreferences, by reducing the number of numbered capturing groups (source: Regular-expressions.info: Most regex flavors support up to 99 capturing groups and double-digit backreferences.) NOTE: this does not pertain to the Java regex engine, nor to the PHP or .NET regex engines.
- To lessen the overhead caused by storing the captures in the stack
- To add more groupings to an existing regex without ruining the order of capturing groups.
Also, it just makes our matches cleaner:
You can use a non-capturing group to retain the organisational or grouping benefits but without the overhead of capturing.
It does not seem a good idea to re-factor existing regular expressions to convert capturing to non-capturing groups, since it may ruin the code or require too much effort.
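As a minimal sketch of the points above (Python's re module is used purely for illustration, and the sample strings and the date-like pattern are made up), the two patterns match the same text but expose different groups, and a non-capturing group can be added without renumbering the existing ones:

```python
import re

capturing = re.compile(r"(wo)?men")
non_capturing = re.compile(r"(?:wo)?men")

text = "women"

# Both patterns match exactly the same text...
print(capturing.search(text).group(0))      # women
print(non_capturing.search(text).group(0))  # women

# ...but only the capturing version stores the submatch.
print(capturing.search(text).groups())      # ('wo',)
print(non_capturing.search(text).groups())  # ()

# Adding a non-capturing group leaves the existing numbering intact.
before = re.compile(r"(\d{4})-(\d{2})")
after = re.compile(r"(?:ISO )?(\d{4})-(\d{2})")
print(before.search("2020-05").groups())     # ('2020', '05')
print(after.search("ISO 2020-05").groups())  # ('2020', '05')
```

Code that reads group 1 and group 2 keeps working after the optional prefix is added, because (?:ISO )? does not take a group number.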
Regex optional non-capturing groups
You really need to use optional non-capturing groups (like (?:...)?), but you also need anchors (^ to match the start of the string and $ to match the end of the string) and lazy dot matching patterns (.*?, to match as few chars as possible).
You may use
/^[nN]euer [Ff]ilm\s*(.*?)(?:\s*[vV]on\s+(\d{4}))?(?:\s+[Mm]it\s*(.*?)(?:\s*[uU]nd\s*(.*))?)?$/
See the regex demo. In the demo, the /gm modifiers are necessary since the input is a multiline string.
Pattern details:
- ^ - start of string anchor
- [nN]euer [Ff]ilm - Neuer film / Neuer Film / neuer Film
- \s* - zero or more whitespaces
- (.*?) - Group 1: any 0+ chars other than line break chars, as few as possible (that is, up to the leftmost occurrence of the subsequent subpatterns)
- (?:\s*[vV]on\s+(\d{4}))? - 1 or 0 occurrences of:
  - \s* - 0+ whitespaces
  - [vV]on - von or Von
  - \s+ - 1+ whitespaces
  - (\d{4}) - Group 2: 4 digits
- (?:\s+[Mm]it\s*(.*?)(?:\s*[uU]nd\s*(.*))?)? - an optional non-capturing group matching 1 or 0 occurrences of:
  - \s+ - 1+ whitespaces
  - [Mm]it - Mit or mit
  - \s* - 0+ whitespaces
  - (.*?) - Group 3 matching any 0+ chars other than line break chars, as few as possible
  - (?:\s*[uU]nd\s*(.*))? - an optional non-capturing group matching:
    - \s*[uU]nd\s* - und or Und enclosed with 0+ whitespaces
    - (.*) - Group 4 matching any 0+ chars other than line break chars, as many as possible
- $ - end of string.
var strs = [
  'Neuer Film a von 1000',
  'Neuer Film a von 1000 mit b',
  'Neuer Film a von 1000 mit b und c',
  'Neuer Film a von 1000 mit b und c und d',
  'Neuer Film a mit b',
  'Neuer Film a mit b und c',
  'Neuer Film a mit b und c und d'
];
var rx = /^[nN]euer [Ff]ilm\s*(.*?)(?:\s*[vV]on\s+(\d{4}))?(?:\s+[Mm]it\s*(.*?)(?:\s*[uU]nd\s*(.*))?)?$/;
for (var s of strs) {
  var m = rx.exec(s);
  if (m) {
    console.log('-- ' + s + ' ---');
    console.log('Group 1: ' + m[1]);
    // Groups 2-4 are optional, so print them only when they participated in the match
    if (m[2]) console.log('Group 2: ' + m[2]);
    if (m[3]) console.log('Group 3: ' + m[3]);
    if (m[4]) console.log('Group 4: ' + m[4]);
  }
}
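For comparison, here is a small sketch of the same pattern in Python (an assumption for illustration; the question itself is about JavaScript). It shows that optional groups that did not take part in the match come back as None, which is why the optional captures should be checked before use:

```python
import re

rx = re.compile(
    r"^[nN]euer [Ff]ilm\s*(.*?)"
    r"(?:\s*[vV]on\s+(\d{4}))?"
    r"(?:\s+[Mm]it\s*(.*?)(?:\s*[uU]nd\s*(.*))?)?$"
)

samples = [
    "Neuer Film a von 1000",
    "Neuer Film a mit b",
    "Neuer Film a von 1000 mit b und c und d",
]

for s in samples:
    m = rx.match(s)
    # groups() reports None for optional groups that did not participate
    print(s, "->", m.groups() if m else None)
# Neuer Film a von 1000 -> ('a', '1000', None, None)
# Neuer Film a mit b -> ('a', None, 'b', None)
# Neuer Film a von 1000 mit b und c und d -> ('a', '1000', 'b', 'c und d')
```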
Why is regex search slower with capturing groups in Python?
Your patterns only differ in the capturing groups. When you define a capturing group in the regex pattern and use the pattern with re.search, the result will be a MatchObject instance. Each match object will contain as many groups as there are capturing groups in the pattern, even if they are empty. That is the overhead for the re internals: adding the (list of) groups (memory allocation, etc.). Mind that groups also contain such details as the starting and ending index of the text that they match and more (refer to the MatchObject reference).
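A rough sketch of that difference, assuming a made-up sample string and two patterns that differ only in their groups. The timing loop is only indicative: absolute numbers vary by machine and Python version, so no figures are quoted here.

```python
import re
import timeit

text = "order 12345-678 shipped"

with_groups = re.compile(r"(\d+)-(\d+)")
without_groups = re.compile(r"(?:\d+)-(?:\d+)")

m1 = with_groups.search(text)
m2 = without_groups.search(text)

# Same overall match, different amount of bookkeeping.
print(m1.group(0), m2.group(0))  # 12345-678 12345-678
print(with_groups.groups)        # 2
print(without_groups.groups)     # 0
print(m1.span(1), m1.span(2))    # (6, 11) (12, 15)

# Rough timing of both variants on the same input.
for name, rx in (("capturing", with_groups), ("non-capturing", without_groups)):
    seconds = timeit.timeit(lambda: rx.search(text), number=100_000)
    print(name, round(seconds, 4))
```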
python regex return non-capturing group
First of all, a . in a pattern is a metacharacter that matches any char excluding line break chars. You need to escape the . in the regex pattern.
Also, the {1} limiting quantifier is always redundant; you may safely remove it from any regex you have.
Next, if you need to get a mmylastn string as a result, you cannot use match.group() because .group() fetches the overall match value, not the concatenated capturing group values.
So, in your case,
- Check if there is a match first, since trying to access None.groups() will throw an exception
- Then join the match.groups()
You can use
import re
def getUsername(email):
    m = re.match(r"(.)[a-z]+\.([a-z]{7})", email.replace('-', ''))
    if m:
        return "".join(m.groups())
    return email
print(getUsername("my-firstname.my-lastname@email.com"))
See the Python demo.
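The three points above can be checked in isolation with a short sketch (the sample strings and the simplified pattern are assumptions made for illustration, not the original poster's data):

```python
import re

# Unescaped dot: matches any character, so "aXc" is accepted too.
print(bool(re.fullmatch(r"a.c", "aXc")))   # True
print(bool(re.fullmatch(r"a\.c", "aXc")))  # False
print(bool(re.fullmatch(r"a\.c", "a.c")))  # True

# {1} changes nothing: both patterns find the same matches.
print(re.findall(r"[a-z]{1}\d", "a1 b2"))  # ['a1', 'b2']
print(re.findall(r"[a-z]\d", "a1 b2"))     # ['a1', 'b2']

m = re.match(r"(\w+)\.(\w+)", "first.last@example.com")
print(m.group())            # first.last (the whole match)
print(m.groups())           # ('first', 'last')
print("".join(m.groups()))  # firstlast
```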
REGEX condition with non-capture groups
It doesn't look like you really need that regex condition.
Why not simply use an optional non-capturing group, (?:...)?, around the parenthesised part:
([A-Za-z0-9_]+?)[ ]?(?:\(([A-Za-z0-9=\-\/°% ]*)\))?_([A-Za-z0-9]*)$
regex101 demo
[Note: you have 2 = signs in the character class; I removed one of them since it's redundant to use two in a character class]
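A quick way to see the optional group at work is the sketch below. The two input strings are hypothetical, chosen only to exercise both branches; Python is used for illustration.

```python
import re

rx = re.compile(r"([A-Za-z0-9_]+?)[ ]?(?:\(([A-Za-z0-9=\-\/°% ]*)\))?_([A-Za-z0-9]*)$")

for s in ("Temp (°C)_avg", "Speed_max"):
    # Group 2 is None when the parenthesised part is absent
    print(s, "->", rx.search(s).groups())
# Temp (°C)_avg -> ('Temp', '°C', 'avg')
# Speed_max -> ('Speed', None, 'max')
```

The group numbers stay the same in both cases, so no conditional construct is needed to tell the two input shapes apart.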
Javascript regex with non-capturing group as two alternatives
You need to match lower- and uppercase letters separately. Currently, your À-ž range for European letters includes all lower- and uppercase letters, and even some non-letters.
Here are the ranges you need:
Uppercase (basic European)
- Basic Latin (Uppercase Latin alphabet): [A-Z]
- Latin 1 Supplement (Letter items - Uppercase): [À-ÖØ-Þ]
- Latin Extended A (European Latin - Uppercase letters): [ĀĂĄĆĈĊČĎĐĒĔĖĘĚĜĞĠĢĤĦĨĪĬĮİIJijĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒœŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŸŹŻŽ]
Lowercase (basic European)
- Basic Latin (Lowercase Latin alphabet): [a-z]
- Latin 1 Supplement (Letter items - Lowercase): [ß-öø-ÿ]
- Latin Extended A (European Latin - Lowercase letters): [žſāăąćĉċčďđēĕėęěĝğġģĥħĩīĭįıĵķĸĺļľŀłńņňŋōŏőŕŗřśŝşšţťŧũūŭůűųŵŷźż]
The pattern you need is
/^[UPPER][lower]+(?:[\s'-][UPPER][lower]+)*$/
where UPPER and lower are uppercase and lowercase letter ranges/sets.
So, let's build the pattern.
var upper = '[A-ZÀ-ÖØ-ÞĀĂĄĆĈĊČĎĐĒĔĖĘĚĜĞĠĢĤĦĨĪĬĮİIJijĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒœŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŸŹŻŽ]';
var lower = '[a-zß-öø-ÿžſāăąćĉċčďđēĕėęěĝğġģĥħĩīĭįıĵķĸĺļľŀłńņňŋōŏőŕŗřśŝşšţťŧũūŭůűųŵŷźż]';
var rx = new RegExp("^" + upper + lower + "+(?:[\\s'-]" + upper + lower + "+)*$");
// Let's test
var tests = ['Test ','Test - ','Test-',' test','Test-test','TTest','Test\'test','Test','Test-Test','Test Test','Test\'Test', 'Łóźćż\'żłóźćęą'];
for (var s of tests) {
  console.log(s, '=>', rx.test(s));
}
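The same construction can be sketched in Python. To keep it short, this version assumes reduced letter sets (Basic Latin plus Latin 1 Supplement only, without the Latin Extended A characters), and the helper name is_valid_name is hypothetical:

```python
import re

# Assumption: reduced letter sets (Basic Latin + Latin 1 Supplement only).
UPPER = "A-ZÀ-ÖØ-Þ"
LOWER = "a-zß-öø-ÿ"

# One capitalised word: an uppercase letter followed by 1+ lowercase letters.
word = f"[{UPPER}][{LOWER}]+"
name_rx = re.compile(rf"{word}(?:[\s'-]{word})*")

def is_valid_name(s):
    # fullmatch() anchors at both ends, so ^ and $ are not needed
    return name_rx.fullmatch(s) is not None

for s in ["Test", "Test-Test", "Test Test", "Émile Zola", "Test-test", "TTest", "Test "]:
    print(repr(s), "=>", is_valid_name(s))
```

The non-capturing group (?:...)* repeats the "separator plus capitalised word" unit without creating a group that nobody reads.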
Non capturing group is not available in the replacement output
You may use this regex with a capture group and a lambda function in re.sub:
>>> import re
>>> s=r'Sample|text||new line|||cFFFFFF00|HEX|colorText in color|this will be inner new line|cFFFFFFFF|HEX|colorReset color. The following goes into the next line too:||hello world'
>>> print(re.sub(r'(\|\|\w{9}\|HEX\|color.*?|([\|])?\|\w{9}\|HEX\|color)|\|', lambda m: m.group(1) if m.group(1) else '\n', s))
Sample
text

new line
||cFFFFFF00|HEX|colorText in color
this will be inner new line|cFFFFFFFF|HEX|colorReset color. The following goes into the next line too:

hello world
In the regex, we use a capture group for the text that we want to keep in the replacement string.
The code in the lambda function checks for the presence of the 1st capture group: if it is there, it just puts it back; otherwise, it replaces | with \n.
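The same keep-or-replace idea can be sketched on a smaller, made-up input: commas are turned into newlines unless they sit inside double quotes. The sample string and pattern here are assumptions chosen for brevity, not part of the original question.

```python
import re

s = 'a,"b,c",d'

# Alternative 1 captures what must be kept as is (quoted text);
# alternative 2 matches what should be replaced (a bare comma).
rx = re.compile(r'("[^"]*")|,')

# Put group 1 back when it matched, otherwise emit a newline.
result = rx.sub(lambda m: m.group(1) if m.group(1) else "\n", s)
print(result)
# a
# "b,c"
# d
```

The comma inside the quotes survives because the first alternative consumes the whole quoted chunk before the second alternative gets a chance to match it.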