Python Regular Expressions - How to Capture Multiple Groups from a Wildcard Expression

Python regular expressions - how to capture multiple groups from a wildcard expression?

In addition to Douglas Leeder's solution, here is the explanation:

In regular expressions the group count is fixed. Placing a quantifier after a group does not increase the group count (imagine all other group indexes incrementing because an earlier group matched more than once).

A quantified group is a way to treat a complex sub-expression as a single unit when it needs to match more than once. The regex engine has no option but to store only the last match in the group. In short: there is no way to achieve what you want with a single "unarmed" regular expression, and you have to find another way.
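A minimal sketch of this behaviour in Python. The sample string and pattern are made up for illustration and are not taken from the question:

```python
import re

# Hypothetical input: three numbers separated by single spaces.
pattern = re.compile(r'(\d+ ?)+')
m = pattern.fullmatch('10 20 30')

# The pattern has exactly one group, however often it repeats.
group_count = pattern.groups
# Only the final repetition is kept in group 1.
last_value = m.group(1)

print(group_count, repr(last_value))
```

The earlier repetitions ("10 " and "20 ") did take part in the match, but they are no longer reachable through the match object.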

Can Regex groups and * wildcards work together?

The problem is that you repeat your only capturing group. You have only one pair of parentheses, hence only one capturing group, and that group is overwritten each time it matches.

See "Repeating a Capturing Group vs. Capturing a Repeated Group" on regular-expressions.info for more information. (But capturing a repeated group is also not what you want.)

So, after your regex is done, your capturing group 1 will contain the last found "foo".
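The two forms named in that article can be sketched as follows. The sample string is an assumption, chosen so that each repetition matches something different:

```python
import re

text = 'abc'

# Repeating a capturing group: the group is overwritten on every
# iteration, so only the last character survives.
repeated_group = re.fullmatch(r'([a-c])+', text).group(1)

# Capturing a repeated group: the quantifier sits inside the
# parentheses, so the group holds the whole run as one string.
captured_repeat = re.fullmatch(r'((?:[a-c])+)', text).group(1)

print(repeated_group, captured_repeat)
```

Neither form yields a list of the individual repetitions, which is why a different approach is needed.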

This would give you the expected result:

import re

my_str = "foofoofoofoo"
pattern = "foo"
# findall returns every non-overlapping match as a list
result = re.findall(pattern, my_str)

result is then a list ['foo', 'foo', 'foo', 'foo']

python regular expression repeating group matches

You can use re.findall for this.

import re

# x is the input string from the question
result = re.findall(r'\s\d+', x)

print(result[1])  # 2958
print(result[3])  # 3103
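Note that the pattern \s\d+ includes the leading whitespace character in each match. If only the digits are wanted, one option is to put them in a capturing group, since re.findall then returns the group contents instead of the whole match. A small sketch, using a hypothetical sample line because the original string x is not shown here:

```python
import re

# Hypothetical sample line; not the string from the question.
sample = 'id 2901 2958 3050 3103'

# Whole match: each item keeps its leading whitespace character.
with_space = re.findall(r'\s\d+', sample)

# One capturing group: findall returns only the group contents.
digits_only = re.findall(r'\s(\d+)', sample)

print(with_space)
print(digits_only)
```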

Is there a method to get all groups in regular expression with wildcard in python

You could use re.finditer to iterate the matches, appending each result to an empty tuple:

import re

res = tuple()
matches = re.finditer(r' ([a-z]+) ([0-9]+)', ' a 1 b 2 c 3')
for m in matches:
    res = res + m.groups()

Output:

('a', '1', 'b', '2', 'c', '3')

Note that in the regex the outer group is removed as it is not required with finditer.
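If the letter/number pairs should stay together, re.findall with the same pattern is an alternative: with more than one capturing group it returns a list of tuples. A short sketch reusing the pattern and string from above (the variable names are illustrative):

```python
import re

text = ' a 1 b 2 c 3'

# With two capturing groups, findall returns one tuple per match.
pairs = re.findall(r' ([a-z]+) ([0-9]+)', text)

# Flatten the pairs if a single tuple is wanted, as in the finditer loop.
flat = tuple(item for pair in pairs for item in pair)

print(pairs)
print(flat)
```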

How do I regex match with grouping with unknown number of groups

What you're looking for is a parser, instead of a regular expression match. In your case, I would consider using a very simple parser, split():

s = "VALUE 100 234 568 9233 119"
a = s.split()
if a[0] == "VALUE":
    # convert every token after the keyword to an integer
    print([int(x) for x in a[1:]])

You can use a regular expression to check that the input line matches the expected format (using the regex in your question). Once it does, you can run the code above without checking for "VALUE", knowing that the int(x) conversions will always succeed because the remaining character groups have already been confirmed to be all digits.
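The validate-then-split idea might look like the sketch below. The pattern LINE_RE and the helper parse_values are hypothetical, since the regex from the question is not reproduced here; the pattern assumes single spaces between tokens:

```python
import re

# Assumed format: the keyword VALUE followed by one or more
# space-separated runs of ASCII digits.
LINE_RE = re.compile(r'VALUE(?: [0-9]+)+')


def parse_values(line):
    """Return the integers after VALUE, or None if the line is malformed."""
    if LINE_RE.fullmatch(line) is None:
        return None
    # Safe: the regex has already confirmed every token is all digits.
    return [int(x) for x in line.split()[1:]]


print(parse_values("VALUE 100 234 568 9233 119"))
print(parse_values("VALUE 100 abc"))
```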

Return the content of a Wildcard match in Python

Use a capturing group:

>>> import re
>>> re.search('stack(.*)flow', 'stackoverflow').group(1)
'over'
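Keep in mind that re.search returns None when nothing matches, so calling .group(1) directly raises AttributeError in that case. A guarded variant could look like this; the helper name between and its default arguments are made up for illustration:

```python
import re


def between(text, start='stack', end='flow'):
    """Return the text between start and end, or None if there is no match."""
    # re.escape keeps any special characters in start/end literal.
    m = re.search(re.escape(start) + r'(.*)' + re.escape(end), text)
    return m.group(1) if m else None


print(between('stackoverflow'))
print(between('stack only'))
```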

Regex multiple expression

You should use lxml for this.

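from lxml import etree

# xml_string holds the XML document from the question
xml = etree.fromstring(xml_string)
# select every <ins> element that has an <insacc> child
ins_tags = xml.xpath('//ins[./insacc]')
for ins_tag in ins_tags:
    pass  # do work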

Isn't it simple?

perl regex: how to capture multi group use only one express?

It is not possible to capture this in one expression, as explained here: Python regular expressions - how to capture multiple groups from a wildcard expression?

A solution to your problem might be this:

use strict;
use warnings;
use Data::Dumper;

my $text = "all directions are: nw, sw, se, w, ..., s and e.";
my @capture;
if ($text =~ s/all directions are: (\w+),\s+(.*)/$2/) {
    push @capture, $1;
    while ($text =~ s/(\w+),\s+(.*)/$2/) {
        push @capture, $1;
    }
    if ($text =~ /(\w+)\s+and\s+(\w+)\./) {
        push @capture, $1;
        push @capture, $2;
    }
}
print Dumper \@capture;

matching any character including newlines in a Python regex subexpression, not globally

To match a newline, or "any symbol" without re.S/re.DOTALL, you may use any of the following:

  1. (?s). - the inline modifier group with s flag on sets a scope where all . patterns match any char including line break chars

  2. Any of the following work-arounds:

[\s\S]
[\w\W]
[\d\D]

The main idea is that the opposite shorthand classes inside a character class match any symbol there is in the input string.
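A quick sketch of the options above, using a made-up two-line string. The scoped form (?s:.) needs Python 3.6 or later:

```python
import re

text = 'a\nb'

# Without DOTALL, . does not match the newline, so there is no match.
dot_default = re.search(r'a.b', text)

# Inline modifier group: DOTALL applies only inside the parentheses.
dot_inline = re.search(r'a(?s:.)b', text)

# Opposite shorthand classes together cover every character.
char_class = re.search(r'a[\s\S]b', text)

print(dot_default, dot_inline, char_class)
```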

Comparing it to (.|\s) and other variations with alternation, the character class solution is much more efficient as it involves much less backtracking (when used with a * or + quantifier). Compare the small example: it takes (?:.|\n)+ 45 steps to complete, and it takes [\s\S]+ just 2 steps.

See a Python demo where I am matching a line starting with 123 and up to the first occurrence of 3 at the start of a line and including the rest of that line:

import re
text = """abc
123
def
356
more text..."""
print( re.findall(r"^123(?s:.*?)^3.*", text, re.M) )
# => ['123\ndef\n356']
print( re.findall(r"^123[\w\W]*?^3.*", text, re.M) )
# => ['123\ndef\n356']

