Regular Expression with Variable Number of Groups

How do I regex match with grouping with unknown number of groups

What you're looking for is a parser, instead of a regular expression match. In your case, I would consider using a very simple parser, split():

s = "VALUE 100 234 568 9233 119"
a = s.split()
if a[0] == "VALUE":
    print([int(x) for x in a[1:]])

You can use a regular expression to check that your input line matches the expected format (using the regex in your question). Once it does, you can run the code above without checking for "VALUE", and the int(x) conversions will always succeed because you've already confirmed that the following character groups are all digits.
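A minimal sketch of that validate-then-split idea. The regex from the question is not shown here, so LINE_FORMAT below is an assumed stand-in, and parse_values is a hypothetical helper name:

```python
import re

# Assumed format: the word VALUE followed by one or more whitespace-separated integers.
LINE_FORMAT = re.compile(r"VALUE(?:\s+\d+)+\s*$")

def parse_values(line):
    """Return the integers on a VALUE line, or None if the line is malformed."""
    if not LINE_FORMAT.match(line):
        return None
    # The format check guarantees every token after the first is all digits.
    return [int(x) for x in line.split()[1:]]

print(parse_values("VALUE 100 234 568 9233 119"))
print(parse_values("VALUE 100 abc"))
```

The regex only answers "is this line well-formed?"; split() does the actual extraction, so the number of values never has to be known in advance.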

python: regex - catch variable number of groups

In short, it's impossible to do all of this in the re engine. You cannot generate more groups dynamically; everything ends up in one group. You should re-parse that group's contents like so:

import re
input_str = ("TABLE_ENTRY.0[0x1234]= <FIELD_1=0x1234, FIELD_2=0x1234, FIELD_3=0x1234>\n"
             "TABLE_ENTRY.1[0x1235]= <FIELD_1=0x1235, FIELD_2=0x1235, FIELD_3=0x1235>")
results = {}
for match in re.finditer(r"([A-Z_0-9\.]+\[0x[0-9A-F]+\])=\s+<([^>]*)>", input_str):
    fields = match.group(2).split(", ")
    results[match.group(1)] = dict(f.split("=") for f in fields)

>>> results
{'TABLE_ENTRY.0[0x1234]': {'FIELD_2': '0x1234', 'FIELD_1': '0x1234', 'FIELD_3': '0x1234'}, 'TABLE_ENTRY.1[0x1235]': {'FIELD_2': '0x1235', 'FIELD_1': '0x1235', 'FIELD_3': '0x1235'}}

The output is one large dict that maps each table entry to a dict of its fields.

It's also rather convenient, as you can do this:

>>> results["TABLE_ENTRY.0[0x1234]"]["FIELD_2"]
'0x1234'

I personally suggest stripping off "TABLE_ENTRY" as it's repetitive, but as you wish.

Regular expression with variable number of groups?

According to the documentation, Java regular expressions can't do this:

The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.

(emphasis added)
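The "most recently matched" rule is not specific to Java. A minimal sketch using Python's re module, chosen here only for illustration (the input string is made up), shows a repeated group keeping just its last capture, while a find-all call recovers every value:

```python
import re

text = "10 20 30"

# One capturing group inside a repeated non-capturing group:
# each iteration overwrites the previous capture.
m = re.match(r"(?:(\d+)\s*)+", text)
last_only = m.group(1)

# Matching the unit pattern repeatedly returns every occurrence.
all_values = re.findall(r"\d+", text)

print(last_only)
print(all_values)
```

Here last_only holds only the final number, which is why a single pattern cannot return a variable number of groups.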

Regex that grabs variable number of groups

I'd do something like:

from collections import defaultdict
import re

comment_line = re.compile(r"\s*#")
matches = defaultdict(dict)

with open('path/to/file.txt') as inf:
    d = {}  # should catch and dispose of any matching lines
            # not related to a class
    for line in inf:
        if comment_line.match(line):
            continue  # skip this line
        if line.startswith('class '):
            classname = line.split()[1]
            d = matches[classname]
        if line.startswith('model'):
            d['model'] = line.split('=')[1].strip()
        if line.startswith('fields'):
            d['fields'] = line.split('=')[1].strip()
        if line.startswith('write_once_fields'):
            d['write_once_fields'] = line.split('=')[1].strip()
        if line.startswith('required_fields'):
            d['required_fields'] = line.split('=')[1].strip()

You could probably do this more easily with regex matching.

comment_line = re.compile(r"\s*#")
class_line = re.compile(r"class (?P<classname>\w+)")  # \w+ so the group actually captures the name
possible_keys = ["model", "fields", "write_once_fields", "required_fields"]
data_line = re.compile(r"\s*(?P<key>" + "|".join(possible_keys) +
                       r")\s+=\s+(?P<value>.*)")

with open( ...
    d = {} # default catcher as above
    for line in ...
        if comment_line.match(line):
            continue
        class_match = class_line.match(line)
        if class_match:
            d = matches[class_match.group('classname')]
            continue # there won't be more than one match per line
        data_match = data_line.match(line)
        if data_match:
            key, value = data_match.group('key'), data_match.group('value')
            d[key] = value

But this might be harder to understand. YMMV.
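For reference, here is a self-contained sketch of the regex-based variant that runs on an in-memory string instead of a file. The SAMPLE text and the parse function name are hypothetical, since the real input layout is not shown, and the class-name group is assumed to be a run of word characters:

```python
import re
from collections import defaultdict

# Hypothetical sample input; the real file layout is an assumption.
SAMPLE = """\
# a comment line
class Foo:
    model = FooModel
    fields = a, b
class Bar:
    model = BarModel
    required_fields = x
"""

comment_line = re.compile(r"\s*#")
class_line = re.compile(r"class (?P<classname>\w+)")
possible_keys = ["model", "fields", "write_once_fields", "required_fields"]
data_line = re.compile(r"\s*(?P<key>" + "|".join(possible_keys) +
                       r")\s+=\s+(?P<value>.*)")

def parse(text):
    matches = defaultdict(dict)
    d = {}  # catches data lines that appear before any class line
    for line in text.splitlines():
        if comment_line.match(line):
            continue
        class_match = class_line.match(line)
        if class_match:
            d = matches[class_match.group("classname")]
            continue
        data_match = data_line.match(line)
        if data_match:
            d[data_match.group("key")] = data_match.group("value")
    return dict(matches)

parsed = parse(SAMPLE)
print(parsed)
```

Each class gets its own dict, so the number of keys per class can vary freely without changing the patterns.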

Regular expression: matching and grouping a variable number of space separated words

re.match only matches at the start of the string. Use re.search instead.

.*? returns the shortest match between two words/expressions (. matches any character except a newline, * means 0 or more occurrences, and ? after a quantifier makes it match as little as possible).

import re
my_str = "foo hello world baz 33"
my_pattern = r'foo\s(.*?)\sbaz'
p = re.search(my_pattern,my_str,re.I)
result = p.group(1).split()
print(result)

['hello', 'world']

EDIT:

In case foo or baz are missing, and you need to return the entire string, use an if-else:

if p is not None:
    result = p.group(1).split()
else:
    result = my_str

Why the ? in the pattern:

Suppose there are multiple occurrences of the word baz:

my_str = "foo hello world baz 33 there is another baz"

Using pattern = r'foo\s(.*)\sbaz' gives the longest (greedy) match:

'hello world baz 33 there is another'

whereas using pattern = r'foo\s(.*?)\sbaz' returns the shortest match:

'hello world'
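The greedy and lazy patterns can be compared side by side with a short sketch that reuses the string from above (the variable names greedy and lazy are just for illustration):

```python
import re

my_str = "foo hello world baz 33 there is another baz"

# Greedy: .* runs to the last "baz" that still lets the pattern match.
greedy = re.search(r"foo\s(.*)\sbaz", my_str).group(1)

# Lazy: .*? stops at the first "baz" that lets the pattern match.
lazy = re.search(r"foo\s(.*?)\sbaz", my_str).group(1)

print(greedy.split())
print(lazy.split())
```

Either way, the variable number of words comes from split() on the single captured group, not from extra regex groups.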

Regex with variable number of groups in ruby or a workaround

Anyway, this one works using the \G construct, tested with Ruby 1.9.3 on Rubular.

In a single match, it grabs the first 5 points and skips the 6th; the next match then picks up where the previous one ended and repeats.

(?:(?!^)\G[-,\s]|C)\s*(-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)[-,\s](-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)[-,\s](-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)[-,\s](-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)[-,\s](-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)(?:[-,\s]-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)?

Explained

(?:
    (?! ^ ) # Not BOS
    \G # Start where last match left off to get next 5 pts.
    [-,\s] # required separator
  | # or,
    C # C - the start of a block of pts.
)
# The first/next 5 pts. captured
\s*
( # (1 start)
    -? \d+
    (?: \. \d+ )?
    (?: [eE] [+-]? \d+ )?
) # (1 end)
[-,\s]
( # (2 start)
    -? \d+
    (?: \. \d+ )?
    (?: [eE] [+-]? \d+ )?
) # (2 end)
[-,\s]
( # (3 start)
    -? \d+
    (?: \. \d+ )?
    (?: [eE] [+-]? \d+ )?
) # (3 end)
[-,\s]
( # (4 start)
    -? \d+
    (?: \. \d+ )?
    (?: [eE] [+-]? \d+ )?
) # (4 end)
[-,\s]
( # (5 start)
    -? \d+
    (?: \. \d+ )?
    (?: [eE] [+-]? \d+ )?
) # (5 end)

(?: # Skip the 6th pt.
    [-,\s]
    -? \d+
    (?: \. \d+ )?
    (?: [eE] [+-]? \d+ )?
)?
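Python's built-in re module has no \G anchor, so the pattern above cannot be ported directly. A different workaround, sketched below, is to collect every number in a block and slice the list in steps of six, keeping the first five of each slice. It assumes points are separated only by commas or whitespace (not by a bare "-"), and five_point_groups is a hypothetical helper name:

```python
import re

# Same number syntax as in the pattern above.
NUMBER = r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"

# Assumption: a block is "C" followed by numbers separated by commas/whitespace.
BLOCK = re.compile(r"C\s*((?:" + NUMBER + r"[,\s]*)+)")

def five_point_groups(text):
    """Return lists of five numbers (as strings), skipping every sixth one."""
    groups = []
    for block in BLOCK.finditer(text):
        numbers = re.findall(NUMBER, block.group(1))
        for i in range(0, len(numbers), 6):
            chunk = numbers[i:i + 6]
            if len(chunk) >= 5:  # incomplete trailing chunks are dropped
                groups.append(chunk[:5])
    return groups

print(five_point_groups("C 1 2 3 4 5 6 7 8 9 10 11"))
```

This trades the single-regex solution for a regex plus list slicing, which is usually easier to adjust when the group size changes.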

How to capture multiple repeated groups?

With one group in the pattern, you can only get one exact result in that group. If your capture group gets repeated by the pattern (you used the + quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.

To get every value, use your language's find-all-matches function instead. For that, remove the anchors and the quantifier on the non-capturing group (you can omit the non-capturing group itself as well), so that each match yields one value.

Alternatively, expand your regex and let the pattern contain one capturing group per group you want to get in the result:

^([A-Z]+),([A-Z]+),([A-Z]+)$
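Both options can be sketched in Python. The sample line and the fixed_group_pattern helper are hypothetical, and the pattern from the question is assumed to match comma-separated runs of uppercase letters:

```python
import re

line = "AB,CD,EF"

# Option 1: no anchors, no quantified group; collect every match.
# Note that this extracts values but no longer validates the whole line.
found = re.findall(r"[A-Z]+", line)

# Option 2: build a pattern with one capturing group per expected field.
def fixed_group_pattern(n):
    return re.compile("^" + ",".join(["([A-Z]+)"] * n) + "$")

m = fixed_group_pattern(3).match(line)

print(found)
print(m.groups())
```

Option 2 only works when the number of fields is known before matching; otherwise option 1 (or a validate-then-split approach) is the practical choice.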

Regular expression to match page number groups

This worked for me - am I missing something?

Sub Pages()

    Dim re As Object, allMatches, m, rv, sep, c As Range, i As Long

    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "(\d+(-\d+)?)"
    re.ignorecase = True
    re.MultiLine = True
    re.Global = True

    For Each c In Range("B5:B20").Cells 'for example
        c.Offset(0, 1).Resize(1, 10).ClearContents 'clear output cells
        i = 0
        If re.test(c.Value) Then
            Set allMatches = re.Execute(c.Value)
            For Each m In allMatches
                i = i + 1
                c.Offset(0, i).Value = m
            Next m
        End If
    Next c

End Sub
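For comparison, the same global-match idea in Python is a single findall call. The cell text below is a made-up example. The inner group is written as non-capturing because findall returns the groups rather than the whole match when a pattern contains capturing groups:

```python
import re

cell_value = "pp. 1-3, 5, 7-10"  # hypothetical cell contents

# A page number, optionally followed by "-" and a second page number.
page_groups = re.findall(r"\d+(?:-\d+)?", cell_value)

print(page_groups)
```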
