How do I regex match with grouping with unknown number of groups
What you're looking for is a parser, instead of a regular expression match. In your case, I would consider using a very simple parser, split()
:
s = "VALUE 100 234 568 9233 119"
a = s.split()
if a[0] == "VALUE":
print [int(x) for x in a[1:]]
You can use a regular expression to see whether your input line matches your expected format (using the regex in your question), then you can run the above code without having to check for "VALUE"
and knowing that the int(x)
conversion will always succeed since you've already confirmed that the following character groups are all digits.
python: regex - catch variable number of groups
In short, it's impossible to do all of this in the re
engine. You cannot generate more groups dynamically. It will all put it in one group. You should re-parse the results like so:
import re
input_str = ("TABLE_ENTRY.0[0x1234]= <FIELD_1=0x1234, FIELD_2=0x1234, FIELD_3=0x1234>\n"
"TABLE_ENTRY.1[0x1235]= <FIELD_1=0x1235, FIELD_2=0x1235, FIELD_3=0x1235>")
results = {}
for match in re.finditer(r"([A-Z_0-9\.]+\[0x[0-9A-F]+\])=\s+<([^>]*)>", input_str):
fields = match.group(2).split(", ")
results[match.group(1)] = dict(f.split("=") for f in fields)
>>> results
{'TABLE_ENTRY.0[0x1234]': {'FIELD_2': '0x1234', 'FIELD_1': '0x1234', 'FIELD_3': '0x1234'}, 'TABLE_ENTRY.1[0x1235]': {'FIELD_2': '0x1235', 'FIELD_1': '0x1235', 'FIELD_3': '0x1235'}}
The output will just be a large dict consisting of a table entry, to a dict of it's fields.
It's also rather convinient as you may do this:
>>> results["TABLE_ENTRY.0[0x1234]"]["FIELD_2"]
'0x1234'
I personally suggest stripping off "TABLE_ENTRY" as it's repetative but as you wish.
Regular expression with variable number of groups?
According to the documentation, Java regular expressions can't do this:
The captured input associated with a
group is always the subsequence that
the group most recently matched. If a
group is evaluated a second time
because of quantification then its
previously-captured value, if any,
will be retained if the second
evaluation fails. Matching the string
"aba" against the expression (a(b)?)+,
for example, leaves group two set to
"b". All captured input is discarded
at the beginning of each match.
(emphasis added)
Regex that grabs variable number of groups
I'd do something like:
from collections import defaultdict
import re
comment_line = re.compile(r"\s*#")
matches = defaultdict(dict)
with open('path/to/file.txt') as inf:
d = {} # should catch and dispose of any matching lines
# not related to a class
for line in inf:
if comment_line.match(line):
continue # skip this line
if line.startswith('class '):
classname = line.split()[1]
d = matches[classname]
if line.startswith('model'):
d['model'] = line.split('=')[1].strip()
if line.startswith('fields'):
d['fields'] = line.split('=')[1].strip()
if line.startswith('write_once_fields'):
d['write_once_fields'] = line.split('=')[1].strip()
if line.startswith('required_fields'):
d['required_fields'] = line.split('=')[1].strip()
You could probably do this easier with regex matching.
comment_line = re.compile(r"\s*#")
class_line = re.compile(r"class (?P<classname>)")
possible_keys = ["model", "fields", "write_once_fields", "required_fields"]
data_line = re.compile(r"\s*(?P<key>" + "|".join(possible_keys) +
r")\s+=\s+(?P<value>.*)")
with open( ...
d = {} # default catcher as above
for line in ...
if comment_line.match(line):
continue
class_match = class_line.match(line)
if class_match:
d = matches[class_match.group('classname')]
continue # there won't be more than one match per line
data_match = data_line.match(line)
if data_match:
key,value = data_match.group('key'), data_match.group('value')
d[key] = value
But this might be harder to understand. YMMV.
Regular expression: matching and grouping a variable number of space separated words
re.match
returns result at the start of the string. Use re.search
instead..*?
returns the shortest match between two words/expressions (. means anything, * means 0 or more occurrences and ? means shortest match).
import re
my_str = "foo hello world baz 33"
my_pattern = r'foo\s(.*?)\sbaz'
p = re.search(my_pattern,my_str,re.I)
result = p.group(1).split()
print result
['hello', 'world']
EDIT:
In case foo or baz are missing, and you need to return the entire string, use an if-else
:
if p is not None:
result = p.group(1).split()
else:
result = my_str
Why the ?
in the pattern:
Suppose there are multiple occurrences of the word baz
:
my_str = "foo hello world baz 33 there is another baz"
using pattern = 'foo\s(.*)\sbaz'
will match(longest and greedy) :
'hello world baz 33 there is another'
whereas , using pattern = 'foo\s(.*?)\sbaz'
will return the shortest match:
'hello world'
Regex with variable number of groups in ruby or a workaround
Anyway, this one works using the \G
construct for version 1.93 on rubular.
In a single match, it grabs the first 5 pts and skips the 6th, then repeats.
(?:(?!^)\G[-,\s]|C)\s*(-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)[-,\s](-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)[-,\s](-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)[-,\s](-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)[-,\s](-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)(?:[-,\s]-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)?
Explained
(?:
(?! ^ ) # Not BOS
\G # Start where last match left off to get next 5 pts.
[-,\s] # required separator
| # or,
C # C - the start of a block of pts.
)
# The first/next 5 pts. captured
\s*
( # (1 start)
-? \d+
(?: \. \d+ )?
(?: [eE] [+-]? \d+ )?
) # (1 end)
[-,\s]
( # (2 start)
-? \d+
(?: \. \d+ )?
(?: [eE] [+-]? \d+ )?
) # (2 end)
[-,\s]
( # (3 start)
-? \d+
(?: \. \d+ )?
(?: [eE] [+-]? \d+ )?
) # (3 end)
[-,\s]
( # (4 start)
-? \d+
(?: \. \d+ )?
(?: [eE] [+-]? \d+ )?
) # (4 end)
[-,\s]
( # (5 start)
-? \d+
(?: \. \d+ )?
(?: [eE] [+-]? \d+ )?
) # (5 end)
(?: # Skip the 6th pt.
[-,\s]
-? \d+
(?: \. \d+ )?
(?: [eE] [+-]? \d+ )?
)?
How to capture multiple repeated groups?
With one group in the pattern, you can only get one exact result in that group. If your capture group gets repeated by the pattern (you used the +
quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.
You have to use your language's regex implementation functions to find all matches of a pattern, then you would have to remove the anchors and the quantifier of the non-capturing group (and you could omit the non-capturing group itself as well).
Alternatively, expand your regex and let the pattern contain one capturing group per group you want to get in the result:
^([A-Z]+),([A-Z]+),([A-Z]+)$
Regular expression to match page number groups
This worked for me - am I missing something?
Sub Pages()
Dim re As Object, allMatches, m, rv, sep, c As Range, i As Long
Set re = CreateObject("VBScript.RegExp")
re.Pattern = "(\d+(-\d+)?)"
re.ignorecase = True
re.MultiLine = True
re.Global = True
For Each c In Range("B5:B20").Cells 'for example
c.Offset(0, 1).Resize(1, 10).ClearContents 'clear output cells
i = 0
If re.test(c.Value) Then
Set allMatches = re.Execute(c.Value)
For Each m In allMatches
i = i + 1
c.Offset(0, i).Value = m
Next m
End If
Next c
End Sub
Related Topics
Java Jar File: Use Resource Errors: Uri Is Not Hierarchical
Why Does Math.Round(0.49999999999999994) Return 1
How to Create and Handle Composite Primary Key in JPA
Java - How Would I Dynamically Add Swing Component to Gui on Click
Ignore Duplicates When Producing Map Using Streams
How to Create Custom Exceptions in Java
How to Perform String Diffs in Java
Does Java Evaluate Remaining Conditions After Boolean Result Is Known
Hibernate, @Sequencegenerator and Allocationsize
Log4J Redirect Stdout to Dailyrollingfileappender
Difference Between Jar and War in Java
Does Setting Java Objects to Null Do Anything Anymore
Why Not to Start a Thread in the Constructor? How to Terminate
How to Run Linux Commands in Java
How to Set an Image as a Background for Frame in Swing Gui of Java