Capture repeated groups in Python regex
If you cannot use the PyPI regex library, you will have to do that in two steps: 1) grab the lines with sm-mta and 2) grab the values you need, with something like:
import re
txt="""Aug 15 00:01:06 **** sm-mta*** to=<user1@gmail.com>,<user2@yahoo.com>,user3@aol.com, some_more_stuff
Aug 16 13:16:09 **** sendmail*** to=<user4@yahoo.com>, some_more_stuff
Aug 17 11:14:48 **** sm-mta*** to=<user5@gmail.com>,<user6@gmail.com>, some_more_stuff"""
rx = r'@([^\s>,]+)'
filtered_lines = [x for x in txt.split('\n') if 'sm-mta' in x]
print(re.findall(rx, " ".join(filtered_lines)))
See the Python demo online. The @([^\s>,]+) pattern will match @ and will capture and return any 1+ chars other than whitespace, > and ,.
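If the values should stay grouped per log line instead of being flattened into one list, the same two steps can be wrapped in a small helper. This is only a minimal sketch: the names domains_per_line and marker are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as above: "@" followed by 1+ chars other than whitespace, ">" and ","
DOMAIN_RX = re.compile(r'@([^\s>,]+)')

def domains_per_line(text, marker='sm-mta'):
    """Return one list of captured values per line that contains `marker`.

    `domains_per_line` and `marker` are hypothetical names used for illustration.
    """
    return [
        DOMAIN_RX.findall(line)   # step 2: grab the values from the line
        for line in text.splitlines()
        if marker in line         # step 1: keep only the lines with the marker
    ]

txt = """Aug 15 00:01:06 **** sm-mta*** to=<user1@gmail.com>,<user2@yahoo.com>,user3@aol.com, some_more_stuff
Aug 16 13:16:09 **** sendmail*** to=<user4@yahoo.com>, some_more_stuff
Aug 17 11:14:48 **** sm-mta*** to=<user5@gmail.com>,<user6@gmail.com>, some_more_stuff"""

print(domains_per_line(txt))
# [['gmail.com', 'yahoo.com', 'aol.com'], ['gmail.com', 'gmail.com']]
```

Filtering before matching keeps the pattern itself simple, at the cost of a second pass over the text.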
If you can use the PyPI regex library, you may get the list of the strings you need with
>>> import regex
>>> x="""Aug 15 00:01:06 **** sm-mta*** to=<user1@gmail.com>,<user2@yahoo.com>,user3@aol.com, some_more_stuff
Aug 16 13:16:09 **** sendmail*** to=<user4@yahoo.com>, some_more_stuff
Aug 17 11:14:48 **** sm-mta*** to=<user5@gmail.com>,<user6@gmail.com>, some_more_stuff"""
>>> rx = r'(?:^(?=.*sm-mta)|\G(?!^)).*?@\K[^\s>,]+'
>>> print(regex.findall(rx, x, regex.M))
['gmail.com', 'yahoo.com', 'aol.com', 'gmail.com', 'gmail.com']
See the Python online demo and a regex demo.
Pattern details
(?:^(?=.*sm-mta)|\G(?!^)) - the start of a line that has the sm-mta substring after any 0+ chars other than line break chars, or the place where the previous match ended
.*?@ - any 0+ chars other than line break chars, as few as possible, up to the @ and the @ itself
\K - a match reset operator that discards all the text matched so far in the current iteration
[^\s>,]+ - 1 or more chars other than whitespace, , and >
Capturing repeating sub-patterns with permutations in Python regex
You can use /([a-z]+|\d+|_)/i to chunk the string into groups of digits, alphabetical characters or single underscores:
>>> re.findall(r"([a-z]+|\d+|_)", "ABC_123_DEF_456", re.I)
['ABC', '_', '123', '_', 'DEF', '_', '456']
>>> re.findall(r"([a-z]+|\d+|_)", "ABC123__", re.I)
['ABC', '123', '_', '_']
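If you also need to know which alternative produced each chunk, one option is to name the alternatives and read the match's lastgroup attribute. The sketch below is an illustration only: the group names letters, digits and underscore and the helper labelled_chunks are assumptions, not part of the original answer.

```python
import re

# Same three alternatives as above, each wrapped in a (hypothetical) named group
TOKEN_RX = re.compile(
    r'(?P<letters>[a-z]+)|(?P<digits>\d+)|(?P<underscore>_)',
    re.I,
)

def labelled_chunks(s):
    """Return (kind, text) pairs; kind is the name of the alternative that matched."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RX.finditer(s)]

print(labelled_chunks("ABC_123"))
# [('letters', 'ABC'), ('underscore', '_'), ('digits', '123')]
```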
Regex to match all repeating alphanumerical subpatterns
This lookahead-based regex may not return exactly what you show in the question, but it gets very close:
r'(?=(.+)\1\1)'
RegEx Demo
Code:
>>> reg = re.compile(r'(?=(.+)\1\1)')
>>> reg.findall('aaabbbxxx_aaabbbxxx_aaabbbxxx_')
['aaabbbxxx_', 'b', 'x', 'a', 'b', 'x', 'a', 'b', 'x']
>>> reg.findall('lalala luuluuluul')
['la', 'luu', 'uul']
RegEx Details:
Since the whole regex is a lookahead, it does not consume any characters: a lookahead is a zero-width match. This allows us to return overlapping matches from the input.
Using findall we only return the capture group in our regex.
(?= : Start lookahead
(.+) : Match 1 or more of any character (greedy) and capture in group #1
\1\1 : Match 2 occurrences of group #1 using the back-reference \1\1
) : End lookahead
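Because every match is zero-width, it can help to see where each one starts. The sketch below uses finditer with the same pattern to pair each captured unit with its start index; the helper name tripled_units is hypothetical.

```python
import re

reg = re.compile(r'(?=(.+)\1\1)')

def tripled_units(s):
    """Return (start_index, unit) for each position where a unit occurs 3 times in a row."""
    return [(m.start(), m.group(1)) for m in reg.finditer(s)]

print(tripled_units('lalala luuluuluul'))
# [(0, 'la'), (7, 'luu'), (8, 'uul')]
```

The matches at indexes 7 and 8 overlap in the input, which is only possible because the lookahead consumes nothing.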
REGEX Capturing differing sets of repeating groups
You may use
(?:\G(?!^)(?(?=\d+(?:aa|bb))(?<!\dcc))|(?=(?:\d+(?:aa|bb))+(?:\d+cc)+))(\d+)(aa|bb|cc)
See the regex demo
The regex will only match a string that meets the pattern in the (?=(?:\d+(?:aa|bb))+(?:\d+cc)+) lookahead, and then will consecutively match and capture digits and aa, bb or cc, but digits + aa or bb will only be matched if digits + cc does not immediately precede them.
Details
(?:\G(?!^)(?(?=\d+(?:aa|bb))(?<!\dcc))|(?=(?:\d+(?:aa|bb))+(?:\d+cc)+)) - either of the two alternatives:
    \G(?!^) - end of the previous successful match
    (?(?=\d+(?:aa|bb))(?<!\dcc)) - if-then-else construct: if there are 1+ digits and aa or bb immediately to the right of the current location ((?=\d+(?:aa|bb))), then only continue matching if there is no digit followed with cc immediately to the left of the current location ((?<!\dcc))
    | - or
    (?=(?:\d+(?:aa|bb))+(?:\d+cc)+) - a positive lookahead that, immediately to the right of the current location, searches for the following (and returns true if it finds the patterns, or false if it does not):
        (?:\d+(?:aa|bb))+ - one or more occurrences of 1+ digits followed with aa or bb
        (?:\d+cc)+ - one or more occurrences of 1+ digits followed with cc
(\d+) - Group 1: one or more digits
(aa|bb|cc) - Group 2: aa, bb or cc.
For the second pattern, swap cc and (?:aa|bb):
(?:\G(?!^)(?(?=\d+cc)(?<!\d(?:aa|bb)))|(?=(?:\d+cc)+(?:\d+(?:aa|bb))+))(\d+)(aa|bb|cc)
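Python's built-in re module supports neither \G nor lookaround conditions inside (?(...)...), so the patterns above need an engine that does. With re alone, a rough two-step approximation of the first pattern is to check the overall shape first and then extract the pairs. This is a sketch under a stated assumption, not an equivalent: it requires the whole string to have the expected shape, whereas the single regex can also match a qualifying stretch inside a longer string. The names SHAPE_RX, PAIR_RX and split_pairs are hypothetical.

```python
import re

# Shape check: 1+ "digits + aa/bb" items followed by 1+ "digits + cc" items
SHAPE_RX = re.compile(r'(?:\d+(?:aa|bb))+(?:\d+cc)+')
# Extraction: Group 1 is the digits, Group 2 is aa, bb or cc
PAIR_RX = re.compile(r'(\d+)(aa|bb|cc)')

def split_pairs(s):
    """Return [(digits, suffix), ...] if the whole string has the expected shape, else []."""
    if SHAPE_RX.fullmatch(s) is None:
        return []
    return PAIR_RX.findall(s)

print(split_pairs('1aa22bb3cc44cc'))
# [('1', 'aa'), ('22', 'bb'), ('3', 'cc'), ('44', 'cc')]
print(split_pairs('1cc2aa'))
# []
```

For the second pattern, the same sketch would apply with the two halves of SHAPE_RX swapped.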
How to capture multiple repeated groups?
With one group in the pattern, you can only get one exact result in that group. If your capture group gets repeated by the pattern (you used the + quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.
To get every value, use the function of your language's regex implementation that finds all matches of a pattern. For that, you would have to remove the anchors and the quantifier of the non-capturing group (and you could omit the non-capturing group itself as well).
Alternatively, expand your regex and let the pattern contain one capturing group per group you want to get in the result:
^([A-Z]+),([A-Z]+),([A-Z]+)$
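In Python, the three behaviours described above can be sketched as follows. The input string and the repeated-group pattern are assumptions made for illustration, since the original pattern from the question is not shown here.

```python
import re

s = "ABC,DEF,GHI"  # hypothetical input

# A capturing group repeated by a quantifier keeps only its last value
repeated = re.match(r'^(?:([A-Z]+),?)+$', s)
print(repeated.group(1))   # GHI

# Without the anchors and the quantified group, findall returns every value
all_values = re.findall(r'[A-Z]+', s)
print(all_values)          # ['ABC', 'DEF', 'GHI']

# One capturing group per expected value
expanded = re.match(r'^([A-Z]+),([A-Z]+),([A-Z]+)$', s)
print(expanded.groups())   # ('ABC', 'DEF', 'GHI')
```

The expanded form only works when the number of values is known in advance; findall handles any count.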