Capturing Repeating Subpatterns in Python Regex

Capture repeated groups in Python regex

If you cannot use the PyPI regex library, you will have to do this in two steps: 1) keep only the lines that contain sm-mta and 2) extract the values you need, with something like

import re

txt="""Aug 15 00:01:06 **** sm-mta*** to=<user1@gmail.com>,<user2@yahoo.com>,user3@aol.com, some_more_stuff
Aug 16 13:16:09 **** sendmail*** to=<user4@yahoo.com>, some_more_stuff
Aug 17 11:14:48 **** sm-mta*** to=<user5@gmail.com>,<user6@gmail.com>, some_more_stuff"""
rx = r'@([^\s>,]+)'
filtered_lines = [x for x in txt.split('\n') if 'sm-mta' in x]
print(re.findall(rx, " ".join(filtered_lines)))

The @([^\s>,]+) pattern matches @ and then captures one or more chars other than whitespace, > and ,. Because the pattern has a single capturing group, re.findall returns only the captured values.
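
If you also need to know which line each value came from, the same two-step idea can be applied per line instead of joining the filtered lines. This is only a sketch, assuming the same log layout as above; the helper name domains_per_line and its marker parameter are made up for illustration:

```python
import re

DOMAIN_RX = re.compile(r'@([^\s>,]+)')

def domains_per_line(text, marker='sm-mta'):
    """Return one list of captured domains per line that contains `marker`."""
    return [
        DOMAIN_RX.findall(line)      # step 2: extract the values
        for line in text.splitlines()
        if marker in line            # step 1: keep only the relevant lines
    ]

log = """Aug 15 00:01:06 **** sm-mta*** to=<user1@gmail.com>,<user2@yahoo.com>,user3@aol.com, some_more_stuff
Aug 16 13:16:09 **** sendmail*** to=<user4@yahoo.com>, some_more_stuff
Aug 17 11:14:48 **** sm-mta*** to=<user5@gmail.com>,<user6@gmail.com>, some_more_stuff"""

print(domains_per_line(log))
# [['gmail.com', 'yahoo.com', 'aol.com'], ['gmail.com', 'gmail.com']]
```

Each inner list corresponds to one sm-mta line, in the order the lines appear.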

If you can use the PyPI regex library, you can get the list of strings you need in a single pass with

>>> import regex
>>> x="""Aug 15 00:01:06 **** sm-mta*** to=<user1@gmail.com>,<user2@yahoo.com>,user3@aol.com, some_more_stuff
Aug 16 13:16:09 **** sendmail*** to=<user4@yahoo.com>, some_more_stuff
Aug 17 11:14:48 **** sm-mta*** to=<user5@gmail.com>,<user6@gmail.com>, some_more_stuff"""
>>> rx = r'(?:^(?=.*sm-mta)|\G(?!^)).*?@\K[^\s>,]+'
>>> print(regex.findall(rx, x, regex.M))
['gmail.com', 'yahoo.com', 'aol.com', 'gmail.com', 'gmail.com']

Pattern details

  • (?:^(?=.*sm-mta)|\G(?!^)) - either the start of a line that contains the sm-mta substring after any 0+ chars other than line break chars, or the position where the previous match ended, provided it is not the start of a line
  • .*?@ - any 0+ chars other than line break chars, as few as possible, up to and including the first @
  • \K - a match reset operator that discards all the text matched so far in the current iteration
  • [^\s>,]+ - 1 or more chars other than whitespace, , and >

Capturing repeating sub-patterns with permutations in Python regex

You can use /([a-z]+|\d+|_)/i to chunk the string into groups of digits, alphabetical characters or single underscores:

>>> re.findall(r"([a-z]+|\d+|_)", "ABC_123_DEF_456", re.I)
['ABC', '_', '123', '_', 'DEF', '_', '456']
>>> re.findall(r"([a-z]+|\d+|_)", "ABC123__", re.I)
['ABC', '123', '_', '_']
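
When the chunks can appear in any order, it can help to know which alternative produced each chunk. One possible sketch uses named groups together with Match.lastgroup; the group names alpha, digits and underscore and the helper tokenize are illustrative choices, not part of the answer above:

```python
import re

# Same three alternatives as above, each wrapped in a named group.
TOKEN_RX = re.compile(r'(?P<alpha>[a-z]+)|(?P<digits>\d+)|(?P<underscore>_)', re.I)

def tokenize(s):
    """Return (kind, text) pairs for every chunk found in s."""
    # Only one named group takes part in each match, so lastgroup names it.
    return [(m.lastgroup, m.group()) for m in TOKEN_RX.finditer(s)]

print(tokenize("ABC_123"))
# [('alpha', 'ABC'), ('underscore', '_'), ('digits', '123')]
```

Characters that fit none of the alternatives are skipped silently, just as with the findall calls above.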

Regex to match all repeating alphanumerical subpatterns

With this lookahead-based regex you may not get exactly the output shown in the question, but you will get very close:

r'(?=(.+)\1\1)'

Code:

>>> reg = re.compile(r'(?=(.+)\1\1)')
>>> reg.findall('aaabbbxxx_aaabbbxxx_aaabbbxxx_')
['aaabbbxxx_', 'b', 'x', 'a', 'b', 'x', 'a', 'b', 'x']
>>> reg.findall('lalala luuluuluul')
['la', 'luu', 'uul']

RegEx Details:

Since the whole regex is a lookahead, and a lookahead is a zero-width assertion, the match itself consumes no characters. This allows overlapping matches to be returned from the input.

Because the pattern contains a single capture group, findall returns only the text captured by that group.

  • (?=: Start lookahead
    • (.+): Match 1 or more of any character (greedy) and capture in group #1
    • \1\1: Match two more consecutive occurrences of the text captured in group #1, using the back-reference \1 twice
  • ): End lookahead
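
To see where each overlapping match starts, finditer can be used in place of findall. A small sketch; the helper name repeated_units is invented for this example:

```python
import re

TRIPLE_RX = re.compile(r'(?=(.+)\1\1)')

def repeated_units(s):
    """Return (start_index, unit) wherever unit occurs three times in a row."""
    # Every match is empty (zero-width), so m.start() is the position
    # at which the lookahead succeeded.
    return [(m.start(), m.group(1)) for m in TRIPLE_RX.finditer(s)]

print(repeated_units('lalala luuluuluul'))
# [(0, 'la'), (7, 'luu'), (8, 'uul')]
```

The last two results start one character apart, which shows the overlap that a consuming pattern could not report.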

REGEX Capturing differing sets of repeating groups

You may use

(?:\G(?!^)(?(?=\d+(?:aa|bb))(?<!\dcc))|(?=(?:\d+(?:aa|bb))+(?:\d+cc)+))(\d+)(aa|bb|cc)

The regex only starts matching at a position where the (?=(?:\d+(?:aa|bb))+(?:\d+cc)+) lookahead succeeds, and then consecutively matches and captures digits followed by aa, bb or cc. A digits + aa or bb pair is matched only if it is not immediately preceded by a digits + cc pair.

Details

  • (?:\G(?!^)(?(?=\d+(?:aa|bb))(?<!\dcc))|(?=(?:\d+(?:aa|bb))+(?:\d+cc)+)) - either of the two alternatives:

    • \G(?!^) - end of the previous successful match
    • (?(?=\d+(?:aa|bb))(?<!\dcc)) - a conditional construct: if there are 1+ digits and aa or bb immediately to the right of the current location ((?=\d+(?:aa|bb))), then only continue matching if there is no digit followed with cc immediately to the left of the current location ((?<!\dcc))
    • | - or
    • ^ - start of string (only needed if the match has to begin there; the pattern shown above does not include it)
    • (?=(?:\d+(?:aa|bb))+(?:\d+cc)+) - a positive lookahead that, immediately to the right of the current location, searches for the following (and returns true if it finds the patterns, or false if it does not):

      • (?:\d+(?:aa|bb))+ - one or more occurrences of 1+ digits followed with aa or bb
      • (?:\d+cc)+ - one or more occurrences of 1+ digits followed with cc
  • (\d+) - Group 1: one or more digits
  • (aa|bb|cc) - aa, bb or cc.

For the second pattern, swap cc and (?:aa|bb):

(?:\G(?!^)(?(?=\d+cc)(?<!\d(?:aa|bb)))|(?=(?:\d+cc)+(?:\d+(?:aa|bb))+))(\d+)(aa|bb|cc)
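
Python's built-in re module supports neither \G nor lookaround conditions inside (?(...)...), so the patterns above need an engine that does, such as PCRE. A rough two-step alternative for re is sketched below under a simplifying assumption: the whole input string has to follow the aa/bb-then-cc layout, whereas the patterns above can also find such runs inside longer text. The sample inputs and the name extract_pairs are invented for illustration:

```python
import re

# Step 1: the overall shape, one or more aa/bb pairs followed by one or more cc pairs.
SHAPE_RX = re.compile(r'(?:\d+(?:aa|bb))+(?:\d+cc)+')
# Step 2: a single digits + suffix pair.
PAIR_RX = re.compile(r'(\d+)(aa|bb|cc)')

def extract_pairs(s):
    """Return (digits, suffix) tuples, or [] if s does not have the expected shape."""
    if SHAPE_RX.fullmatch(s) is None:
        return []
    return PAIR_RX.findall(s)

print(extract_pairs('1aa22bb333cc'))  # [('1', 'aa'), ('22', 'bb'), ('333', 'cc')]
print(extract_pairs('1cc2aa'))        # []
```

For the second layout (cc pairs first), the two halves of SHAPE_RX would be swapped in the same way as in the pattern above.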

How to capture multiple repeated groups?

With one capturing group in the pattern, you can only get one value in that group. If the capture group is repeated by the pattern (here, by the + quantifier on the surrounding non-capturing group), only the last value it matched is stored.

To get every value, use your language's function for finding all matches of a pattern. For that to work, remove the anchors and the quantifier on the non-capturing group (you can omit the non-capturing group itself as well).

Alternatively, expand your regex and let the pattern contain one capturing group per group you want to get in the result:

^([A-Z]+),([A-Z]+),([A-Z]+)$
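
The three behaviours can be compared side by side in Python. This is a sketch: the sample string and the first pattern are hypothetical stand-ins, since the pattern from the question is not shown here:

```python
import re

s = "AAA,BBB,CCC"

# A repeated capturing group keeps only the last value it matched.
# (Hypothetical pattern standing in for the one from the question.)
last_only = re.match(r'^(?:([A-Z]+),?)+$', s).group(1)
print(last_only)   # CCC

# Option 1: no anchors, no repetition; let findall collect every match.
all_values = re.findall(r'[A-Z]+', s)
print(all_values)  # ['AAA', 'BBB', 'CCC']

# Option 2: one capturing group per expected value.
fixed = re.match(r'^([A-Z]+),([A-Z]+),([A-Z]+)$', s).groups()
print(fixed)       # ('AAA', 'BBB', 'CCC')
```

Option 1 handles any number of values, while option 2 only fits input with exactly three of them.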

