How to Find All Matches to a Regular Expression in Python

How can I find all matches to a regular expression in Python?

Use re.findall or re.finditer instead of re.search, which returns only the first match.

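re.findall(pattern, string) returns a list of matching strings. If the pattern contains a capturing group, the list holds the group contents rather than the whole match.

re.finditer(pattern, string) returns an iterator over match objects.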

Example:

import re

text = 'all cats are smarter than dogs, all dogs are dumber than cats'

re.findall(r'all (.*?) are', text)
# Output: ['cats', 'dogs']

[x.group() for x in re.finditer(r'all (.*?) are', text)]
# Output: ['all cats are', 'all dogs are']
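When both the whole match and the group contents are needed, the match objects from re.finditer expose them side by side. A minimal sketch, reusing the sentence from the example above (the variable names are illustrative only):

```python
import re

text = 'all cats are smarter than dogs, all dogs are dumber than cats'

# Each match object gives the whole match, the captured group, and its span
rows = [(m.group(0), m.group(1), m.span())
        for m in re.finditer(r'all (.*?) are', text)]

for whole, animal, span in rows:
    print(whole, animal, span)
```

Here group(0) corresponds to what x.group() returns, and group(1) corresponds to what re.findall returns for this pattern.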

Python regex find all matches

Your regex is slightly overcomplicated: it uses ., which matches any character, where it is not needed, and it wraps the number in a capturing group () without really using it.

With your pattern you are looking for a number in scientific notation which has to be BOTH preceded and followed by exactly one character.

{8.25e+07|8.26206e+07}
[--------]

re.findall scans your string from the beginning and finds your pattern in the span marked above. Because of the capturing group (..), it drops the { and the | and saves only the number as a match. It then continues, but only 8.26206e+07} is left. That no longer satisfies your pattern, because there is no character left for the first . to match, so no further match is found. Note that findall only looks for non-overlapping matches [1].

To illustrate, change your input string by duplicating your separator |:

>>> import re
>>> p = r".([0-9]+\.[0-9]+[eE][-+]?[0-9]+)."
>>> s = "{8.25e+07||8.26206e+07}"
>>> print(re.findall(p, s))
['8.25e+07', '8.26206e+07']

To satisfy your two .s you need two separators between any two numbers.

I would change two things in your pattern: (1) remove the .s and (2) remove the capturing group ( ), since you have no need for it:

p = r"[0-9]+\.[0-9]+[eE][-+]?[0-9]+"

Capturing groups can be very useful if you need to refer to specific captured groups again later, but your task at hand has no need for them.
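As a quick check, the simplified pattern can be run against the original single-separator string. This is only a sketch: the second pattern, with a non-capturing group (?:...), and its input string are not part of the answer above and are shown as an option for when grouping is needed without changing what findall returns.

```python
import re

s = "{8.25e+07|8.26206e+07}"

# No surrounding dots and no capturing group: findall returns whole matches
p = r"[0-9]+\.[0-9]+[eE][-+]?[0-9]+"
print(re.findall(p, s))  # ['8.25e+07', '8.26206e+07']

# Illustrative variant: the exponent is made optional with a non-capturing group
p_optional_exp = r"[0-9]+\.[0-9]+(?:[eE][-+]?[0-9]+)?"
print(re.findall(p_optional_exp, "{8.25e+07|3.14}"))  # ['8.25e+07', '3.14']
```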

[1] https://docs.python.org/2/library/re.html?highlight=findall#re.findall

finding all matches in a string using regex

You can try (?=(aa)).

The trick is to use a positive lookahead, which does not consume any of the string. That way the engine starts the next match attempt at the next position in the string, not after the last matched text.

You will get 3 matches, and each will have aa in the first capturing group.
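A minimal sketch of the lookahead approach. The input string 'aaaa' is an assumption, chosen so that three overlapping matches exist; the question's actual input is not shown here.

```python
import re

s = "aaaa"  # assumed input

# A plain pattern consumes what it matches, so matches cannot overlap
print(re.findall(r"aa", s))        # ['aa', 'aa']

# The lookahead consumes nothing; group 1 holds the text seen at each position
print(re.findall(r"(?=(aa))", s))  # ['aa', 'aa', 'aa']

# finditer also gives the start position of each overlapping match
starts = [m.start() for m in re.finditer(r"(?=(aa))", s)]
print(starts)  # [0, 1, 2]
```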


Regular expression to return all match occurrences

The issue is with the regular expression itself.
The (.*) blocks accept more of the string than you realize: .* is greedy, so it consumes as much of the string as it can while still allowing the overall pattern to match. This is why you only see one result.

I suggest matching something like Vacation Allowance:\s*\d+; or similar.

import re

text = '02/05/2020 Vacation Allowance: 21; 02/05/2020 Vacation Allowance: 22; nnn'
# Raw string, so that \s and \d reach the regex engine unchanged
m = re.findall(r'Vacation Allowance:\s*(\d*);', text, re.M)
print(m)

result: ['21', '22']
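To see the greedy behaviour described above, compare a greedy and a non-greedy prefix on the same text. The two patterns below are hypothetical stand-ins, since the question's original pattern is not reproduced here.

```python
import re

text = '02/05/2020 Vacation Allowance: 21; 02/05/2020 Vacation Allowance: 22; nnn'

# Greedy: (.*) runs to the end and backtracks only as far as the LAST
# "Vacation Allowance", so the whole text yields a single match
greedy = re.findall(r'(.*)Vacation Allowance:\s*(\d+);', text)
print(len(greedy), greedy[0][1])  # 1 22

# Non-greedy: (.*?) stops at the FIRST place the rest of the pattern can match
lazy = re.findall(r'(.*?)Vacation Allowance:\s*(\d+);', text)
print([number for _, number in lazy])  # ['21', '22']
```

Dropping the leading (.*) altogether, as suggested above, avoids the problem without relying on non-greedy matching.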

Python regex get all matches all with findall

Here is one approach:

>>> import re
>>> regex = re.compile(r"(?<=\[)([0-9]){1}?(?=\])")
>>> string = 'start asf[2]+asdfsa[0]+fsad[1]'
>>> re.findall(regex, string)
['2', '0', '1']

DEMO

>>> import re
>>> def get_all_integers_between_square_brackets(*, regex, string):
...     return map(int, re.findall(regex, string))
...
>>> regex = re.compile(r"(?<=\[)([0-9]){1}?(?=\])")
>>> integers = get_all_integers_between_square_brackets(
...     regex=regex,
...     string='start asf[2]+asdfsa[0]+fsad[1]'
... )
>>> list(integers)
[2, 0, 1]

>>> integers = get_all_integers_between_square_brackets(
...     regex=regex,
...     string='start asf[hello]+asdfsa[world]+fsad[1][2][]')
>>> list(integers)
[1, 2]
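Note that the pattern above accepts exactly one digit between the brackets, so an entry such as [12] would be skipped. If multi-digit values can occur (an assumption about the input, not something the answer states), a sketch with \d+ and a plain capturing group could look like this:

```python
import re

# Hypothetical variant: one or more digits, with the brackets matched literally
regex = re.compile(r"\[(\d+)\]")

string = 'start asf[12]+asdfsa[0]+fsad[hello][345][]'
integers = [int(digits) for digits in regex.findall(string)]
print(integers)  # [12, 0, 345]
```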

python) find all matches using regex (changed to re.findall from re.search)

Change this:

matchexh = re.search(r'Exhibit (\d+).(\d+)',text1).group().strip()

to:

matchexh = re.findall(r'Exhibit (\d+).(\d+)',text1)
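Keep in mind that with two capturing groups, re.findall returns a list of tuples rather than a list of strings, so code that previously called .group().strip() on the result needs adjusting. A minimal sketch: the sample text is hypothetical, and the dot is escaped as \. on the assumption that a literal period is intended.

```python
import re

# Hypothetical sample text
text1 = "See Exhibit 10.1 and Exhibit 99.2 for details."

# One tuple per match, one element per capturing group
pairs = re.findall(r'Exhibit (\d+)\.(\d+)', text1)
print(pairs)  # [('10', '1'), ('99', '2')]

# Rebuild the full labels from the captured parts
labels = [f"Exhibit {major}.{minor}" for major, minor in pairs]
print(labels)  # ['Exhibit 10.1', 'Exhibit 99.2']
```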

Python - Using regex to find multiple matches and print them out

Do not use regular expressions to parse HTML.

But if you ever need to find all regexp matches in a string, use the findall function.

import re
line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
matches = re.findall('<form>(.*?)</form>', line, re.DOTALL)
print(matches)

# Output: ['Form 1', 'Form 2']
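For real HTML, a parser is usually the safer choice. The sketch below uses the standard library's html.parser module on the same line; the class name FormTextCollector is made up for this illustration, and nested forms or attributes are not handled.

```python
from html.parser import HTMLParser


class FormTextCollector(HTMLParser):
    """Collect the text found inside each <form> element."""

    def __init__(self):
        super().__init__()
        self._in_form = False
        self.forms = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._in_form = True
            self.forms.append("")  # start a new entry for this form

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False

    def handle_data(self, data):
        if self._in_form:
            self.forms[-1] += data


line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
collector = FormTextCollector()
collector.feed(line)
collector.close()
print(collector.forms)  # ['Form 1', 'Form 2']
```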

How to find all matches with a regex where part of the match overlaps

The (\w+\.\s){1,2} pattern contains a repeated capturing group. Python's re does not store every capture it finds; it only keeps the last one in the group. In any case, you do not need the repeated capturing group here, because you want to extract multiple occurrences of the pattern from a string, and re.finditer or re.findall will do that for you.

Also, the re.MULTILINE flag is not necessary here since there are no ^ or $ anchors in the pattern.
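The behaviour of the repeated capturing group can be seen in a short sketch that uses the pattern from the question on the same test string:

```python
import re

test_str = 'ali. veli. ahmet.'

m = re.search(r'(\w+\.\s){1,2}', test_str)
print(repr(m.group(0)))  # 'ali. veli. '  (the whole match spans both repetitions)
print(repr(m.group(1)))  # 'veli. '       (only the last repetition is kept)
```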

You may get the expected results using

import re
test_str = 'ali. veli. ahmet.'
src = re.findall(r'(?=\b(\w+\.\s+\w+))', test_str)
print(src)
# => ['ali. veli', 'veli. ahmet']


The pattern means:

  • (?= - start of a positive lookahead
    • \b - a word boundary (crucial here, it is necessary to only start capturing at word boundaries)
    • (\w+\.\s+\w+) - Capturing group 1: 1+ word chars, ., 1+ whitespaces and 1+ word chars
  • ) - end of the lookahead.

Python Regex - How to Get Positions and Values of Matches

import re

p = re.compile("[a-z]")
for m in p.finditer('a1b2c3d4'):
    # m.start() is the position of the match, m.group() the matched text
    print(m.start(), m.group())

