Re.Findall Not Returning Full Match

re.findall not returning full match?

The problem is that when the regex passed to re.findall contains capturing groups (i.e. portions of the regex enclosed in parentheses), it is the text matched by those groups that is returned, rather than the whole matched string.

One way to solve this issue is to use non-capturing groups (prefixed with ?:).

>>> import re
>>> s = 'size=50;size=51;'
>>> re.findall('size=(?:50|51);', s)
['size=50;', 'size=51;']

If the regex that re.findall tries to match does not capture anything, it returns the whole of the matched string.

Although using character classes might be the simplest option in this particular case, non-capturing groups provide a more general solution.
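For comparison, here is a small sketch of the three variants on the same string: a capturing group, the non-capturing group from above, and a character class. The pattern size=5[01]; is only my illustration of the character-class option and is not taken from the question.

```python
import re

s = 'size=50;size=51;'

# Capturing group: findall returns only what the group matched.
captured = re.findall(r'size=(50|51);', s)

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r'size=(?:50|51);', s)

# Character class (illustrative alternative for this specific input).
char_class = re.findall(r'size=5[01];', s)

print(captured)       # ['50', '51']
print(non_capturing)  # ['size=50;', 'size=51;']
print(char_class)     # ['size=50;', 'size=51;']
```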

Why does findall not return the whole match when matching with a group?

You should use re.finditer instead of re.findall and then print the whole match, which m.group() returns:

>>> for m in re.finditer('(ra|RA)[a-zA-Z0-9]*',"RAJA45909"):
...     print(m.group())
...
RAJA45909

The documentation of findall says:

If one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group.

Your regex has only one group, so the result is a list of the texts matched by that single group. If we add another group, you see:

>>> for m in re.findall('(ra|RA)([a-zA-Z0-9]*)',"RAJA45909"):
...     print(m)
...
('RA', 'JA45909')

So when findall is used with groups, it matches the whole regex but returns only the portions matched by the groups, while finditer always returns complete match objects.
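If you want a plain list of full matches rather than a loop, one possible approach is a list comprehension over finditer. This is just a sketch using the pattern and string from this answer; the variable names are mine.

```python
import re

pattern = r'(ra|RA)[a-zA-Z0-9]*'
text = "RAJA45909"

# findall returns only the text captured by the single group.
groups_only = re.findall(pattern, text)

# finditer yields match objects; group() with no argument is the whole match.
full_matches = [m.group() for m in re.finditer(pattern, text)]

print(groups_only)   # ['RA']
print(full_matches)  # ['RAJA45909']
```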

python regex - findall not returning output as expected

When you use parentheses in your regex, re.findall() will return only the parenthesized groups, not the entire matched string. Put a ?: after the ( to tell it not to use the parentheses to extract a group, and then the results should be the entire matched string.
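The effect is easy to miss when the group is optional. As a rough illustration (the number pattern and the sample text below are made up for this example, not taken from the question):

```python
import re

text = '3.5 and 7'

# Capturing group: only the optional fractional part is returned,
# and it is an empty string when the group did not take part in the match.
with_group = re.findall(r'\d+(\.\d+)?', text)

# Non-capturing group: the whole number is returned.
without_group = re.findall(r'\d+(?:\.\d+)?', text)

print(with_group)     # ['.5', '']
print(without_group)  # ['3.5', '7']
```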

re.findall not returning correct results

As per re.findall documentation:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

So, turn all capturing groups into non-capturing or remove them if possible (here, it is best to remove them as they are just redundant):

macs = re.findall(r"[0-9A-Fa-f]{4}\.[0-9A-Fa-f]{4}\.[0-9A-Fa-f]{4}", test)
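To see the difference, here is a sketch with a made-up test string. The grouped pattern is a hypothetical stand-in for a pattern with a redundant group, since the pattern from the question is not shown here.

```python
import re

# Hypothetical sample input; the original `test` value is not shown.
test = "mac: aabb.ccdd.eeff, mac: 0011.2233.4455"

# Hypothetical pattern with a repeated capturing group:
# findall returns only the last repetition of the group.
grouped = re.findall(r"([0-9A-Fa-f]{4}\.){2}[0-9A-Fa-f]{4}", test)

# Pattern without groups: findall returns the whole match.
macs = re.findall(r"[0-9A-Fa-f]{4}\.[0-9A-Fa-f]{4}\.[0-9A-Fa-f]{4}", test)

print(grouped)  # ['ccdd.', '2233.']
print(macs)     # ['aabb.ccdd.eeff', '0011.2233.4455']
```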

Python regex findall function only returning matchings on groups instead of full string

Use '(b(.a)*)' as your regex pattern instead. Each result is then a tuple whose first element is the full match, so in the following example you need result[0][0].

import re

result = re.findall('(b(.a)*)', 'bcacaca')
result

Output:

[('bcacaca', 'ca')]
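If there can be several matches, one way to keep only the full match from each tuple is a list comprehension. A minimal sketch (the name full_matches is mine):

```python
import re

result = re.findall('(b(.a)*)', 'bcacaca')

# The outer group wraps the whole pattern, so element 0 of each tuple
# is the full match; element 1 is the last repetition of the inner group.
full_matches = [groups[0] for groups in result]

print(result)        # [('bcacaca', 'ca')]
print(full_matches)  # ['bcacaca']
```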

A Better Option - Using a Non-capturing Group

As @Nick mentioned, a non-capturing group could be used here as follows. Consider the following scenario. For step-by-step explanation see the next section. Also, I encourage you to use this resource: regex101.com.

## Define text and pattern
text = 'bcacaca dcaca dbcaca'
pattern = 'b?(?:.a)*'

## Evaluate regex
result = re.findall(pattern, text)
# output
# ['bcacaca', '', '', 'caca', '', '', 'bcaca', '']

## Drop empty strings from result
result = list(filter(None, result))
# output
# ['bcacaca', 'caca', 'bcaca']

Explanation for Using a Non-capturing Group
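One way to read the pattern piece by piece is to write it with re.VERBOSE, which ignores whitespace and allows comments inside the regex. This is only a sketch of the same pattern, b?(?:.a)*, applied to the sample text from above. Because every part of the pattern is optional, it can also match the empty string, which is where the empty results come from.

```python
import re

text = 'bcacaca dcaca dbcaca'

pattern = re.compile(r"""
    b?      # an optional literal 'b'
    (?:     # open a non-capturing group
        .a  # any single character (except newline) followed by 'a'
    )*      # close the group; repeat it zero or more times
""", re.VERBOSE)

result = pattern.findall(text)

# Keep only the non-empty matches.
non_empty = [m for m in result if m]
print(non_empty)  # ['bcacaca', 'caca', 'bcaca']
```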


References

  1. Remove empty strings from a list of strings

How to return a string if a re.findall finds no match

You could do this in a single line:

results += re.findall(pattern, extracted_string) or ["Error"]

BTW, you get no benefit from compiling the pattern inside the vendor loop because you're only using it once.

Your function could also return the whole search result using a single list comprehension:

return [m for v in vendor for m in re.findall(v, extracted_string) or ["Error"]]

It is a bit weird that you would want to both modify AND return the results list that is passed in as a parameter. This may produce unexpected side effects when you use the function.

Your "Error" flag may appear several times in the result list, and given that each pattern may return multiple matches, it will be hard to determine which pattern failed to find a value.

If you only want to signal an error when none of the vendor patterns match, you could use the or ["Error"] trick on the whole result:

return [m for v in vendor for m in re.findall(v, extracted_string)] or ["Error"]
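The trick works because an empty list is falsy, so the or falls through to ["Error"]. Here is a self-contained sketch of both variants. The function names, the vendor patterns and the sample strings are made up for illustration; only vendor and extracted_string come from the question.

```python
import re

def matches_per_pattern(vendor, extracted_string):
    """Return all matches, adding "Error" once for each pattern that finds nothing."""
    return [m for v in vendor for m in re.findall(v, extracted_string) or ["Error"]]

def matches_overall(vendor, extracted_string):
    """Return all matches, or ["Error"] only if no pattern matches at all."""
    return [m for v in vendor for m in re.findall(v, extracted_string)] or ["Error"]

# Hypothetical patterns and input.
vendor = [r"ACME-\d+", r"XYZ-\d+"]
extracted_string = "order ACME-12 and ACME-34"

print(matches_per_pattern(vendor, extracted_string))  # ['ACME-12', 'ACME-34', 'Error']
print(matches_overall(vendor, extracted_string))      # ['ACME-12', 'ACME-34']
print(matches_overall(vendor, "nothing here"))        # ['Error']
```

Neither function modifies a list passed in by the caller, which avoids the side effects mentioned above.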

Why re.findall does not find the match in this case?

Your code works and finds every match. You just misunderstand regex groups and how they affect the result of findall:

# code partially generated by regex101.com to demonstrate the issue
# see https://regex101.com/r/Gngy0r/1

import re

regex = r"\s([0-9A-Z]+\w*)\s+\S*?[Aa]lloy\s"

test_str = " 1AZabc sdfsdfAlloy "

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):

    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1

        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

# use findall and print its results
print(re.findall(regex, test_str))

Output:

# full match that you got 
Match 1 was found at 0-20: 1AZabc sdfsdfAlloy
# and what was captured
Group 1 found at 1-7: 1AZabc

# findall only gives you the groups ...
['1AZabc']

Either remove the parentheses, or widen them so that they enclose everything you are interested in:

regex = r"\s([0-9A-Z]+\w*\s+\S*?[Aa]lloy)\s"
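As a quick check, here is a sketch comparing the two options on the same test string. The variable names no_group and wide_group are mine. Note that without any group the result includes the surrounding whitespace, because the leading and trailing \s are part of the match.

```python
import re

test_str = " 1AZabc sdfsdfAlloy "

# Option 1: no parentheses at all, findall returns the whole match.
no_group = r"\s[0-9A-Z]+\w*\s+\S*?[Aa]lloy\s"

# Option 2: one group around everything of interest.
wide_group = r"\s([0-9A-Z]+\w*\s+\S*?[Aa]lloy)\s"

print(re.findall(no_group, test_str))    # [' 1AZabc sdfsdfAlloy ']
print(re.findall(wide_group, test_str))  # ['1AZabc sdfsdfAlloy']
```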

