Python Extract Pattern Matches

Python extract pattern matches

You need a capture group in the regex. Search for the pattern with search(); if a match is found, retrieve the captured string with group(index). Assuming the necessary checks (that the match is not None) are performed:

>>> import re
>>> s = "name my_user_name is valid"  # example input, not shown in the original
>>> p = re.compile("name (.*) is valid")
>>> result = p.search(s)
>>> result
<_sre.SRE_Match object at 0x10555e738>
>>> # group(1) returns the first capture (the text within the parentheses);
>>> # group(0) returns the entire matched text.
>>> result.group(1)
'my_user_name'
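As a minimal sketch of how group(0), group(1), and groups() differ, the snippet below runs the same pattern on a made-up input (the value of s is an assumption, since the answer does not show it) and includes the None check the answer takes for granted:

```python
import re

# Hypothetical input; the original answer does not show the value of s.
s = "the name my_user_name is valid today"

p = re.compile(r"name (.*) is valid")
result = p.search(s)

if result is not None:            # search() returns None when nothing matches
    whole = result.group(0)       # the entire matched text
    first = result.group(1)       # the text captured by the first ( ... )
    all_groups = result.groups()  # a tuple holding every capture group
    print(whole)
    print(first)
    print(all_groups)
```

Note that group(0) covers only the matched portion, not the whole input string.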

Extract part of a regex match

Use ( ) in the regexp and group(1) in Python to retrieve the captured string (re.search returns None if it finds no match, so don't call group() on its result directly):

title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)

if title_search:
    title = title_search.group(1)
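One way to package the None check is a small helper. The sketch below is illustrative only: the function name extract_title, the sample HTML, the non-greedy (.*?) and the re.DOTALL flag are all assumptions that are not part of the answer above.

```python
import re

def extract_title(html, default=None):
    """Return the text inside the first <title> element, or default if absent."""
    # Non-greedy capture plus DOTALL (assumptions) so a title spanning
    # several lines is still found and the match stops at the first </title>.
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else default

sample = "<html><head><TITLE> Example page </TITLE></head></html>"  # hypothetical
print(extract_title(sample))             # Example page
print(extract_title("<p>no title</p>"))  # None
```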

Python - Extract pattern from string using RegEx

import re
pattern = re.compile(r'foo\(.*?\)')
test_str = 'foo(123456) together with foo(2468)'

for match in re.findall(pattern, test_str):
    print(match)

Two things:

  1. .*? is the lazy (non-greedy) quantifier. It matches the same characters as the greedy quantifier (.*), except that it consumes as few characters as possible, scanning left to right across the string. Note that if you want to require at least one character between the parentheses, use .+? instead.

  2. Use \( and \) instead of ( and ) because parentheses are normally used inside regular expressions to indicate capture groups, so if you want to match parentheses literally, you have to use the escape character before them, which is backslash.
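To make both points concrete, here is a short sketch that runs greedy and lazy variants on the same test string. The third pattern, which adds an unescaped capture group inside the escaped parentheses, is an extra illustration and not part of the answer above.

```python
import re

test_str = 'foo(123456) together with foo(2468)'

# Greedy: .* runs to the LAST closing parenthesis, giving one long match.
greedy = re.findall(r'foo\(.*\)', test_str)

# Lazy: .*? stops at the FIRST closing parenthesis, giving two matches.
lazy = re.findall(r'foo\(.*?\)', test_str)

# Escaped \( \) match literal parentheses; the unescaped ( ) capture
# only the text between them.
inner = re.findall(r'foo\((.*?)\)', test_str)

print(greedy)  # ['foo(123456) together with foo(2468)']
print(lazy)    # ['foo(123456)', 'foo(2468)']
print(inner)   # ['123456', '2468']
```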

Python - Regex findall extract all patterns that may be substring of one another

When one keyword is a substring of another, you need to iterate over your keywords, because a single regex alternation picks only one alternative at any given point in the string (most engines, including re, take the first alternative that matches), never both. Iterating over the keywords ensures you find all matches:

import re

string = "A B C D"
keys = ["A", "B", "A B"]

matches = []
for k in keys:
    matches += re.findall(re.escape(k), string)

print(matches)

Output

['A', 'B', 'A B']
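For comparison, this sketch shows what a single alternation returns on the same string. The variable names are illustrative; the point is that whichever alternative matches first at a position consumes the text, so overlapping keywords are never all reported.

```python
import re

string = "A B C D"

# "A" is tried first at position 0 and wins, so "A B" is never reported.
first_wins = re.findall(r"A|B|A B", string)

# Putting the longest keyword first reports "A B" but then hides "A" and "B".
longest_first = re.findall(r"A B|A|B", string)

print(first_wins)     # ['A', 'B']
print(longest_first)  # ['A B']
```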


How do I return a string from a regex match in python?

Call group(0) on the match object that re.match returns, like so:

imtag = re.match(r'<img.*?>', line).group(0)

Edit:

You also might be better off doing something like

imgtag = re.match(r'<img.*?>', line)
if imgtag:
    print("yo it's a {}".format(imgtag.group(0)))

to skip the lines where nothing matched (re.match returns None for those).
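Keep in mind that re.match only matches at the start of the string, while re.search scans the whole line. The sketch below, using made-up input lines, collects tags with search and shows which lines match would have accepted:

```python
import re

# Hypothetical input lines for illustration.
lines = [
    '<img src="a.png">',
    '<p>text <img src="b.png"> more</p>',
    '<p>no image here</p>',
]

IMG_TAG = re.compile(r'<img.*?>')

# search() finds the tag anywhere in the line; lines without one are skipped.
tags = []
for line in lines:
    m = IMG_TAG.search(line)
    if m:
        tags.append(m.group(0))

# match() succeeds only when the line itself begins with the tag.
starts_with_tag = [line for line in lines if IMG_TAG.match(line)]

print(tags)             # ['<img src="a.png">', '<img src="b.png">']
print(starts_with_tag)  # ['<img src="a.png">']
```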

How to extract the substring between two markers?

Using regular expressions (see the re module documentation for further reference):

import re

text = 'gfgfdAAA1234ZZZuijjk'

m = re.search('AAA(.+?)ZZZ', text)
if m:
    found = m.group(1)

# found: 1234

or:

import re

text = 'gfgfdAAA1234ZZZuijjk'

try:
    found = re.search('AAA(.+?)ZZZ', text).group(1)
except AttributeError:
    # AAA, ZZZ not found in the original string
    found = ''  # apply your error handling

# found: 1234
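If the markers vary, the same idea can be wrapped in a helper. This is a sketch only: the name between and its default parameter are hypothetical, and re.escape is used so that markers containing regex metacharacters are treated literally.

```python
import re

def between(text, start, end, default=''):
    """Return the shortest non-empty text found between start and end."""
    # re.escape makes markers such as '[' or '(' match literally.
    pattern = re.escape(start) + r'(.+?)' + re.escape(end)
    m = re.search(pattern, text)
    return m.group(1) if m else default

print(between('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # 1234
print(between('price: [42]', '[', ']'))               # 42
print(repr(between('nothing here', 'AAA', 'ZZZ')))    # ''
```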

Use regex to extract multiple strings following certain pattern

If you want to return all the matches individually using only a single findall, you'll need a positive lookbehind, e.g. (?<=foo). Unfortunately, Python's built-in re module only supports fixed-width lookbehind. However, if you're willing to use the third-party regex module, which supports variable-width lookbehind, it can be done.

Regex:

(?<=Invalid items: \([^)]*)[^ ;)]+

Demonstration: https://regex101.com/r/p90Z81/1

If there can be empty items, a small modification to the regex allows capture of these zero-width matches, as follows:

(?<=Invalid items: \([^)]*)(?:[^ ;)]+|(?<=\(| ))
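If installing the regex module is not an option, a two-pass approach with the standard re module gives the same items without any lookbehind. The sketch below assumes an input shaped like the patterns above suggest (items separated by spaces or semicolons inside "Invalid items: (...)"); the sample string is made up.

```python
import re

# Hypothetical input; the question's real data is not shown here.
log = "first Invalid items: (abc; def; ghi), then Invalid items: (jkl)"

items = []
# Pass 1: capture the contents of every "Invalid items: (...)" list.
for inner in re.findall(r"Invalid items: \(([^)]*)\)", log):
    # Pass 2: split those contents into items (anything but space or ';').
    items.extend(re.findall(r"[^ ;]+", inner))

print(items)  # ['abc', 'def', 'ghi', 'jkl']
```

This trades the single findall for two simpler patterns, and unlike the second regex above it does not report empty items.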

python regex: extract list elements, each of which matches multiple patterns

There is no need for two regexes; a single one is enough:

import re

somelist = [
    'AAAA 1234 SD OXD',
    'AAAB 2342 DF BDD',
    'ERTE 3454 RE DFD',
    'GWED 1234 SD TCD',
    'AAAA 2353 SD MKX',
    'VERD 1234 IO ERT',
    'AAAA 2353 SD MKX',
    'AAAA 2353 SD MKX']

# \s+ tolerates any amount of whitespace between the columns
print(list(filter(lambda x: re.search(r"^.{4}\s+1234\s+SD", x), somelist)))
# ['AAAA 1234 SD OXD', 'GWED 1234 SD TCD']
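The same filter can also be written with a precompiled pattern and a list comprehension. This is a sketch under the assumption that the columns are separated by one or more whitespace characters; the shortened list and the names pattern and selected are illustrative.

```python
import re

somelist = [
    'AAAA 1234 SD OXD',
    'AAAB 2342 DF BDD',
    'GWED 1234 SD TCD',
    'VERD 1234 IO ERT',
]

# First column (any non-space run), then 1234, then SD as a whole word.
pattern = re.compile(r"^\S+\s+1234\s+SD\b")

selected = [row for row in somelist if pattern.search(row)]
print(selected)  # ['AAAA 1234 SD OXD', 'GWED 1234 SD TCD']
```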

Regex extract word starting with a set string and ending the line or ending with ;

Match SAT and everything not a space, semicolon or newline:

\bSAT[^ ;\n]*
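A quick way to try the pattern in Python, using a made-up sample string. The \b word boundary means a word that merely contains SAT, such as xSAT9, is not picked up.

```python
import re

# Hypothetical sample text for illustration.
text = "SAT123; foo SAT_abc\nbar SATURN; xSAT9"

# Each match starts at a word boundary and runs until a space,
# a semicolon, a newline, or the end of the text.
found = re.findall(r"\bSAT[^ ;\n]*", text)
print(found)  # ['SAT123', 'SAT_abc', 'SATURN']
```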

