Python - Regex Match Multiple Patterns in Multiple Lines

Regular expression matching a multiline block of text

Try this:

re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)

I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. In multiline mode, ^ matches the position immediately following a newline and $ matches the position immediately preceding a newline.

Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:

re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)

BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.

Python - Regex match multiple patterns in multiple lines

import re

regex = r"""
Orion/PatchDetails/process_form.+?release=\d+
(.+) # use this as your match
zip[^\"]
"""

matches = re.compile(regex, re.MULTILINE | re.DOTALL | re.VERBOSE)

Add re.DOTALL to let . include \n. For your regex, this lets you match multiple lines

https://regex101.com/r/jBwq20/1

Python regex match across multiple lines

You may use

config_value = 'Example'
pattern=r'(?sm)^{}=(.*?)(?=[\r\n]+\w+=|\Z)'.format(config_value)
match = re.search(pattern, s)
if match:
print(match.group(1))

See the Python demo.

Pattern details

  • (?sm) - re.DOTALL and re.M are on
  • ^ - start of a line
  • Example= - a substring
  • (.*?) - Group 1: any 0+ chars, as few as possible
  • (?=[\r\n]+\w+=|\Z) - a positive lookahead that requires the presence of 1+ CR or LF symbols followed with 1 or more word chars followed with a = sign, or end of the string (\Z).

See the regex demo.

Multiline regex match retrieving line numbers and matches

you could first find all \n characters in the text and their respective position/character index. since each \n...well...starts a new line, the index of each value in this list indicates the line number the found \n character terminates. then search all occurrences of you pattern and use the aforementioned list to look up the start/end position of the match...

import re
import bisect

text = """new
File()
aa new File()
new
File()
there is a new File() and new
File() again
new



File()
there is not a matching pattern here File() new
new File() test new File() occurs twice on this line
"""

# character indices of all \n characters in text
nl = [m.start() for m in re.finditer("\n", text, re.MULTILINE|re.DOTALL)]

matches = list(re.finditer(r"(new\s+File\(\))", text, re.MULTILINE|re.DOTALL))
match_count = 0
for m in matches:
match_count += 1
r = range(bisect.bisect(nl, m.start()-1), bisect.bisect(nl, m.end()-1)+1)
print(re.sub(r"\s+", " ", m.group(1), re.DOTALL), "found on line(s)", *r)
print(f"{match_count} occurrences of new File() found in file....")

output:

new File() found on line(s) 0 1
new File() found on line(s) 2
new File() found on line(s) 3 4
new File() found on line(s) 5
new File() found on line(s) 5 6
new File() found on line(s) 7 8 9 10 11
new File() found on line(s) 13
new File() found on line(s) 13
8 occurrences of new File() found in file....

regex multiline not working on repeated patterns

You should use Lazy quantifiers, as they stop on the first match they find:

\bSTDMETHOD\b((.|\n|\r)*?)\bPURE\b

Tested on Regexr.com

enter image description here

regex matching whitespace characters and multiple optional patterns before start of text

Your pattern matches one instance of # ... or !mycommand. One way to solve this problem is to put all of them into one match, and use re.search to find the first match.

To do this, you need to repeat the part that matches # ... or !mycommand using *:

^\s*^(?:#.*\s*|!mycommand\s*)*

I have also changed #.* to #.*\s* so that it goes all the way to the next line where a non-whitespace is found.

Demo

Responding to your comment:

if the string begins with code, this regex should not match anything

You can try:

\A\s*^(?:#.*\s*|!mycommand\s*)+

I changed to \A so that it only matches the absolute start of the string, instead of start of line. I also changed the last * to + so at least one # ... or !mycommand has to be present.

Matching regular expression to multiple line blocks in python

There are numerous ways that this could be solved, here is one. I've added the information to a dictionary, where you will get a list of dictionaries as an output.

def parse_doc(filename):
with open(filename) as f:
pattern1 = re.compile(r'<sec name="(\D_\d\d_\w+)"\s+sound_freq="(\D\D\D\d+:\d+-\d+)"')
pattern2 = re.compile(r'<per fre="(Volum_+\d+Kb)"+\svalue="(\d+.+)"')

doc = []
for i in f.readlines():
p1 = re.match(pattern1, i)
p2 = re.match(pattern2, i)
line = {}
if p1:
line.update({'sec': p1.group(1), 'sound_freq': p1.group(2)})
if p2:
line.update({p2.group(1): p2.group(2)})
if len(line)>0:
doc.append(line)

return doc

print(parse_doc('doc.txt'))

Output

[{'sec': 'M_20_K40745170', 'sound_freq': 'mhr17:7907527-7907589'}, {'Volum_5Kb': '89.00'}, {'Volum_40Kb': '00.00'}, {'Volum_70Kb': '77.00'}]

If you want to get all the values you could get it using the following:

def parse_doc_all(filename):
with open(filename) as f:
pattern1 = re.compile(r'(.|\w+)="([^\s]+)"')
doc = {}
for i in f.readlines():
doc.update({p[0]: p[1] for p in re.findall(pattern1, i)})

return doc

print(parse_doc_all('doc.txt'))

Which will give you

{'name': 'M_20_K40745170', 'sound_freq': 'mhr17:7907527-7907589', 'tension': 'SGCGSCGSCGSCGSC', 's_c': '0', 'number': '5748', 'v': '0.1466469683747654', 'y': '0.0', 'units': 'sec', 'first_name': 'g7tty', 'description': 'xyz', 'abc': 'trt', 'id': 'abc', 'fre': 'Volum_70Kb', 'value': '77.00'}


Related Topics



Leave a reply



Submit