Regular Expression Matching a Multiline Block of Text

Try this:

re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)

I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. They are zero-width: in multiline mode, ^ matches at the start of the string and at the position immediately following each newline, and $ matches at the end of the string and at the position immediately preceding each newline.
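
To make the anchor behaviour concrete, here is a minimal sketch in Python. The sample string is made up for illustration; it only shows that the anchors mark positions and never consume the newline.

```python
import re

text = "one\ntwo\nthree"

# Without re.MULTILINE, ^ and $ only anchor at the ends of the whole string.
whole_string_only = re.findall(r"^\w+$", text)

# With re.MULTILINE, they also anchor on either side of each newline.
per_line = re.findall(r"^\w+$", text, re.MULTILINE)

# The anchors are zero-width: the newline is never part of the match.
middle = re.search(r"^two$", text, re.MULTILINE).group()

print(whole_string_only)  # []
print(per_line)           # ['one', 'two', 'three']
print(repr(middle))       # 'two'
```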

Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:

re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)

BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
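
As a rough illustration of how the first regex behaves, the sketch below runs it over a made-up input (a header line, a blank line, then a block of lines). The sample text and the variable names are assumptions, not taken from the question.

```python
import re

# Hypothetical sample: a header line, a blank line, then a block of lines.
text = (
    "first header\n"
    "\n"
    "AAAA\n"
    "CCCC\n"
    "\n"
    "second header\n"
    "\n"
    "GGGG\n"
)

block_re = re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)

blocks = []
for m in block_re.finditer(text):
    header = m.group(1)
    # Group 2 begins with a newline (from the blank separator line), so strip it.
    lines = m.group(2).lstrip("\n").split("\n")
    blocks.append((header, lines))

print(blocks)
# [('first header', ['AAAA', 'CCCC']), ('second header', ['GGGG'])]
```

Because the dot does not match newlines here, each .+ is confined to a single line, which is why DOTALL would break the pattern.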

Regular Expressions match a block from multiline html text

I tried the xml.etree.ElementTree module as explained by @kazbeel, but it raised a "mismatched tag" error, which I found to be common when using that module. I then found the BeautifulSoup module, and it gave the desired results. The following code handles one more file pattern in addition to the ones from the question.

File3:

<input id="realm_90" type="hidden" name="horizon" value="RADIUS">

Code:

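from bs4 import BeautifulSoup  # module for parsing XML/HTML files

def get_realms(html_text):
    realms = []
    soup = BeautifulSoup(html_text, 'lxml')  # requires the lxml parser to be installed
    in_tag = soup.find(attrs={"name": "horizon"})
    if in_tag is None:  # no element named "horizon" in this file
        return realms
    if in_tag.name == 'select':
        # collect the value of every tag nested inside the <select>
        for tag in in_tag.find_all():
            realms.append(tag.attrs['value'])
    elif in_tag.name == 'input':
        realms.append(in_tag.attrs['value'])
    return realms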

I agree with @ZiTAL: don't use regular expressions to parse XML/HTML files. It gets too complicated, and there are a number of libraries available for the job.

Regex matching pattern in multiple lines without specific word in the match

You might use

^PAT_A[^;\n]*(?:\n(?![^\n;]*NOT_MATCH_THIS)[^;\n]*)*\n[^;\n]*PAT_B[^;]*;

In parts, the pattern matches:

  • ^ Start of string
  • PAT_A Match literally
  • [^;\n]* Optionally match any char except ; or a newline
  • (?: Non capture group (to repeat as a whole)
    • \n(?![^\n;]*NOT_MATCH_THIS) Match a newline, and assert that the rest of that line (up to the next ; or newline) does not contain NOT_MATCH_THIS
    • [^;\n]* If the previous assertion is true, match the whole line (not containing a ;)
  • )* Close the non capture group, and optionally repeat matching all lines
  • \n[^;\n]* Match a newline, and any char except ; or a newline
  • PAT_B[^;]*; Then match PAT_B followed by any char except ; followed by matching the ;
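
A small sketch of the pattern in Python, under a few assumptions: PAT_A, PAT_B and NOT_MATCH_THIS are used as literal placeholders, both input strings are invented, and re.MULTILINE is added so that ^ can also anchor at the start of any line.

```python
import re

# PAT_A, PAT_B and NOT_MATCH_THIS are the placeholder literals from the pattern.
statement_re = re.compile(
    r"^PAT_A[^;\n]*"
    r"(?:\n(?![^\n;]*NOT_MATCH_THIS)[^;\n]*)*"
    r"\n[^;\n]*PAT_B[^;]*;",
    re.MULTILINE,  # assumption: lets ^ anchor at the start of any line
)

# Hypothetical inputs for illustration.
allowed = "PAT_A start\nmiddle line\nPAT_B end;"
rejected = "PAT_A start\nNOT_MATCH_THIS here\nPAT_B end;"

allowed_match = statement_re.search(allowed)
rejected_match = statement_re.search(rejected)

print(allowed_match is not None)  # True
print(rejected_match is None)     # True
```

The negative lookahead is checked once per intermediate line, so a single line containing NOT_MATCH_THIS between PAT_A and PAT_B is enough to block the match.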

Regex matching over multiple lines

Your tries were pretty close. In the first one you probably need to set the flag that allows the . to match line feeds. It normally doesn't. In your second, you need to set the non-greedy ? mode on the anything match .*. Otherwise .* tries to match the entire rest of the text.

It would be something like this. /^ <br>\n\d+\s[a-zA-Z"“](.*?\n)*?<hr\/>/
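
The two points above (the dot not crossing line feeds by default, and greedy versus non-greedy matching) can be seen in a short Python sketch. The sample text is made up; re.DOTALL is Python's name for the flag that lets . match line feeds.

```python
import re

text = "start\nA\nB\nend\nC\nend"

# By default "." stops at a newline, so the pattern cannot span lines.
no_dotall = re.search(r"start.*end", text)

# With re.DOTALL the greedy .* runs on to the last "end" in the text.
greedy = re.search(r"start.*end", text, re.DOTALL).group()

# The non-greedy .*? stops at the first "end" instead.
lazy = re.search(r"start.*?end", text, re.DOTALL).group()

print(no_dotall)     # None
print(repr(greedy))  # 'start\nA\nB\nend\nC\nend'
print(repr(lazy))    # 'start\nA\nB\nend'
```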

But anyway, this is something that is best done in Perl. Perl is where all the advanced regex comes from.

use strict;
use diagnostics;

our $text =<<EOF;
The figure that now stood by its bows was tall and swart, with one white tooth <br>
evilly protruding from its steel-like lips. <br>
 <br>
1 "Hardly" had they pulled out from under the ship’s lee, when a <br>
fourth keel, coming from the windward side, pulled round under the stern, <br>
and showed the five strangers <br>
127 <br>
<br>
<hr/>
More text.
EOF

our $regex = qr{^ <br>\n\d+ +[A-Z"“].*?<hr/>}ism;
$text =~ s/($regex)/<!-- Removed -->/;
print "Removed text:\n[$1]\n\n";
print "New text:\n[$text]\n";

That prints:

Removed text:
[ <br>
1 "Hardly" had they pulled out from under the ship’s lee, when a <br>
fourth keel, coming from the windward side, pulled round under the stern, <br>
and showed the five strangers <br>
127 <br>
<br>
<hr/>]

New text:
[The figure that now stood by its bows was tall and swart, with one white tooth <br>
evilly protruding from its steel-like lips. <br>
<!-- Removed -->
More text.
]

The qr operator builds a regular expression so that it can be stored in a variable. The ^ at the beginning anchors the match at the beginning of a line. The ism at the end stands for case insensitive (i), single string (s), and multiple embedded lines (m): s allows . to match line feeds, and m allows ^ to match at the beginning of lines embedded in the string. To do a global replacement, add the g flag to the end of the substitution: s///g
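
For readers who are not using Perl, the same idea can be sketched in Python. This is a port under assumptions, not the Perl script itself: the sample text is a shortened, hypothetical version, and re.I, re.S and re.M stand in for the i, s and m modifiers.

```python
import re

# Shortened, hypothetical version of the sample text in the Perl script.
text = (
    "with one white tooth <br>\n"
    "evilly protruding from its steel-like lips. <br>\n"
    " <br>\n"
    "1 \"Hardly\" had they pulled out <br>\n"
    "127 <br>\n"
    " <br>\n"
    "<hr/>\n"
    "More text.\n"
)

# re.I | re.S | re.M play the role of Perl's i, s and m modifiers.
block_re = re.compile(r'^ <br>\n\d+ +[A-Z"“].*?<hr/>', re.I | re.S | re.M)

removed = block_re.search(text).group()
# count=1 replaces only the first match, like s/// without the g flag.
new_text = block_re.sub("<!-- Removed -->", text, count=1)

print(new_text)
```

Dropping count=1 would replace every match, which corresponds to adding g in Perl.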

The Perl regex documentation explains everything.
https://perldoc.perl.org/perlretut

See also Multiline replace in perl with extended expressions not working.

HTH

Python regex match across multiple lines

You may use

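import re

config_value = 'Example'
# re.escape guards against regex metacharacters in the key
pattern = r'(?sm)^{}=(.*?)(?=[\r\n]+\w+=|\Z)'.format(re.escape(config_value))
match = re.search(pattern, s)  # s is the text to search
if match:
    print(match.group(1))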

Pattern details

  • (?sm) - re.DOTALL and re.M are on
  • ^ - start of a line
  • Example= - a substring
  • (.*?) - Group 1: any 0+ chars, as few as possible
  • (?=[\r\n]+\w+=|\Z) - a positive lookahead that requires the presence of 1+ CR or LF symbols followed with 1 or more word chars followed with a = sign, or end of the string (\Z).
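
To see the lookahead at work, here is a self-contained sketch. The helper name get_config_value and the sample config text are assumptions made for illustration.

```python
import re

def get_config_value(text, key):
    """Return the (possibly multi-line) value stored under key, or None."""
    # re.escape keeps regex metacharacters in the key from being interpreted.
    pattern = r'(?sm)^{}=(.*?)(?=[\r\n]+\w+=|\Z)'.format(re.escape(key))
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Hypothetical config text; the value of "Example" spans two lines.
s = "Name=foo\nExample=line one\nline two\nOther=bar"

print(repr(get_config_value(s, "Example")))  # 'line one\nline two'
print(repr(get_config_value(s, "Name")))     # 'foo'
print(get_config_value(s, "Missing"))        # None
```

The second line of the value is kept because "line two" is not followed by a = sign, so the lookahead does not treat it as the start of a new key.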

Match multiline text using regular expression

First, you're using the modifiers under an incorrect assumption.

Pattern.MULTILINE or (?m) tells Java to let the anchors ^ and $ match at the start and end of each line (otherwise they only match at the start/end of the entire string).

Pattern.DOTALL or (?s) tells Java to allow the dot to match newline characters, too.

Second, in your case, the regex fails because you're using the matches() method which expects the regex to match the entire string - which of course doesn't work since there are some characters left after (\\W)*(\\S)* have matched.

So if you're simply looking for a string that starts with User Comments:, use the regex

^\s*User Comments:\s*(.*)

with the Pattern.DOTALL option:

Pattern regex = Pattern.compile("^\\s*User Comments:\\s*(.*)", Pattern.DOTALL);
Matcher regexMatcher = regex.matcher(subjectString);
if (regexMatcher.find()) {
    ResultString = regexMatcher.group(1);
}

ResultString will then contain the text after User Comments:
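
The same distinction exists in Python, which may help if you want to experiment outside Java. In this sketch re.fullmatch is used as a rough analogue of matches() and re.search as a rough analogue of find(); the subject string is hypothetical.

```python
import re

# Hypothetical subject string.
subject = "  User Comments: looks good\nbut needs tests"

# re.fullmatch must consume the entire string (similar to Java's matches()).
full = re.fullmatch(r"\s*User Comments:", subject)

# re.search only needs a match somewhere in the string (similar to find()).
without_dotall = re.search(r"^\s*User Comments:\s*(.*)", subject).group(1)
with_dotall = re.search(r"^\s*User Comments:\s*(.*)", subject, re.DOTALL).group(1)

print(full)                  # None
print(repr(without_dotall))  # 'looks good'
print(repr(with_dotall))     # 'looks good\nbut needs tests'
```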

How do I match any character across multiple lines in a regular expression?

It depends on the language, but there should be a modifier that you can add to the regex pattern. In PHP it is:

/(.*)<FooBar>/s

The s at the end causes the dot to match all characters including newlines.
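
In Python, for comparison, the equivalent modifier is re.DOTALL (or the inline (?s)). The sketch below also shows the [\s\S] character class, a common workaround in engines that lack such a modifier. The sample text is made up.

```python
import re

text = "line one\nline two<FooBar>tail"

# Default: the dot stops at newlines, so only the last line is captured.
default = re.search(r"(.*)<FooBar>", text).group(1)

# Three ways to let the match span lines.
flag = re.search(r"(.*)<FooBar>", text, re.DOTALL).group(1)
inline = re.search(r"(?s)(.*)<FooBar>", text).group(1)
char_class = re.search(r"([\s\S]*)<FooBar>", text).group(1)

print(repr(default))     # 'line two'
print(repr(flag))        # 'line one\nline two'
print(repr(inline))      # 'line one\nline two'
print(repr(char_class))  # 'line one\nline two'
```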

Matching regular expression to multiple line blocks in python

There are numerous ways to solve this; here is one. Each matching line is stored in a dictionary, so the output is a list of dictionaries.

import re

def parse_doc(filename):
    with open(filename) as f:
        pattern1 = re.compile(r'<sec name="(\D_\d\d_\w+)"\s+sound_freq="(\D\D\D\d+:\d+-\d+)"')
        pattern2 = re.compile(r'<per fre="(Volum_+\d+Kb)"+\svalue="(\d+.+)"')

        doc = []
        for i in f.readlines():
            # re.match only matches at the start of each line
            p1 = re.match(pattern1, i)
            p2 = re.match(pattern2, i)
            line = {}
            if p1:
                line.update({'sec': p1.group(1), 'sound_freq': p1.group(2)})
            if p2:
                line.update({p2.group(1): p2.group(2)})
            if len(line)>0:
                doc.append(line)

    return doc

print(parse_doc('doc.txt'))

Output

[{'sec': 'M_20_K40745170', 'sound_freq': 'mhr17:7907527-7907589'}, {'Volum_5Kb': '89.00'}, {'Volum_40Kb': '00.00'}, {'Volum_70Kb': '77.00'}]

If you want all the attribute values, you could use the following:

import re

def parse_doc_all(filename):
    with open(filename) as f:
        pattern1 = re.compile(r'(.|\w+)="([^\s]+)"')
        doc = {}
        for i in f.readlines():
            # later lines overwrite earlier values stored under the same key
            doc.update({p[0]: p[1] for p in re.findall(pattern1, i)})

    return doc

print(parse_doc_all('doc.txt'))

Which will give you

{'name': 'M_20_K40745170', 'sound_freq': 'mhr17:7907527-7907589', 'tension': 'SGCGSCGSCGSCGSC', 's_c': '0', 'number': '5748', 'v': '0.1466469683747654', 'y': '0.0', 'units': 'sec', 'first_name': 'g7tty', 'description': 'xyz', 'abc': 'trt', 'id': 'abc', 'fre': 'Volum_70Kb', 'value': '77.00'}

Python regex, matching pattern over multiple lines.. why isn't this working?

Try re.findall(r"####(.*?)\s(.*?)\s####", string, re.DOTALL) (works with re.compile too, of course).

This regexp will return a list of tuples, each containing the number of the section and the section content.

For your example, this will return [('1', 'ttteest'), ('2', ' \n\nttest')].

(BTW: your example won't run; for multiline strings, use ''' or """.)
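
Putting both remarks together, a runnable sketch might look like this. The input string is an assumption (the original example is not reproduced here); it is written with triple quotes so that it can span several lines.

```python
import re

# Hypothetical input, written with triple quotes so it can span lines.
string = """####1
ttteest
####

####2

ttest
####"""

# re.DOTALL lets the second group run across line breaks.
sections = re.findall(r"####(.*?)\s(.*?)\s####", string, re.DOTALL)

print(sections)
# [('1', 'ttteest'), ('2', '\nttest')]
```

The first \s consumes the newline after the section number, and the last \s consumes the newline before the closing ####, so any other blank lines remain part of the captured content.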


