Regular expression matching a multiline block of text
Try this:
re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)
I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. In multiline mode, ^ matches the position immediately following a newline and $ matches the position immediately preceding a newline.
Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:
re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)
BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
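As a rough illustration of the points above, here is a small Python sketch. The sample text and the variable names are made up for this example (they are not from the question). It runs the linefeed-only regex and shows how ^ and $ behave with and without re.MULTILINE.

```python
import re

# Hypothetical sample: a title line, a blank line, then a block of lines.
text = "some Varying TEXT\n\nAAA\nBBB\nCCC\n\nother"

pattern = re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)
m = pattern.search(text)
title = m.group(1)   # the title line
block = m.group(2)   # the block, each line preceded by its newline

# With re.MULTILINE the anchors match at every line boundary;
# without it they only match at the ends of the whole string.
lines_multiline = re.findall(r"^\w+$", "a\nb\nc", re.MULTILINE)
lines_default = re.findall(r"^\w+$", "a\nb\nc")

print(repr(title), repr(block), lines_multiline, lines_default)
```

Because the dot stops at \n, the block ends at the next blank line, which is why DOTALL would defeat this pattern.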
Regular Expressions match a block from multiline html text
I tried the xml.etree.ElementTree module as explained by @kazbeel, but it gave me a "mismatched tag" error, which I found to be the case in most instances of its usage. Then I found the BeautifulSoup module, and it gave the desired results. The following code covers another file pattern along with the ones from the question.
File3:
<input id="realm_90" type="hidden" name="horizon" value="RADIUS">
Code:
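from bs4 import BeautifulSoup  # module for parsing XML/HTML files

def get_realms(html_text):
    realms = []
    soup = BeautifulSoup(html_text, 'lxml')
    # Find the first tag whose name attribute is "horizon"
    in_tag = soup.find(attrs={"name": "horizon"})
    if in_tag is None:
        return realms  # no matching tag in this document
    if in_tag.name == 'select':
        # Collect the value of every tag nested in the <select>
        for tag in in_tag.find_all():
            realms.append(tag.attrs['value'])
    elif in_tag.name == 'input':
        realms.append(in_tag.attrs['value'])
    return realms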
I agree with @ZiTAL: don't use regular expressions to parse XML/HTML files, because it gets too complicated and there are a number of libraries available for the job.
Regex matching pattern in multiple lines without specific word in the match
You might use
^PAT_A[^;\n]*(?:\n(?![^\n;]*NOT_MATCH_THIS)[^;\n]*)*\n[^;\n]*PAT_B[^;]*;
In parts, the pattern matches:
^ - Start of string
PAT_A - Match literally
[^;\n]* - Optionally match any char except ; or a newline
(?: - Non capture group (to repeat as a whole)
\n(?![^\n;]*NOT_MATCH_THIS) - Match a newline, and assert that the line does not contain NOT_MATCH_THIS (the lookahead excludes ; and newlines so that it stays on the same line)
[^;\n]* - If the previous assertion is true, match the whole line (not containing a ;)
)* - Close the non capture group, and optionally repeat matching all lines
\n[^;\n]* - Match a newline, and any char except ; or a newline
PAT_B[^;]*; - Then match PAT_B followed by any char except ; followed by matching the ;
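To see the pattern in action, here is a minimal Python sketch. The two sample strings are invented for illustration, and the use of re.MULTILINE (so that ^ can also match at the start of a line) is an assumption about how the pattern is meant to be run.

```python
import re

pattern = re.compile(
    r"^PAT_A[^;\n]*(?:\n(?![^\n;]*NOT_MATCH_THIS)[^;\n]*)*\n[^;\n]*PAT_B[^;]*;",
    re.MULTILINE,
)

# Hypothetical input with no forbidden word between PAT_A and PAT_B.
good = "PAT_A start\n  middle line\n  PAT_B end;\n"
# Same shape, but an intermediate line contains NOT_MATCH_THIS.
bad = "PAT_A start\n  NOT_MATCH_THIS here\n  PAT_B end;\n"

good_match = pattern.search(good)
bad_match = pattern.search(bad)

print(good_match.group(0) if good_match else None)
print(bad_match)
```

The negative lookahead is checked once per intermediate line, so a single forbidden line is enough to stop the repetition and make the overall match fail.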
Regex matching over multiple lines
Your tries were pretty close. In the first one you probably need to set the flag that allows the . to match line feeds. It normally doesn't. In your second, you need to set the non-greedy ? mode on the anything match .* . Otherwise .* tries to match the entire rest of the text.
It would be something like this. /^ <br>\n\d+\s[a-zA-Z"“](.*?\n)*?<hr\/>/
But anyway, this is something that is best done in Perl. Perl is where all the advanced regex comes from.
use strict;
use diagnostics;
our $text =<<EOF;
The figure that now stood by its bows was tall and swart, with one white tooth <br>
evilly protruding from its steel-like lips. <br>
<br>
1 "Hardly" had they pulled out from under the ship’s lee, when a <br>
fourth keel, coming from the windward side, pulled round under the stern, <br>
and showed the five strangers <br>
127 <br>
<br>
<hr/>
More text.
EOF
our $regex = qr{^ <br>\n\d+ +[A-Z"“].*?<hr/>}ism;
$text =~ s/($regex)/<!-- Removed -->/;
print "Removed text:\n[$1]\n\n";
print "New text:\n[$text]\n";
That prints:
Removed text:
[ <br>
1 "Hardly" had they pulled out from under the ship’s lee, when a <br>
fourth keel, coming from the windward side, pulled round under the stern, <br>
and showed the five strangers <br>
127 <br>
<br>
<hr/>]
New text:
[The figure that now stood by its bows was tall and swart, with one white tooth <br>
evilly protruding from its steel-like lips. <br>
<!-- Removed -->
More text.
]
The qr operator builds a regular expression so that it can be stored in a variable. The ^ at the beginning means to anchor this match at the beginning of a line. The ism on the end stands for case insensitive, single string, multiple embedded lines. s allows . to match line feeds. m allows ^ to match at the beginning of lines embedded in the string. You would add a g flag to the end of the substitution to do a global replacement: s///g
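For readers who don't use Perl, roughly the same substitution can be sketched in Python, where re.I, re.S and re.M play the role of the i, s and m flags. The shortened sample text below is an assumption made for this sketch, not the full text from the question.

```python
import re

# Shortened, hypothetical version of the sample text.
text = (
    "evilly protruding from its steel-like lips. <br>\n"
    " <br>\n"
    '1 "Hardly" had they pulled out <br>\n'
    "127 <br>\n"
    " <br>\n"
    "<hr/>\n"
    "More text.\n"
)

# Same pattern as the Perl qr{...}ism above.
pattern = re.compile(r'^ <br>\n\d+ +[A-Z"“].*?<hr/>', re.I | re.S | re.M)

# count=1 mirrors s/// without the g flag (replace only the first match).
new_text, count = pattern.subn("<!-- Removed -->", text, count=1)
print(new_text)
```

Dropping count=1 would correspond to adding the g flag in Perl.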
The Perl regex documentation explains everything.
https://perldoc.perl.org/perlretut
See also Multiline replace in perl with extended expressions not working.
HTH
Python regex match across multiple lines
You may use
import re

# s is assumed to hold the configuration text from the question
config_value = 'Example'
pattern = r'(?sm)^{}=(.*?)(?=[\r\n]+\w+=|\Z)'.format(config_value)
match = re.search(pattern, s)
if match:
    print(match.group(1))
Pattern details
(?sm) - re.DOTALL and re.M are on
^ - start of a line
Example= - a substring
(.*?) - Group 1: any 0+ chars, as few as possible
(?=[\r\n]+\w+=|\Z) - a positive lookahead that requires the presence of 1+ CR or LF symbols followed with 1 or more word chars followed with a = sign, or end of the string (\Z).
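A self-contained sketch of the pattern details above follows. The helper name get_value and the sample configuration text are hypothetical, and re.escape is added here as a precaution in case the key contains regex metacharacters.

```python
import re

def get_value(text, key):
    """Return the (possibly multiline) value stored under key, or None."""
    pattern = r'(?sm)^{}=(.*?)(?=[\r\n]+\w+=|\Z)'.format(re.escape(key))
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Hypothetical config text: the value of Example spans two lines.
sample = "Name=foo\nExample=line one\nline two\nOther=bar\n"

value = get_value(sample, "Example")
missing = get_value(sample, "Nope")
print(repr(value), missing)
```

The lazy group keeps growing until the lookahead sees a line break followed by the next key= pair, which is what lets the value cross line boundaries.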
Match multiline text using regular expression
First, you're using the modifiers under an incorrect assumption.
Pattern.MULTILINE or (?m) tells Java to accept the anchors ^ and $ to match at the start and end of each line (otherwise they only match at the start/end of the entire string).
Pattern.DOTALL or (?s) tells Java to allow the dot to match newline characters, too.
Second, in your case, the regex fails because you're using the matches() method which expects the regex to match the entire string - which of course doesn't work since there are some characters left after (\\W)*(\\S)* have matched.
So if you're simply looking for a string that starts with User Comments:, use the regex
^\s*User Comments:\s*(.*)
with the Pattern.DOTALL option:
Pattern regex = Pattern.compile("^\\s*User Comments:\\s+(.*)", Pattern.DOTALL);
Matcher regexMatcher = regex.matcher(subjectString);
if (regexMatcher.find()) {
    ResultString = regexMatcher.group(1);
}
ResultString will then contain the text after User Comments:
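The same distinction exists in Python, where re.fullmatch behaves like Java's matches() and re.search like find(). The sketch below is a Python analogue rather than Java code, and the sample string is made up.

```python
import re

# Hypothetical input with a comment that continues on a second line.
text = "  User Comments: first line\nsecond line"

# fullmatch must consume the whole string, so this pattern fails,
# just as matches() does in the Java version.
whole = re.fullmatch(r"(\W)*(\S)*", text)

# search accepts a match anywhere; DOTALL lets (.*) cross the newline.
with_dotall = re.search(r"^\s*User Comments:\s+(.*)", text, re.DOTALL)
without_dotall = re.search(r"^\s*User Comments:\s+(.*)", text)

print(whole, repr(with_dotall.group(1)), repr(without_dotall.group(1)))
```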
How do I match any character across multiple lines in a regular expression?
It depends on the language, but there should be a modifier that you can add to the regex pattern. In PHP it is:
/(.*)<FooBar>/s
The s at the end causes the dot to match all characters including newlines.
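In Python the equivalent modifier is the re.DOTALL flag (or the inline (?s) form). A minimal sketch with a made-up input string:

```python
import re

# Hypothetical text where the content before <FooBar> spans two lines.
text = "line1\nline2<FooBar>"

# Without the flag, the dot stops at the newline.
plain = re.search(r"(.*)<FooBar>", text).group(1)

# With re.DOTALL, the dot also matches the newline.
dotall = re.search(r"(.*)<FooBar>", text, re.DOTALL).group(1)

print(repr(plain), repr(dotall))
```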
Matching regular expression to multiple line blocks in python
There are numerous ways this could be solved; here is one. Each matching line is stored in a dictionary, so the output is a list of dictionaries.
import re

def parse_doc(filename):
    with open(filename) as f:
        pattern1 = re.compile(r'<sec name="(\D_\d\d_\w+)"\s+sound_freq="(\D\D\D\d+:\d+-\d+)"')
        pattern2 = re.compile(r'<per fre="(Volum_+\d+Kb)"+\svalue="(\d+.+)"')
        doc = []
        for i in f.readlines():
            # re.match only matches at the start of each line
            p1 = re.match(pattern1, i)
            p2 = re.match(pattern2, i)
            line = {}
            if p1:
                line.update({'sec': p1.group(1), 'sound_freq': p1.group(2)})
            if p2:
                line.update({p2.group(1): p2.group(2)})
            if len(line)>0:
                doc.append(line)
    return doc

print(parse_doc('doc.txt'))
Output
[{'sec': 'M_20_K40745170', 'sound_freq': 'mhr17:7907527-7907589'}, {'Volum_5Kb': '89.00'}, {'Volum_40Kb': '00.00'}, {'Volum_70Kb': '77.00'}]
If you want to get all the values you could get it using the following:
def parse_doc_all(filename):
    with open(filename) as f:
        pattern1 = re.compile(r'(.|\w+)="([^\s]+)"')
        doc = {}
        for i in f.readlines():
            doc.update({p[0]: p[1] for p in re.findall(pattern1, i)})
    return doc

print(parse_doc_all('doc.txt'))
Which will give you
{'name': 'M_20_K40745170', 'sound_freq': 'mhr17:7907527-7907589', 'tension': 'SGCGSCGSCGSCGSC', 's_c': '0', 'number': '5748', 'v': '0.1466469683747654', 'y': '0.0', 'units': 'sec', 'first_name': 'g7tty', 'description': 'xyz', 'abc': 'trt', 'id': 'abc', 'fre': 'Volum_70Kb', 'value': '77.00'}
Python regex, matching pattern over multiple lines.. why isn't this working?
Try re.findall(r"####(.*?)\s(.*?)\s####", string, re.DOTALL) (works with re.compile too, of course).
This regexp will return tuples containing the number of the section and the section content.
For your example, this will return [('1', 'ttteest'), ('2', ' \n\nttest')].
(BTW: your example won't run; for multiline strings, use ''' or """.)
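A runnable version of the suggestion might look like the sketch below. The sample string is an assumption made here (the original input from the question is not reproduced), so the result differs slightly from the one quoted above.

```python
import re

# Hypothetical input: two sections delimited by #### markers.
string = """####1
ttteest
####
####2

ttest
####"""

# Group 1 is the section number, group 2 the section content;
# re.DOTALL lets the second group span several lines.
sections = re.findall(r"####(.*?)\s(.*?)\s####", string, re.DOTALL)
print(sections)
```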