How to Match a Newline Character in a Raw String

How to match a newline character in a raw string?

In a regular expression, a raw-string pattern can contain \n directly; the example below also passes re.M (multiline mode):

>>> import re
>>> s = """cat
... dog"""
>>>
>>> re.match(r'cat\ndog',s,re.M)
<_sre.SRE_Match object at 0xcb7c8>

Notice that the re module itself translates the \n in the raw string into a newline. As you indicated in your comments, you don't actually need re.M for it to match, but it does make $ and ^ behave more intuitively:

>>> re.match(r'^cat\ndog',s).group(0)
'cat\ndog'
>>> re.match(r'^cat$\ndog',s).group(0) #doesn't match
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> re.match(r'^cat$\ndog',s,re.M).group(0) #matches.
'cat\ndog'
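
The session above can be condensed into a short script. This is a minimal sketch assuming Python 3; the variable names are illustrative and not from the original answer. It shows that the raw pattern matches a real newline with or without re.M, and that $ only matches before an interior newline when re.M is set.

```python
import re

s = "cat\ndog"  # same content as the triple-quoted string in the session above

# The regex engine itself interprets the two characters \ and n as a newline,
# so no flag is needed for this to match.
no_flag = re.match(r'cat\ndog', s)

# Without re.M, $ matches only at the very end of the string
# (or just before a trailing newline), so this attempt fails.
dollar_plain = re.match(r'^cat$\ndog', s)

# With re.M, $ also matches just before each interior newline.
dollar_multiline = re.match(r'^cat$\ndog', s, flags=re.M)

print(no_flag is not None)             # True
print(dollar_plain is None)            # True
print(dollar_multiline.group(0) == s)  # True
```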

Regex for newline character search in given string in Python

As others have noted, you are probably looking for line.strip(). But, in case you still want to practice regex, you would use the following code:

import re

line = "got less no of bytes than requested\r\n"

# \r\n located anywhere in the string
prog = re.compile(r'\r\n')
# Alternative: \r or \n located anywhere in the string
prog = re.compile(r'(\r|\n)')

if prog.search(line):
    print('Do not use \\r\\n in MSG')
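
For comparison, here is a small sketch of the strip() route next to a regex check. It assumes the goal is simply to detect or drop a trailing line break; the names cleaned, has_line_break and still_has_line_break are made up for illustration.

```python
import re

line = "got less no of bytes than requested\r\n"

# str.strip() with no arguments removes leading and trailing whitespace,
# which includes \r and \n.
cleaned = line.strip()

# A character class is another way to write "\r or \n".
has_line_break = re.search(r'[\r\n]', line) is not None
still_has_line_break = re.search(r'[\r\n]', cleaned) is not None

print(repr(cleaned))         # 'got less no of bytes than requested'
print(has_line_break)        # True
print(still_has_line_break)  # False
```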

Different way to specify matching new line in Python regex

The combo \n indicates a 'newline character' in both Python itself and in re expressions as well (https://docs.python.org/2.0/ref/strings.html).

In a regular Python string, \n gets translated to a newline. The newline code is then fed into the re parser as a literal character.

A double backslash in a Python string gets translated to a single one. Therefore, a string "\\n" gets stored internally as "\n", and when sent to the re parser, it in turn recognizes this combo \n as indicating a newline code.

The r notation is a shortcut that saves you from having to enter doubled backslashes:

backslashes are not handled in any special way in a string literal prefixed with 'r' (https://docs.python.org/2/library/re.html)
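
The three spellings can be compared directly. This is a minimal sketch; the variable names are illustrative only.

```python
import re

plain = "\n"     # Python turns this into one newline character
doubled = "\\n"  # Python turns this into backslash + n
raw = r"\n"      # raw string: backslash + n, nothing is translated

print(len(plain), len(doubled), len(raw))  # 1 2 2
print(doubled == raw)                      # True

text = "a\nb"
# All three patterns match the newline in `text`: the first is a literal
# newline character, the other two are decoded by the re parser.
results = [re.search(p, text) is not None for p in (plain, doubled, raw)]
print(results)  # [True, True, True]
```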

Output Substring to Newline from a Raw Text String using Regex

Note:

  • I'm assuming you're looking for the whole line on which BOURKE appears as a substring.

  • In your own attempts, (?<BOURKE>...) simply gives the regex capture group a self-chosen name (BOURKE), which is unrelated to what the capture group's subexpression (...) actually matches.

  • For the use case at hand, there's no strict need to use a (named) capture group at all, so the solutions below make do without one, which, when the -match operator is used, means that the result of a successful match is reported in index [0] of the automatic $Matches variable, as shown below.


If your multiline input string contains only Unix-format LF newlines (\n), use the following:

if ($multiLineStr -match '.*BOURKE.*') { $Matches[0] }

Note:

  • To match case-sensitively, use -cmatch instead of -match.
  • If you know that the substring is preceded / followed by at least one char., use .+ instead of .*
  • If you want to search for the substring verbatim and it contains or may contain regex metacharacters (e.g. . ), apply [regex]::Escape() to it; e.g., [regex]::Escape('file.txt') yields file\.txt (\-escaped metacharacters).
  • If necessary, add additional constraints for disambiguation, such as requiring that the substring start or end only at word boundaries (\b)

If there's a chance that Windows-format CRLF newlines (\r\n) are present, use:

if ($multiLineStr -match '.*BOURKE[^\r\n]*') { $Matches[0] }

For an explanation of the regexes and the ability to experiment with them, see this regex101.com page for .*BOURKE.*, and this one for .*BOURKE[^\r\n]*

In short:

  • By default, . matches any character except \n, which obviates the need for line-specific anchors (^ and $) altogether, but with CRLF newlines requires excluding \r so as not to capture it as part of the match.[1]

Two asides:

  • PowerShell's -match operator only ever looks for one match; if you need to find all matches, you currently need to use the underlying [regex] API directly; e.g., [regex]::Matches($multiLineStr, '.*BOURKE[^\r\n]*', 'IgnoreCase').Value
    GitHub issue #7867 suggests bringing this functionality directly to PowerShell in the form of a -matchall operator.

  • If you want to anchor the substring to find, i.e. if you want to stipulate that it either occur at the start or at the end of a line, you need to switch to multi-line mode ((?m)), which makes ^ and $ match on each line; e.g., to only match if BOURKE occurs at the very start of a line:

    if ($multiLineStr -match '(?m)^BOURKE[^\r\n]*') { $Matches[0] }

If line-by-line processing is an option:

  • Line-by-line processing has the advantage that you needn't worry about differences in newline formats (assuming the utility handling the splitting into lines can handle both newline formats, which is true of PowerShell in general).

  • If you're reading the input text from a file, the Select-String cmdlet, whose very purpose is to find the whole lines on which a given regex or literal substring (-SimpleMatch) matches, additionally offers streaming processing, i.e. it reads lines one by one, without the need to read the whole file into memory.

(Select-String -LiteralPath file.txt -Pattern BOURKE).Line

Add -CaseSensitive for case-sensitive matching.

The following example simulates the above (-split '\r?\n' splits the multiline input string into individual lines, recognizing either newline format):

(
@'
initial text
preliminary text
unfinished line bfore the line I want
001 BOURKE, Bridget Mary ....... ........... 13 Mahina Road, Mahina Bay.Producrs/As 002 BOURKE. David Gerard ...
line after the line I want
extra text
extra extra text
'@ -split '\r?\n' |
Select-String -Pattern BOURKE
).Line

Output:

001 BOURKE, Bridget Mary ....... ........... 13 Mahina Road, Mahina Bay.Producrs/As 002 BOURKE. David Gerard ...

[1] Strictly speaking, the [^\r\n]* would also stop matching at a \r character in isolation (i.e., even if not directly followed by \n). If ruling out that case is important (which seems unlikely), use a (simplified version of) the regex suggested by Mathias R. Jessen in a comment on the question: .*BOURKE.*?(?=\r?\n)
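
For readers following the Python examples elsewhere on this page, the same regex idea can be sketched in Python. This is an analogue under stated assumptions, not the PowerShell answer's code: the sample string is shortened, re.escape() stands in for [regex]::Escape(), and re.IGNORECASE mimics the case-insensitive default of -match.

```python
import re

# Shortened, hypothetical sample with Windows-style CRLF newlines.
multi_line_str = "first line\r\n001 BOURKE, Bridget Mary\r\nlast line\r\n"

# re.escape() backslash-escapes any regex metacharacters in the literal substring.
needle = re.escape("BOURKE")

# .* stops at \n by default; [^\r\n]* also keeps a trailing \r out of the match.
m = re.search(r'.*' + needle + r'[^\r\n]*', multi_line_str, flags=re.IGNORECASE)
whole_line = m.group(0) if m else None

print(repr(whole_line))  # '001 BOURKE, Bridget Mary'
```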

understanding raw string for regular expressions in python

Don't double the backslash when using raw string:

>>> pattern3 = r'\n\n'
>>> pattern3
'\\n\\n'
>>> re.findall(pattern3, text)
['\n\n']
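
A self-contained sketch of the difference follows. The sample text is hypothetical, since the session above does not show how text was defined.

```python
import re

text = "first paragraph\n\nsecond paragraph"  # hypothetical sample input

# Raw string, single backslashes: the re parser sees \n\n, i.e. two newlines.
good = re.findall(r'\n\n', text)

# Raw string with doubled backslashes: the re parser sees \\n\\n, i.e.
# a literal backslash followed by the letter n, twice. No newline is involved.
bad = re.findall(r'\\n\\n', text)

print(good)  # ['\n\n']
print(bad)   # []
```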

get all the text between two newline characters(\n) of a raw_text using python regex

Try the following:

import re

text = '''1) \nTERMS \nDirect deposit; Routing #256078514, acct. #160935\n\n
2) \nTerms\nDue on receipt\nDue Date\n1/31/2021
3) \nTERMS: \nNET 30 DAYS\n''' # \n are real new lines

for m in re.finditer(r'(TERMS|Terms)\W*\n(.*?)\n', text):
    print(m.group(2))

  1. Note that your regex could not deal with the third 'line' because there is a colon : after TERMS. So I replaced \s with \W.

  2. ('TERMS' or 'Terms') in raw_text might not be what you want. It does not raise a syntax error, but it is just the same as 'TERMS' in raw_text: when Python evaluates the parenthesized part, both 'TERMS' and 'Terms' are truthy, and or returns the first truthy operand, i.e., 'TERMS'. The result is that Terms cannot be picked up by that part!

    So you might instead want something like ('TERMS' in raw_text) or ('Terms' in raw_text), although it is quite verbose.
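
Here is a small sketch of how Python evaluates that expression. The raw_text value is a hypothetical input that contains only the mixed-case spelling, and the any() variant is offered as one possible less verbose alternative, not something from the original answer.

```python
raw_text = "2) \nTerms\nDue on receipt\n"  # hypothetical input containing only 'Terms'

# `or` returns its first truthy operand, so the parenthesized part is 'TERMS'.
chosen = ('TERMS' or 'Terms')

buggy = ('TERMS' or 'Terms') in raw_text  # same as: 'TERMS' in raw_text
verbose = ('TERMS' in raw_text) or ('Terms' in raw_text)
compact = any(word in raw_text for word in ('TERMS', 'Terms'))

print(chosen, buggy, verbose, compact)  # TERMS False True True
```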

Python RegEx Matching Newline

Don't use re.DOTALL or the dot will match newlines, too. Also use raw strings (r"...") for regexes:

for m in re.findall(r'[0-9]{8}.*\n.*\n.*\n.*\n.*', l):
    print(m)

However, your version still should have worked (although very inefficiently) if you have read the entire file as binary into memory as one large string.

So the question is, are you reading the file like this:

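with open("filename", "rb") as myfile:
    mydata = myfile.read()  # entire file as a single bytes object

# The file was opened in binary mode, so in Python 3 the pattern must be bytes too (br'...')
for m in re.findall(br'[0-9]{8}.*\n.*\n.*\n.*\n.*', mydata):
    print(m)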

Or are you working with single lines (for line in myfile: or myfile.readlines())? In that case, the regex can't work, of course.
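
The difference between the two reading styles can be sketched without a real file. Here io.StringIO stands in for the file object, and the sample content is hypothetical.

```python
import io
import re

# Hypothetical stand-in for the file: an 8-digit line followed by four more lines.
sample = "12345678 header\nline 2\nline 3\nline 4\nline 5\ntrailer\n"
pattern = r'[0-9]{8}.*\n.*\n.*\n.*\n.*'

# Whole content as one string: the pattern can span the newlines.
whole = re.findall(pattern, io.StringIO(sample).read())

# One line at a time: each string holds at most one newline (at its end),
# so a pattern that requires four of them can never match.
per_line = [m for line in io.StringIO(sample) for m in re.findall(pattern, line)]

print(len(whole), len(per_line))  # 1 0
```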

How to match line beginning and end in a multi-line string

The problem is that since you're using raw strings for your input string, \n is seen as ... well, \ then n. Regexes will understand \n in the pattern, but not in the input string.

Also, even though it doesn't matter here, always pass flags with the flags= keyword: some regex functions (such as re.sub) have an extra count parameter before flags, and passing a flag positionally can lead to errors.

like this:

>>> re.match(r".*^score = 0\.65$.*", "score = 0.65\nscore = 0.59\nscore = 1.0", flags=re.MULTILINE)
<_sre.SRE_Match object; span=(0, 12), match='score = 0.65'>

And as I noted in the comments, .* needs re.DOTALL to match newlines:

>>> re.match(r".*^score = \d+\.\d+$.*", "score = 0.65\nscore = 0.59\nscore = 1.0", re.MULTILINE|re.DOTALL)
<_sre.SRE_Match object; span=(0, 37), match='score = 0.65\nscore = 0.59\nscore = 1.0'>

(as noted in Python regex, matching pattern over multiple lines.. why isn't this working? and How do I match any character across multiple lines in a regular expression? of which this could be a duplicate if it wasn't for the raw string bit)

(sorry, my floating point regex is probably a bit weak, you can find better ones around)
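
Both points, the raw input string and the flags= keyword, can be checked with a short script. This is a sketch with made-up variable names; the re.sub call with a positional flag is included only to show the pitfall, and recent Python versions may emit a deprecation warning for it.

```python
import re

normal_input = "score = 0.65\nscore = 0.59"    # contains a real newline
raw_input_str = r"score = 0.65\nscore = 0.59"  # contains backslash + n instead

print("\n" in normal_input, "\n" in raw_input_str)  # True False

# With a real newline, MULTILINE lets ^...$ match each line.
lines = re.findall(r"^score = \d+\.\d+$", normal_input, flags=re.MULTILINE)
print(lines)  # ['score = 0.65', 'score = 0.59']

# The raw input is a single "line", so the anchored pattern finds nothing.
none_found = re.findall(r"^score = \d+\.\d+$", raw_input_str, flags=re.MULTILINE)
print(none_found)  # []

# Why flags= matters: the fourth positional parameter of re.sub is count,
# so a flag passed positionally is silently treated as a replacement limit.
wrong = re.sub(r"^score", "S", normal_input, re.MULTILINE)
right = re.sub(r"^score", "S", normal_input, flags=re.MULTILINE)
print(repr(wrong))  # 'S = 0.65\nscore = 0.59'
print(repr(right))  # 'S = 0.65\nS = 0.59'
```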


