How to Make Dot Match Newline Characters Using Regular Expressions

You need to use the DOTALL modifier (/s).

'/<div>(.*)<\/div>/s'

This might not give you exactly what you want because .* is greedy: it matches as much as it can. You might instead try a non-greedy match:

'/<div>(.*?)<\/div>/s'

You could also solve this by matching everything except '<' if there aren't other tags:

'/<div>([^<]*)<\/div>/'
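For readers outside PHP, the three patterns above can be compared in Python, where re.DOTALL plays the role of /s. This is only a minimal sketch; the sample string html is made up for illustration.

```python
import re

# Hypothetical sample input: two divs, the first spanning two lines.
html = "<div>first\nblock</div>\n<div>second</div>"

# Greedy + DOTALL: runs from the first <div> to the LAST </div>.
greedy = re.search(r"<div>(.*)</div>", html, re.DOTALL).group(1)

# Non-greedy + DOTALL: stops at the first </div>.
lazy = re.search(r"<div>(.*?)</div>", html, re.DOTALL).group(1)

# Negated class: [^<] already matches newlines, so no flag is needed.
negated = re.search(r"<div>([^<]*)</div>", html).group(1)

# No flag at all: the dot cannot cross the newline, so the first div is skipped.
no_flag = re.search(r"<div>(.*)</div>", html).group(1)

print(repr(greedy))   # 'first\nblock</div>\n<div>second'
print(repr(lazy))     # 'first\nblock'
print(repr(negated))  # 'first\nblock'
print(repr(no_flag))  # 'second'
```

The last line shows why a missing flag is easy to overlook: the pattern still matches, just not the div you expected.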

Another observation is that you don't need to use / as your regular expression delimiter. Using another character means that you don't have to escape the / in </div>, which improves readability. This applies to all the above regular expressions. Here's how it would look if you used '#' instead of '/':

'#<div>([^<]*)</div>#'

However all these solutions can fail due to nested divs, extra whitespace, HTML comments and various other things. HTML is too complicated to parse with Regex, so you should consider using an HTML parser instead.
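As one possible illustration of the parser route, here is a sketch using Python's standard html.parser module. The class name DivTextCollector and the sample markup are hypothetical, and a real page may call for a more complete library.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collects the text inside each top-level <div> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # current <div> nesting depth
        self.chunks = []    # text pieces of the div being read
        self.results = []   # finished text, one entry per top-level div

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks))
                self.chunks = []

    def handle_data(self, data):
        # Text is only kept while we are somewhere inside a div.
        if self.depth > 0:
            self.chunks.append(data)


parser = DivTextCollector()
parser.feed("<div>outer <div>inner\ntext</div> tail</div><!-- <div>ignored</div> -->")
parser.close()
print(parser.results)  # ['outer inner\ntext tail']
```

Nested divs, newlines and the commented-out div are all handled without any regex tuning, which is the point of the advice above.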

How do I match any character across multiple lines in a regular expression?

It depends on the language, but there should be a modifier that you can add to the regex pattern. In PHP it is:

/(.*)<FooBar>/s

The s at the end causes the dot to match all characters including newlines.
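In Python, for comparison, the same modifier is spelled re.DOTALL (alias re.S) or written inline as (?s). A small sketch with a made-up input string:

```python
import re

text = "line one\nline two\n<FooBar>"  # made-up sample input

# Without the flag the dot stops at newlines, so only the last line can match.
plain = re.search(r"(.*)<FooBar>", text).group(1)

# With the flag the dot also consumes the newlines.
flag = re.search(r"(.*)<FooBar>", text, re.DOTALL).group(1)

# The inline form at the start of the pattern is equivalent to passing the flag.
inline = re.search(r"(?s)(.*)<FooBar>", text).group(1)

print(repr(plain))   # ''
print(repr(flag))    # 'line one\nline two\n'
print(repr(inline))  # 'line one\nline two\n'
```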

How does the dot metacharacter match newline characters?

The page here http://www.regular-expressions.info/dot.html explains how the rule that dot does not match the end-of-line character exists mostly for historic reasons:

The first tools that used regular expressions were line-based. They would read a file line by line, and apply the regular expression separately to each line. The effect is that with these tools, the string could never contain line breaks, so the dot could never match them.

However,

Modern tools and languages can apply regular expressions to very large strings or even entire files. Except for JavaScript and VBScript, all regex flavors discussed here have an option to make the dot match all characters, including line breaks.

Apparently, R is one such language where by default, dot will match every character. (I point you to Joshua's comment above, recommending you look at ?regex and the POSIX 1003.2 standard.)


The page I linked above also mentions Perl and suggests how under its default mode, dot will not match line breaks.

Notice how R's grep function has a perl option. If you turn it on, you do get a different output:

> grep(".", c("\r","\n","\r\n"), perl = TRUE)
[1] 1 3

This tells me that \n is treated as the line break character but \r is not, which comparing cat("\r") and cat("\n") can confirm.

(I'm on a Mac OS if it makes any difference.)
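Python's re module behaves like the Perl mode here: by default the dot excludes only \n. The following sketch repeats the experiment, reporting 1-based positions so the output lines up with R's.

```python
import re

samples = ["\r", "\n", "\r\n"]

# 1-based positions of the strings in which "." finds something to match,
# mirroring the indices that R's grep() reports.
default_hits = [i for i, s in enumerate(samples, start=1) if re.search(r".", s)]
dotall_hits = [i for i, s in enumerate(samples, start=1)
               if re.search(r".", s, re.DOTALL)]

print(default_hits)  # [1, 3]
print(dotall_hits)   # [1, 2, 3]
```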

Match linebreaks - \n or \r\n?

I will answer in the opposite direction.


  1. For a full explanation about \r and \n I have to refer to this question, which is far more complete than I will post here: Difference between \n and \r?

Long story short, Linux uses \n for a new-line, Windows \r\n and old Macs \r. So there are multiple ways to write a newline. Your second tool (RegExr) does for example match on the single \r.

  2. [\r\n]+ as Ilya suggested will work, but will also match multiple consecutive new-lines. (\r\n|\r|\n) is more correct.
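The difference between the two patterns shows up as soon as the input contains an empty line. A short Python sketch with a made-up string that mixes all three line-ending styles:

```python
import re

# Made-up input: Windows, Unix (twice in a row) and old-Mac line endings.
text = "a\r\nb\n\nc\rd"

# One line break per match: the empty line between b and c is preserved.
by_alternation = re.split(r"\r\n|\r|\n", text)

# A run of line-break characters counts as one match: the empty line is lost.
by_class = re.split(r"[\r\n]+", text)

print(by_alternation)  # ['a', 'b', '', 'c', 'd']
print(by_class)        # ['a', 'b', 'c', 'd']
```

Note that \r\n comes first in the alternation; otherwise a Windows line ending would be counted as two separate breaks.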

Matching any character including newlines in a Python regex subexpression, not globally

To match a newline, or "any symbol" without re.S/re.DOTALL, you may use any of the following:

  1. (?s:.) - the inline modifier group with the s flag on sets a scope where all . patterns match any char, including line break chars

  2. Any of the following work-arounds:

[\s\S]
[\w\W]
[\d\D]

The main idea is that the opposite shorthand classes inside a character class match any symbol there is in the input string.

Comparing it to (.|\s) and other variations with alternation, the character class solution is much more efficient as it involves much less backtracking (when used with a * or + quantifier). Compare the small example: it takes (?:.|\n)+ 45 steps to complete, and it takes [\s\S]+ just 2 steps.
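Step counts like those come from a regex debugger and are not reproduced here. The sketch below only checks that, in Python's re module, the alternatives consume the same text; the input string is made up, and (?s:...) assumes Python 3.6 or later.

```python
import re

text = "abc\n123\r\n\tdef"  # made-up input containing \n, \r and a tab

patterns = [r"(?s:.)+", r"[\s\S]+", r"[\w\W]+", r"[\d\D]+", r"(?:.|\n)+"]

# Every variant should swallow the whole string, line breaks included.
matched = {p: re.match(p, text).group(0) for p in patterns}
all_equal = all(m == text for m in matched.values())

# For contrast, a plain dot stops at the first newline.
plain = re.match(r".+", text).group(0)

print(all_equal)  # True
print(plain)      # abc
```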

See a Python demo where I am matching a line starting with 123 and up to the first occurrence of 3 at the start of a line and including the rest of that line:

import re
text = """abc
123
def
356
more text..."""
print( re.findall(r"^123(?s:.*?)^3.*", text, re.M) )
# => ['123\ndef\n356']
print( re.findall(r"^123[\w\W]*?^3.*", text, re.M) )
# => ['123\ndef\n356']

Regex new line match inside angle brackets

Change .* to [^>]*. By default . matches everything except newline, so .* won't match across multiple lines.

<Button[^>]*className[^>]*>
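A quick way to see the difference is to run both versions against a tag that spans several lines. This is a Python sketch; the JSX-like sample is invented for the purpose.

```python
import re

# Hypothetical tag spread over several lines.
source = '<Button\n  type="submit"\n  className="primary"\n>Save</Button>'

# The dot cannot cross the newlines, so this version finds nothing.
dot_version = re.search(r"<Button.*className.*>", source)

# [^>] matches any character except ">", newlines included.
class_version = re.search(r"<Button[^>]*className[^>]*>", source)

print(dot_version)                   # None
print(repr(class_version.group(0)))
```

The negated class also stops at the first >, so the match cannot run past the end of the opening tag.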

What Raku regex modifier makes a dot match a newline (like Perl's /s)?

TL;DR The Raku equivalent for "Perl dot matches newline" is ., and for \Q...\E it's '...'.

There are ways to get better answers (more authoritative, comprehensive, etc than SO ones) to most questions like these more easily (typically just typing the search term of interest) and quickly (typically seconds, couple minutes tops). I address that in this answer.

What is Raku equivalent for "Perl dot matches newline"?

Just .

If you run the following Raku program:

/./s

you'll see the following error message:

Unsupported use of /s.  In Raku please use: .  or \N.

If you type . in the doc site's search box it lists several entries. One of them is . (regex). Clicking it provides examples and says:

An unescaped dot . in a regex matches any single character. ...

Notably . also matches a logical newline \n

My guess is you either didn't look for answers before asking here on SO (which is fair enough -- I'm not saying don't; that said you can often easily get good answers nearly instantly if you look in the right places, which I'll cover in this answer) or weren't satisfied by the answers you got (in which case, again, read on).

In case I've merely repeated what you've already read, or it's not enough info, I will provide a better answer below, after I write up an initial attempt to give a similar answer for your \Q...\E question -- and fail when I try the doc step.

What is Raku equivalent for Perl \Q...\E?

'...', or $foo if the ... was metasyntax for a variable name.

If you run the following Raku program:

/\Qfoo\E/

you'll see the following error message:

Unsupported use of \Q as quotemeta.  In Raku please use: quotes or
literal variable match.

If you type \Q...\E in the doc site's search box it lists just one entry: Not in Index (try site search). If you go ahead and try the search as suggested, you'll get matching pages according to google. For me the third page/match listed (Perl to Raku guide - in a nutshell: "using String::ShellQuote (because \Q…\E is not completely right) ...") is the only true positive match of \Q...\E among 27 matches. And it's obviously not what you're interested in.

So, searching the doc for \Q...\E appears to be a total bust.


How does one get answers to a question like "what is the Raku equivalent of Perl's \Q...\E?" if the doc site ain't helpful (and one doesn't realize Rakudo happens to have a built in error message dedicated to the exact thing of interest and/or isn't sure what the error message means)? What about questions where neither Rakudo nor the doc site are illuminating?

SO is one option, but what lets folk interested in Raku frequently get good/great answers to their questions easily and quickly when they can't get them from the doc site because the answer is hard to find or simply doesn't exist in the docs?

Easily get better answers more quickly than asking SO Qs

The docs website doesn't always yield a good answer to simple questions. Sometimes, as we clearly see with the \Q...\E case, it doesn't yield any answer at all for the relevant search term.

Fortunately there are several other easily searchable sources of rich and highly relevant info that often work when the doc site does not for certain kinds of info/searches. This is especially likely if you've got precise search terms in mind such as /s or \Q...\E and/or are willing to browse info provided it's high signal / low noise. I'll introduce two of these resources in the remainder of this answer.

Archived "spec" docs

Raku's design was written up in a series of "spec" docs written principally by Larry Wall over a 2 decade period.

(The word "specs" is short for "specification speculations". They are both ultra authoritative, detailed and precise specifications of the Raku language, authored primarily by Larry Wall himself, and mere speculations, because it was all subject to implementation. The two aspects are left entangled, and the docs are now out of date. So don't rely on them 100%, but don't ignore them either.)

The "specs", aka design docs, are a fantastic resource. You can search them using google by entering your search terms in the search box at design.raku.org.


A search for /s lists 25 pages. The only useful match is Synopsis 5: Regexes and Rules ("24 Jun 2002 - There are no /s or /m modifiers (changes to the meta-characters replace them - see below)."). Click it. Then do an in-page search for /s (note the space). You'll see 3 matches:

There are no /s or /m modifiers (changes to the meta-characters replace them - see below)

A dot . now matches any character including newline. (The /s modifier is gone.)

. matches an anything, while \N matches an anything except what \n matches. (The /s modifier is gone.) In particular, \N matches neither carriage return nor line feed.


A search for \Q...\E lists 7 pages. The only useful match is again Synopsis 5: Regexes and Rules ("24 Jun 2002 — \Q$var\E / ..."). Click it. Then do an in-page search for \Q. You'll see 2 matches:

In Raku / $var / is like a Perl / \Q$var\E /

\Q...\E sequences are gone.

Chat logs

I've expanded the Quicker answers section of my answer to one of your earlier Qs to discuss searching the Raku "chat logs". They are an incredibly rich mine of info with outstanding search features. Please read that section of my prior answer for clear general guidance. The rest of this answer will illustrate for /s and \Q...\E.


A search for the regex / newline . ** ^200 '/s' / in the old Raku channel from 2010 thru 2015 found this match:

. matches an anything, while \N matches an anything except what \n matches. (The /s modifier is gone.) In particular, \N matches neither carriage return nor line feed.

Note the shrewdness of my regex. The pattern is the word "newline" (which is hopefully not too common) followed within 200 characters by the two character sequence /s (which I suspect is more common than newline). And I constrained to 2010-2014 because a search for that regex of the entire 15 years of the old Raku channel would tax Liz's server and time out. I got that hit I've quoted above within a couple minutes of trying to find some suitable match of /s (not end-of-sarcasm!).


A search for \Q in the old Raku channel was an immediate success. Within 30 seconds of the thought "I could search the logs" I had a bunch of useful matches.

Is there any way to have dot (.) match newline in C++ TR1 Regular Expressions?

Boost.Regex has a mod_s flag to make the dot match newlines, but it's not part of the TR1 regex standard. (and not available as a Microsoft extension either, as far as I can see)

As a workaround, you could use [\s\S] (which means match any whitespace or any non-whitespace).

Regular Expression to match every new line character (\n) inside a content tag

Actually... you can't use a simple regex here, at least not one. You probably need to worry about comments! Someone may write:

<!-- <content> blah </content> -->

You can take two approaches here:

  1. Strip all comments out first. Then use the regex approach.
  2. Do not use regular expressions and use a context sensitive parsing approach that can keep track of whether or not you are nested in a comment.

Be careful.

I am also not so sure you can match all new lines at once. @Quartz suggested this one:

<content>([^\n]*\n+)+</content>

This will match any content tags that have a newline character RIGHT BEFORE the closing tag... but I'm not sure what you mean by matching all newlines. Do you want to be able to access all the matched newline characters? If so, your best bet is to grab all content tags, and then search for all the newline chars that are nested in between. Something more like this:

<content>.*</content>

BUT THERE IS ONE CAVEAT: quantifiers are greedy by default, so this regex will match from the first opening tag to the last closing one. Instead, you HAVE to make the quantifier non-greedy. In languages like Python, you can do this with the "?" regex symbol.

I hope with this you can see some of the pitfalls and figure out how you want to proceed. You are probably better off using an XML parsing library, then iterating over all the content tags.

I know I may not be offering the best solution, but at least I hope you will see the difficulty in this and why other answers may not be right...

UPDATE 1:

Let me summarize a bit more and add some more detail to my response. I am going to use Python's regex syntax because it is what I am more used to (forgive me ahead of time... you may need to escape some characters... comment on my post and I will correct it):

To strip out comments, use this regex:

<!--.*?-->

Notice the "?" suppresses the .* to make it non-greedy.

Similarly, to search for content tags, use:

<content>.*?</content>

Also, you may be able to try this out and access each newline character through the match object's groups():

<content>(.*?(\n))+.*?</content>

I know my escaping is off, but it captures the idea. This last example probably won't work, but I think it's your best bet at expressing what you want. My suggestion remains: either grab all the content tags and do it yourself, or use a parsing library.

UPDATE 2:

So here is Python code that ought to work. I am still unsure what you mean by "find" all newlines. Do you want the entire lines, or just a count of the newlines? To get the actual lines, try:

#!/usr/bin/python

import re

def FindContentNewlines(xml_text):
    # May want to compile these regexes elsewhere, but I do it here for brevity
    comments = re.compile(r"<!--.*?-->", re.DOTALL)
    content = re.compile(r"<content>(.*?)</content>", re.DOTALL)
    # No DOTALL here, so each match stays within a single line
    newlines = re.compile(r"^(.*)$", re.MULTILINE)

    # strip comments: this actually may not be reliable for "nested comments"
    # How does xml handle <!-- <!-- --> -->. I am not sure. But that COULD
    # be trouble.
    xml_text = re.sub(comments, "", xml_text)

    result = []
    all_contents = re.findall(content, xml_text)
    for c in all_contents:
        # collect every line found between the content tags
        result.extend(re.findall(newlines, c))

    return result

if __name__ == "__main__":
    example = """

<!-- This stuff
ought to be omitted
<content>
omitted
</content>
-->

This stuff is good
<content>
<p>
  haha!
</p>
</content>

This is not found
"""
    print(FindContentNewlines(example))

This program prints the result:

 ['', '<p>', '  haha!', '</p>', '']

The first and last empty strings come from the newline chars immediately preceding the first <p> and the one coming right after the </p>. All in all this (for the most part) does the trick. Experiment with this code and refine it for your needs. Print out stuff in the middle so you can see what the regexes are matching and not matching.
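If a count of the newlines is all that is needed, the same two regexes can feed a plain str.count call. This is only a sketch of that variant; the function name count_content_newlines and the sample string are hypothetical.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
CONTENT = re.compile(r"<content>(.*?)</content>", re.DOTALL)


def count_content_newlines(xml_text):
    """Return the number of newline characters inside each <content> tag."""
    # Drop comments first so commented-out content tags are not counted.
    without_comments = COMMENT.sub("", xml_text)
    return [body.count("\n") for body in CONTENT.findall(without_comments)]


sample = (
    "<!-- <content>\nskip\n</content> -->\n"
    "<content>\n<p>\n  haha!\n</p>\n</content>"
)
print(count_content_newlines(sample))  # [4]
```

The same caveat about comments applies: the stripping step is a regex approximation, not a real XML parse.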

Hope this helps :-).

PS - I didn't have much luck trying out my regex from my first update to capture all the newlines... let me know if you do.


