Matching Any Character Including Newlines in a Python Regex Subexpression, Not Globally

matching any character including newlines in a Python regex subexpression, not globally

To match a newline, or "any symbol" without re.S/re.DOTALL, you may use any of the following:

  1. (?s). - the inline modifier group with s flag on sets a scope where all . patterns match any char including line break chars

  2. Any of the following work-arounds:

[\s\S]
[\w\W]
[\d\D]

The main idea is that the opposite shorthand classes inside a character class match any symbol there is in the input string.

Comparing it to (.|\s) and other variations with alternation, the character class solution is much more efficient as it involves much less backtracking (when used with a * or + quantifier). Compare the small example: it takes (?:.|\n)+ 45 steps to complete, and it takes [\s\S]+ just 2 steps.

See a Python demo where I am matching a line starting with 123 and up to the first occurrence of 3 at the start of a line and including the rest of that line:

import re
text = """abc
123
def
356
more text..."""
print( re.findall(r"^123(?s:.*?)^3.*", text, re.M) )
# => ['123\ndef\n356']
print( re.findall(r"^123[\w\W]*?^3.*", text, re.M) )
# => ['123\ndef\n356']

Can I match multiline string in python without using re.DOTALL?

You can achieve the same by using a little trick like this [\s\S].

The idea behind [\s\S] is to capture everything, so you can delimit what you want using an explicit pattern. For instance:

/\*        <--- Match /*
[\s\S]*? <--- Match everything (ungreedy)
\*/ <--- Match */

You can use a regex like this:

/\*[\s\S]*?\*/

If you want to capture the content within the comment then you could do:

/\*([\s\S]*?)\*/

Working demo

You can see how this trick works below:

Sample Image

Btw, you are using a greedy regex /\*.*\*/ that will wrongly match comments. For instance, if you have:

/* A */
/* B */

You regex will wrongly match /* A *//* B */. You have to add ? to set it as ungreedy as this:

/\*.*?\*/
^--- ungreedy

Python returns no matches on working regex

The re.DOTALL flag is missing. Without it . won't match newlines.

But we can do better (depending on what you need exactly of course): https://regex101.com/r/iN6pX6/199

import re
import pprint

titles = '''
[Omitted for brevity]
..
'''

pattern = r'''
(?P<title>[^\n]+)\n
(?P<subtitle>[^\n]+)\n
((?P<etc>[^\n].*?)\n\n|\n)
'''

# Make sure we don't have any extraneous whitespace but add the separator
titles = titles.strip() + '\n\n'

for match in re.finditer(pattern, titles, re.DOTALL | re.VERBOSE):
title = match.group('title')
subtitle = match.group('subtitle')
etc = match.group('etc')
print('## %r' % title)
print('# %r' % subtitle)
if etc:
print(etc)
print()
# pprint.pprint(match.groupdict())

PHP Regex Match Newlines and Characters Except

Just use:

[^;]+

This will match newlines (since they are not ;), you just have to tell it to match more than just one character.

Demo: Regex101


Implementation:

$string = 'foo
bar
abc;123
test
';

preg_match_all('/[^;]+/', $string, $matches);
var_dump($matches);

// array(1) {
// [0]=>
// array(2) {
// [0]=>
// string(13) "foo
// bar
// abc"
// [1]=>
// string(11) "123
// test
// "
// }
// }

Alternative:

Is there a chance you just want to use str_replace() or explode(), here are examples using the same $string as above:

$string = str_replace(';', '', $string);
var_dump($string);

// string(24) "foo
// bar
// abc123
// test
// "

OR

$parts = explode(';', $string);
var_dump($parts);

// array(2) {
// [0]=>
// string(13) "foo
// bar
// abc"
// [1]=>
// string(11) "123
// test
// "
// }

RegEx in Sublime Text: Match any character, including newlines?

Try adding the (?s) inline flag start the start of the pattern. That will make . match any character.

Regular expression to match a line that doesn't contain a word

The notion that regex doesn't support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:

^((?!hede).)*$

Non-capturing variant:

^(?:(?!:hede).)*$

The regex above will match any string, or line without a line break, not containing the (sub)string 'hede'. As mentioned, this is not something regex is "good" at (or should do), but still, it is possible.

And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):

/^((?!hede).)*$/s

or use it inline:

/(?s)^((?!hede).)*$/

(where the /.../ are the regex delimiters, i.e., not part of the pattern)

If the DOT-ALL modifier is not available, you can mimic the same behavior with the character class [\s\S]:

/^((?!hede)[\s\S])*$/

Explanation

A string is just a list of n characters. Before, and after each character, there's an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":

    ┌──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┐
S = │e1│ A │e2│ B │e3│ h │e4│ e │e5│ d │e6│ e │e7│ C │e8│ D │e9│
└──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┘

index 0 1 2 3 4 5 6 7

where the e's are the empty strings. The regex (?!hede). looks ahead to see if there's no substring "hede" to be seen, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width-assertions because they don't consume any characters. They only assert/validate something.

So, in my example, every empty string is first validated to see if there's no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group, and repeated zero or more times: ((?!hede).)*. Finally, the start- and end-of-input are anchored to make sure the entire input is consumed: ^((?!hede).)*$

As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).

RegEx - How to find all occurrences, including newlines, between two points?

You might use a pattern with a capture group and a negative lookahead:

\bchanged:\s*true((?:\n(?!\s*\b(?:newTranslated:\s*''|importTokens:|changed:)).*)*)(?=\n\s*\bimportedTokens:)

The pattern matches:

  • \bchanged:\s*true Match the word changed and then : optional spaces and true
  • ( Capture group 1 (that has the data you are interested in)
    • (?: Non capture group to repeat as a whole part
    • \n Match a newline
    • (?!\s*\b(?:newTranslated:\s*''|importTokens:|changed:)).*)* Match the whole line if is does not contain any of the alternatives
  • ) Close group 1
  • (?=\n\s*\bimportedTokens:) Positive lookahead, assert a newline and importedTokens:


Related Topics



Leave a reply



Submit