.NET Regex Matching $ with the End of the String and Not of Line, Even with Multiline Enabled

It is clear your text contains a linebreak other than LF. In .NET regex, a dot matches any char but LF (a newline char, \n).

See the Multiline Mode section of the MSDN regex reference:

By default, $ matches only the end of the input string. If you specify the RegexOptions.Multiline option, it matches either the newline character (\n) or the end of the input string. It does not, however, match the carriage return/line feed character combination. To successfully match them, use the subexpression \r?$ instead of just $.

So, use

@"^(#+).+?\r?$"

The .+?\r?$ part lazily matches one or more chars other than LF, up to an optional CR that sits right before a line feed or the end of the string.

Or just use a negated character class:

@"^(#+)[^\r\n]+"

The [^\r\n]+ will match one or more chars other than CR/LF.
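To see how the two patterns above differ in practice, here is a minimal sketch. The sample text, the class name HeadingPatterns and the helper Matches are made up for illustration and are not from the original question. It suggests that with the \r?$ variant the optional CR ends up inside the matched value, while the negated character class stops before it.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HeadingPatterns
{
    // Hypothetical sample input with CRLF line endings (an assumption).
    public const string Sample = "# Title\r\nbody\r\n## Section\r\n";

    // Collects the full text of every match, in multiline mode.
    public static string[] Matches(string pattern) =>
        Regex.Matches(Sample, pattern, RegexOptions.Multiline)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        // The optional \r is consumed by the match, so each value ends in CR.
        foreach (var v in Matches(@"^(#+).+?\r?$"))
            Console.WriteLine(v.EndsWith("\r", StringComparison.Ordinal)); // True

        // The negated class stops before CR, so there is no trailing CR.
        foreach (var v in Matches(@"^(#+)[^\r\n]+"))
            Console.WriteLine(v.EndsWith("\r", StringComparison.Ordinal)); // False
    }
}
```

If the matched text is used afterwards, the negated character class (or trimming the value) is probably the more convenient choice.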

Trivial multiline regex fails in .NET but succeeds in ECMAScript - why?

This is caused by the fact that '$' in multiline mode matches a '\n', not '\r\n', which is the default linebreak on Windows. The solution is simply to add '\r?' in front of the '$' linebreak, like this:

^using ([\w\.]+);\r?$

Now it will match both '\n' and '\r\n'.

Edit:

When you enter multiline text on regex101, it uses '\n' as the linebreak, which is why the pattern works on their site.
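The behavior described in this answer can be checked with a short sketch. The class name UsingLines, the helper Count and the two sample inputs are assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UsingLines
{
    // Counts the matches of a pattern in multiline mode.
    public static int Count(string input, string pattern) =>
        Regex.Matches(input, pattern, RegexOptions.Multiline).Count;

    public static void Main()
    {
        // Hypothetical inputs: the same two lines with CRLF and with LF endings.
        const string crlf = "using System;\r\nusing System.IO;\r\n";
        const string lf = "using System;\nusing System.IO;\n";

        Console.WriteLine(Count(crlf, @"^using ([\w\.]+);$"));    // 0: CR sits between ';' and LF
        Console.WriteLine(Count(lf, @"^using ([\w\.]+);$"));      // 2
        Console.WriteLine(Count(crlf, @"^using ([\w\.]+);\r?$")); // 2
        Console.WriteLine(Count(lf, @"^using ([\w\.]+);\r?$"));   // 2
    }
}
```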

How do I match any character across multiple lines in a regular expression?

It depends on the language, but there should be a modifier that you can add to the regex pattern. In PHP it is:

/(.*)<FooBar>/s

The s at the end causes the dot to match all characters including newlines.

Regular Expression Match variable multiple lines?

You should use Singleline mode (RegexOptions.Singleline), which tells your C# regular expression that . matches any character (rather than any character except \n).

var regex = new Regex("Start of numbers(.*)End of numbers",
    RegexOptions.IgnoreCase | RegexOptions.Singleline);
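As a rough illustration of what the option changes, the sketch below runs the same pattern with and without Singleline. The input text, the class name NumbersBlock and the helper Find are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NumbersBlock
{
    // Hypothetical input; the numbers sit on their own lines.
    public const string Input = "Start of numbers\n1\n2\nEnd of numbers";

    // Runs the pattern from the answer with the given options.
    public static Match Find(RegexOptions options) =>
        Regex.Match(Input, "Start of numbers(.*)End of numbers", options);

    public static void Main()
    {
        // Without Singleline the dot cannot cross the line feeds.
        Console.WriteLine(Find(RegexOptions.IgnoreCase).Success); // False

        // With Singleline the group captures everything in between.
        var m = Find(RegexOptions.IgnoreCase | RegexOptions.Singleline);
        Console.WriteLine(m.Success);                             // True
        Console.WriteLine(m.Groups[1].Value == "\n1\n2\n");       // True
    }
}
```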

Regular expression to match a line that doesn't contain a word

The notion that regex doesn't support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:

^((?!hede).)*$

Non-capturing variant:

^(?:(?!hede).)*$

The regex above will match any string, or line without a line break, not containing the (sub)string 'hede'. As mentioned, this is not something regex is "good" at (or should do), but still, it is possible.

And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):

/^((?!hede).)*$/s

or use it inline:

/(?s)^((?!hede).)*$/

(where the /.../ are the regex delimiters, i.e., not part of the pattern)

If the DOT-ALL modifier is not available, you can mimic the same behavior with the character class [\s\S]:

/^((?!hede)[\s\S])*$/

Explanation

A string is just a list of n characters. Before, and after each character, there's an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":

    ┌──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┐
S = │e1│ A │e2│ B │e3│ h │e4│ e │e5│ d │e6│ e │e7│ C │e8│ D │e9│
    └──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┘

index    0      1      2      3      4      5      6      7

where the e's are the empty strings. The regex (?!hede). looks ahead to check that the substring "hede" does not start at the current position, and if that is the case (so something else is seen), the . (dot) matches any character except a line break. Look-arounds are also called zero-width assertions because they don't consume any characters. They only assert/validate something.

So, in my example, every empty string is first validated to see if there's no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group, and repeated zero or more times: ((?!hede).)*. Finally, the start- and end-of-input are anchored to make sure the entire input is consumed: ^((?!hede).)*$

As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).
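The walkthrough above can be tried out in .NET with a small sketch. The class name WithoutHede and the method IsFree are illustrative names, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WithoutHede
{
    // The pattern from the explanation: every character must be
    // preceded by a position where "hede" does not start.
    private static readonly Regex NoHede = new Regex(@"^((?!hede).)*$");

    // True when the input does not contain the substring "hede".
    public static bool IsFree(string input) => NoHede.IsMatch(input);

    public static void Main()
    {
        Console.WriteLine(IsFree("ABhedeCD")); // False: the look-ahead fails at e3
        Console.WriteLine(IsFree("ABCD"));     // True
        Console.WriteLine(IsFree("hed"));      // True: "hede" never appears in full
        Console.WriteLine(IsFree(""));         // True: zero repetitions are allowed
    }
}
```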

Regular expression works in tester but not in my code

I have probably solved the problem. I removed the ^ and $ from the pattern and now I get 1 match.

It seems that if the pattern itself spans multiple lines, you should not put ^ and $ in the middle lines.

using System;
using System.Text.RegularExpressions;

string pattern =
@"^(?<p1>.*?)(?<c0>\w+)(?<s1>.*?)
(?<p2>.*?)\k<c0>(?<s2>.*?)
\k<p1>(?<c1>\w+)\k<s1>
\k<p2>\k<c1>\k<s2>$";

string text =
@" if (forwardRadioButton.IsChecked.Value)
car = car.Forward(distance);
else if (backwardRadioButton.IsChecked.Value)
car = car.Backward(distance);
else if (forwardLeftRadioButton.IsChecked.Value)
car = car.ForwardLeft(distance);";

var mc = Regex.Matches(text, pattern, RegexOptions.IgnoreCase | RegexOptions.Multiline);

Console.WriteLine(mc.Count);
Console.ReadKey();

Regex to match exact string (do not allow terminating newline)

Solution for the current .NET regex

You should use the very end of string anchor that is \z in .NET regex:

Regex regexExact = new Regex(@"^abc\z");

See Anchors in Regular Expressions:

$    The match must occur at the end of the string or line, or before \n at the end of the string or line. For more information, see End of String or Line.

\Z    The match must occur at the end of the string, or before \n at the end of the string. For more information, see End of String or Before Ending Newline.

\z    The match must occur at the end of the string only. For more information, see End of String Only.

The same anchor can be used in .NET, Java, PCRE, Delphi, Ruby and PHP. In Python, use \Z. In JavaScript RegExp (ECMAScript) compatible patterns, the $ anchor matches the very end of string (if no /m modifier is defined).

Background

See Strings Ending with a Line Break at regular-expressions.info:

Because Perl returns a string with a newline at the end when reading a line from a file, Perl's regex engine matches $ at the position before the line break at the end of the string even when multi-line mode is turned off. Perl also matches $ at the very end of the string, regardless of whether that character is a line break. So ^\d+$ matches 123 whether the subject string is 123 or 123\n.

Most modern regex flavors have copied this behavior. That includes .NET, Java, PCRE, Delphi, PHP, and Python. This behavior is independent of any settings such as "multi-line mode".

In all these flavors except Python, \Z also matches before the final line break. If you only want a match at the absolute very end of the string, use \z (lower case z instead of upper case Z). \A\d+\z does not match 123\n. \z matches after the line break, which is not matched by the shorthand character class.

In Python, \Z matches only at the very end of the string. Python does not support \z.
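For .NET, the difference between the three end anchors can be sketched as follows. The class name EndAnchors and the helper Test are hypothetical, and the inputs are the ones used in the text above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EndAnchors
{
    // Thin wrapper so pattern and input read in the same order as in the text.
    public static bool Test(string pattern, string input) =>
        Regex.IsMatch(input, pattern);

    public static void Main()
    {
        // $ and \Z both tolerate one trailing line feed.
        Console.WriteLine(Test(@"^abc$", "abc\n"));     // True
        Console.WriteLine(Test(@"^abc\Z", "abc\n"));    // True

        // \z matches only at the very end of the string.
        Console.WriteLine(Test(@"^abc\z", "abc\n"));    // False
        Console.WriteLine(Test(@"^abc\z", "abc"));      // True

        // The example quoted from regular-expressions.info.
        Console.WriteLine(Test(@"\A\d+\z", "123\n"));   // False
    }
}
```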


