Why \b does not match a word using .NET regex

C# and .NET regular expressions each have their own set of backslash escape sequences. In a regular C# string literal, the compiler interprets "\b" first and converts it into an ASCII backspace character (U+0008), so the Regex class never sees the backslash. Either make the string verbatim (prefix it with an at sign) or escape the backslash itself so that it is passed through to Regex:

@"\bCOMPILATION UNIT"

Or

"\\bCOMPILATION UNIT"

I'll say the .NET RegEx documentation does not make this clear. It took me a while to figure this out at first too.

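Fun fact: \r and \n (carriage return and line feed, respectively), along with some other escapes, are recognized by both the C# compiler and Regex. The end result is the same either way, even though the string that reaches Regex is different: with a regular literal the pattern contains the actual control character, and with a verbatim literal it contains the two-character escape that Regex interprets itself.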
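
To see the difference for yourself, here is a minimal sketch written as C# top-level statements (so it assumes C# 9 or later). The input string is made up for illustration; only the COMPILATION UNIT pattern comes from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input, chosen only for this illustration.
string input = "BEGIN COMPILATION UNIT";

// Regular literal: the C# compiler turns \b into U+0008 (backspace).
string regular = "\bCOMPILATION UNIT";
// Verbatim literal and escaped backslash: both hand the backslash to Regex.
string verbatim = @"\bCOMPILATION UNIT";
string escaped = "\\bCOMPILATION UNIT";

Console.WriteLine((int)regular[0]);                 // code point of the first character (8)
Console.WriteLine(verbatim == escaped);             // both literals produce the same pattern text
Console.WriteLine(Regex.IsMatch(input, regular));   // searches for a literal backspace
Console.WriteLine(Regex.IsMatch(input, verbatim));  // searches for a word boundary

// \n works either way: once as a real line feed, once as a regex escape.
bool newlineRegular = Regex.IsMatch("a\nb", "\n");
bool newlineVerbatim = Regex.IsMatch("a\nb", @"\n");
Console.WriteLine(newlineRegular && newlineVerbatim);
```

Only the verbatim (or doubly escaped) pattern finds the word, because the regular literal never contained a backslash in the first place.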

Using \b in a .NET regex

Sure. Your \b is actually the backspace character, not the regex \b. You need to either use "\\b" to embed this in a C# string literal, or use verbatim string literals: @"\b".

Remember: the backslash is an escape character in C# string literals just as it is in regex. Unless you use a verbatim string literal, you need to escape things twice: once for the string literal and once for the regex.

Another thing: stay away from \b, and likewise from \w. \b is an anchor defined in terms of \w, and \w is a character class that is pretty much useless for anything except quick one-off tasks where you have very tight control over everything you want to match. \b simply means that on one side of the anchor there is a character matching \w and on the other side there isn't (either end of the string or a character matching \W). Now, \w includes things like digits and _. When I search for vaguely word-like things, I at least tend not to think of digits and underscores as parts of words.

Oftentimes I like to make explicit what I'm actually looking for, e.g. via lookaround assertions: (?<!\p{L}) specifies that no letter directly precedes the current point in the match, which effectively replaces \b at the start of the pattern. Likewise, (?!\p{L}) can be used in place of the \b at the end of the pattern.

Written like this, you have much more control over what you consider suitable "boundaries" for the things you're looking for. For example, maybe you want to find foo only when it's bounded by whitespace: (?<![^\s])foo(?![^\s]), which is the same as (?<!\S)foo(?!\S). Note the double negative here: the lookbehind and lookahead are negative assertions about non-whitespace so that they also succeed at the start and end of the string.
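
As a rough comparison of the three kinds of boundary discussed here, the following sketch (C# 9+ top-level statements; all sample inputs are invented for illustration) runs \b, the \p{L} lookarounds, and the whitespace lookarounds in the (?<!\S)...(?!\S) spelling against a few strings.

```csharp
using System;
using System.Text.RegularExpressions;

string wordBoundary = @"\bfoo\b";
string letterBoundary = @"(?<!\p{L})foo(?!\p{L})";
string whitespaceBoundary = @"(?<!\S)foo(?!\S)";

// '_' is a \w character, so \b sees no boundary after "foo" in "foo_1",
// while the letter-based lookahead only cares that no letter follows.
bool wordOnUnderscore = Regex.IsMatch("foo_1", wordBoundary);
bool letterOnUnderscore = Regex.IsMatch("foo_1", letterBoundary);

// A letter directly after "foo" is rejected by the letter-based version.
bool letterOnFoobar = Regex.IsMatch("foobar", letterBoundary);

// Parentheses are fine for the letter-based version but are not whitespace.
bool letterOnParens = Regex.IsMatch("(foo)", letterBoundary);
bool spaceOnParens = Regex.IsMatch("(foo)", whitespaceBoundary);

// The negative lookarounds also succeed at the very start and end of the string.
bool spaceOnBare = Regex.IsMatch("foo", whitespaceBoundary);
bool spaceInSentence = Regex.IsMatch("a foo b", whitespaceBoundary);

Console.WriteLine($"{wordOnUnderscore} {letterOnUnderscore} {letterOnFoobar}");
Console.WriteLine($"{letterOnParens} {spaceOnParens} {spaceOnBare} {spaceInSentence}");
```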

Why is this word boundary regex not matching

. is not a word character. \b checks for word boundaries, i.e. positions between a word character and a character that is not considered part of a word. Therefore you cannot expect the . to be inside the "word" 1., because these two characters do not form a word together.


The quick reference describes \b as:

The match must occur on a boundary between a \w (alphanumeric) and a \W (nonalphanumeric) character.

And \w is described as:

Matches any word character.

If you check what a word character is, you will find it includes these Unicode categories:

  • Ll [Letter, Lowercase]
  • Lu [Letter, Uppercase]
  • Lt [Letter, Titlecase]
  • Lo [Letter, Other]
  • Lm [Letter, Modifier]
  • Mn [Mark, Nonspacing]
  • Nd [Number, Decimal Digit]
  • Pc [Punctuation, Connector]

But . has Unicode class Po [Punctuation, Other] which is not listed above.

So the word boundary that \b finds in 1. is right between the 1 and the ., not after the dot. That answers your "why".
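
One way to check this is to list every position where \b succeeds in the string 1. and then try two patterns against it. This is only a sketch (C# 9+ top-level statements); the patterns \b1\b and \b1\.\b are assumptions made for the illustration, not the pattern from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "1.";

// Collect the index of every (zero-width) position where \b succeeds.
int[] boundaries = Regex.Matches(input, @"\b")
    .Cast<Match>()
    .Select(m => m.Index)
    .ToArray();
Console.WriteLine(string.Join(", ", boundaries));  // 0, 1

// A boundary exists between '1' and '.', so the digit alone is a "word".
bool digitOnly = Regex.IsMatch(input, @"\b1\b");
// There is no boundary after '.', because '.' and end of string are both non-word.
bool digitAndDot = Regex.IsMatch(input, @"\b1\.\b");
Console.WriteLine($"{digitOnly} {digitAndDot}");
```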

Note: .NET regular expressions are best tested on sites dedicated to that flavor, for example Regex Storm. If you test your pattern against the PCRE flavor (as on the site you linked), you can get results that differ from .NET.

Regular expressions with word boundaries fail in .NET Regex

Replace

Regex regFail = new Regex(@"\b§pattern§\b");

with

Regex regFail = new Regex(@"§\bpattern\b§");

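§ is a non-word character, so a \b placed directly next to it only matches when a word character sits on its other side. With \b outside the § signs, the pattern therefore cannot match when the § signs are surrounded by spaces or by the ends of the string. Perhaps you do not even need \b here, since the pattern is already enclosed in non-word characters?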

Regex regFail = new Regex(@"§pattern§");
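
The three variants can be compared side by side. The sketch below (C# 9+ top-level statements) uses invented sample inputs and keeps the literal word pattern from the answer as a stand-in for the real expression.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: the § signs are surrounded by spaces.
string input = "some §pattern§ here";

// \b outside the § signs needs a word character next to each §.
bool boundaryOutside = Regex.IsMatch(input, @"\b§pattern§\b");
// \b inside the § signs sits between § (non-word) and a letter (word).
bool boundaryInside = Regex.IsMatch(input, @"§\bpattern\b§");
// Without any \b the § signs alone delimit the match.
bool noBoundary = Regex.IsMatch(input, @"§pattern§");

// The first form only works when word characters touch the § signs.
bool gluedToLetters = Regex.IsMatch("a§pattern§b", @"\b§pattern§\b");

Console.WriteLine($"{boundaryOutside} {boundaryInside} {noBoundary} {gluedToLetters}");
```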

.Net Regular Expression matching the string C#

The \b does not match between the pound sign and a space, because both are non-word characters, but it does match between the pound sign and the d character.

Instead of a second word boundary \b, you could use a negative lookahead (?!\S) to assert that what is on the right is not a non-whitespace character \S:

\bC#(?!\S)

As pointed out in the comments by @elgonzo, to keep the match from breaking when a non-word character follows C#, you could use a positive lookahead to assert that what is on the right is either a non-word character \W or the end of the string $:

\bC#(?=\W|$)
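
To make the trade-offs between \bC#\b and the two lookahead variants concrete, here is a small sketch (C# 9+ top-level statements). The test sentences are hypothetical and were picked to cover a following space, a comma, the end of the string, and a letter.

```csharp
using System;
using System.Text.RegularExpressions;

string trailingBoundary = @"\bC#\b";
string notNonWhitespace = @"\bC#(?!\S)";
string nonWordOrEnd = @"\bC#(?=\W|$)";

// Hypothetical inputs: space, comma, end of string, and a letter after "C#".
string[] inputs = { "C# developer", "I like C#, a lot", "I like C#", "C#d" };

foreach (string text in inputs)
{
    Console.WriteLine(
        $"{text,-18} {Regex.IsMatch(text, trailingBoundary),-6} " +
        $"{Regex.IsMatch(text, notNonWhitespace),-6} {Regex.IsMatch(text, nonWordOrEnd)}");
}
```

In this sketch the trailing \b only succeeds for C#d, (?!\S) rejects the comma case, and (?=\W|$) accepts everything except C#d.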

Regex to match a string which does not contain a specific word next to the match string

I want regex which does not contain not(in first string), I want to match only 2nd string.

That means you should check that the This is... pattern is not followed by a newline sequence, optional spaces, and not as a whole word, with backtracking into the first part disabled. In .NET, an atomic group disables backtracking:

(?>This\s+is(?:\s+\d+)+ *)(?![\r\n]+\p{Zs}*not\b)

The first part of the regex, This\s+is(?:\s+\d+)+ *, matches This is followed by one or more sequences of one or more whitespace characters and one or more digits, and then by zero or more spaces. The (?>...) group prevents backtracking into this part of the pattern once it has matched. The lookahead (?![\r\n]+\p{Zs}*not\b) fails the match if the previously matched text is followed by one or more line breaks, optional horizontal whitespace, and the whole word not (where \b stands for a word boundary).
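
The effect of the atomic group is easiest to see next to the same pattern written with an ordinary non-capturing group. The sketch below (C# 9+ top-level statements) uses two made-up inputs, since the original strings from the question are not shown here.

```csharp
using System;
using System.Text.RegularExpressions;

string atomic = @"(?>This\s+is(?:\s+\d+)+ *)(?![\r\n]+\p{Zs}*not\b)";
// Same pattern with a plain group, for comparison only.
string backtracking = @"(?:This\s+is(?:\s+\d+)+ *)(?![\r\n]+\p{Zs}*not\b)";

// Hypothetical inputs: the next line starts with "not" in the first one only.
string followedByNot = "This is 1 2 3\n   not wanted";
string followedByOther = "This is 4 5 6\n   wanted";

// Atomic group: the lookahead fails and nothing can be given back.
Match atomicOnNot = Regex.Match(followedByNot, atomic);
Match atomicOnOther = Regex.Match(followedByOther, atomic);

// Plain group: the engine gives up the last number and matches a shorter prefix.
Match plainOnNot = Regex.Match(followedByNot, backtracking);

Console.WriteLine(atomicOnNot.Success);
Console.WriteLine($"[{atomicOnOther.Value}]");
Console.WriteLine($"[{plainOnNot.Value}]");
```

Without the atomic group the first input still produces a match, just a truncated one, which is exactly what the answer wants to rule out.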

How would I write a regular expression to match numeric or alphanumeric words, but not words without numbers?

You may use

(?xi)              # Enable free-spacing and case-insensitive mode
\b                 # Word boundary
(?=[A-Z.]*[0-9])   # After any 0+ letters/dots there must be a digit
[A-Z0-9]+          # 1+ letters or digits
(?:\.[A-Z0-9]+)*   # 0+ repetitions of a . and then 1+ letters/digits
\b                 # Word boundary

See the regex demo at regex101.com and a .NET regex demo showing it really works in a .NET environment.

In C# code, you may use

var Pattern = new Regex(@"
    \b                 # Word boundary
    (?=[A-Z.]*[0-9])   # After any 0+ letters/dots there must be a digit
    [A-Z0-9]+          # 1+ letters or digits
    (?:\.[A-Z0-9]+)*   # 0+ repetitions of a . and then 1+ letters/digits
    \b                 # Word boundary",
    RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

where (?x) = RegexOptions.IgnorePatternWhitespace and (?i) = RegexOptions.IgnoreCase.
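
For a quick check of what the pattern accepts, the following sketch (C# 9+ top-level statements) runs a single-line version of it over an invented sample string. The input is an assumption made for illustration only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Same pattern as above, written on one line without free-spacing mode.
var pattern = new Regex(
    @"\b(?=[A-Z.]*[0-9])[A-Z0-9]+(?:\.[A-Z0-9]+)*\b",
    RegexOptions.IgnoreCase);

// Hypothetical input mixing pure letters, pure digits, and mixed tokens.
string input = "abc 123 a1b2 v1.2.3 foo.bar";

string[] found = pattern.Matches(input)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

// Tokens without any digit (abc, foo.bar) are skipped by the lookahead.
Console.WriteLine(string.Join(" | ", found));  // 123 | a1b2 | v1.2.3
```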

Regex: Match word not containing

Your ^((?!Drive).)*$ did not work at all because you tested against a multiline input.

You should use the /m modifier (RegexOptions.Multiline in .NET) to see what the regex matches. It just matches lines that do not contain Drive; that tempered greedy token does not check whether EFI is in the string.

Actually, the $ anchor is redundant here since .* matches any zero or more characters other than line break characters. You may simply remove it from your pattern.

(NOTE: In .NET, you will need to use [^\r\n]* instead of .*, since . in a .NET pattern matches any character except a line feed (LF) and therefore also matches other line break characters, such as a carriage return (CR).)

Use something like

^(?!.*Drive).*EFI.*

Or, if you need to only fail the match if a Drive is present as a whole word:

^(?!.*\bDrive\b).*EFI.*

Or, if there are more words you want to signal the failure with:

^(?!.*(?:Drive|SomethingElse)).*EFI.*
^(?!.*\b(?:Drive|SomethingElse)\b).*EFI.*

Here,

  • ^ - matches start of string
  • (?!.*Drive) - makes sure there is no "Drive" in the string (so, Drives are NOT allowed)
  • (?!.*\bDrive\b) - makes sure there is no "Drive" as a whole word in the string (so, Drives are allowed)
  • .* - any 0+ chars other than line break chars, as many as possible
  • EFI - an EFI substring
  • .* - any 0+ chars other than line break chars, as many as possible.

If your string has newlines, either use the /s dotall modifier (RegexOptions.Singleline in .NET) or replace . with [\s\S].
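
As an illustration of the line-by-line patterns above, here is a sketch (C# 9+ top-level statements) that applies two of them with RegexOptions.Multiline, the .NET counterpart of /m. The multi-line input is invented and uses plain \n line endings, so the carriage-return caveat from the note does not come into play.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical input, one entry per line, separated by \n only.
string input = "EFI System Partition\nEFI on Drive C\nBoot Drives with EFI\nRecovery";

// Multiline makes ^ match at the start of every line.
var noDriveAnywhere = new Regex(@"^(?!.*Drive).*EFI.*", RegexOptions.Multiline);
var noWholeWordDrive = new Regex(@"^(?!.*\bDrive\b).*EFI.*", RegexOptions.Multiline);

string[] strict = noDriveAnywhere.Matches(input)
    .Cast<Match>().Select(m => m.Value).ToArray();
string[] wholeWord = noWholeWordDrive.Matches(input)
    .Cast<Match>().Select(m => m.Value).ToArray();

// "Drives" is rejected by the first pattern but allowed by the second.
Console.WriteLine(string.Join(" | ", strict));
Console.WriteLine(string.Join(" | ", wholeWord));
```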


