Word Boundary With Words Starting or Ending With Special Characters Gives Unexpected Results

Word boundary with words starting or ending with special characters gives unexpected results

See what a word boundary matches:

A word boundary can occur in one of three positions:

  • Before the first character in the string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between two characters in the string, where one is a word character and the other is not a word character.

In your pattern }\b only matches if there is a word char after } (a letter, digit or _).

When you use (\W|$) you require a non-word or end of string explicitly.
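A minimal sketch of the two behaviours described above, using the string from the question. The variable names are illustrative only.

```python
import re

text = r'test Sortes\index[persons]{Sortes} test'
key = r'Sortes\index[persons]{Sortes}'

# \b after "}" needs a word char next, but a space follows, so nothing matches.
plain = re.search(r'\b' + re.escape(key) + r'\b', text)

# Requiring a non-word char or the end of string explicitly does match,
# at the cost of consuming the trailing character.
explicit = re.search(r'\b' + re.escape(key) + r'(?:\W|$)', text)

print(plain)                   # None
print(repr(explicit.group()))  # the key followed by one space
```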

A solution is adaptive word boundaries:

re.search(r'(?:(?!\w)|\b(?=\w)){}(?:(?<=\w)\b|(?<!\w))'.format(re.escape(r'Sortes\index[persons]{Sortes}')), r'test Sortes\index[persons]{Sortes} test')

Or equivalent:

re.search(r'(?!\B\w){}(?<!\w\B)'.format(re.escape(r'Sortes\index[persons]{Sortes}')), r'test Sortes\index[persons]{Sortes} test')

Here, adaptive dynamic word boundaries are used that mean the following:

  • (?:(?!\w)|\b(?=\w)) (equivalent to (?!\B\w)) is the left-hand boundary. It requires a word boundary at the current position if the next char is a word char, and applies no context restriction if the next char is not a word char. (Use (?:\B(?!\w)|\b(?=\w)) instead if you want to disallow a word char immediately on the left when the next char is not a word char.)
  • (?:(?<=\w)\b|(?<!\w)) (equivalent to (?<!\w\B)) is the right-hand boundary. It requires a word boundary at the current position if the previous char is a word char, and applies no context restriction if the previous char is not a word char. (Use (?:(?<=\w)\b|\B(?<!\w)) instead if you want to disallow a word char immediately on the right when the preceding char is not a word char.)
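As a rough check of the two bullets above, the sketch below builds both spellings of the adaptive boundaries and compares them on a few inputs. The helper name `adaptive` and the sample strings are assumptions made for illustration.

```python
import re

def adaptive(key, long_form=False):
    """Wrap an escaped literal in adaptive word boundaries (hypothetical helper)."""
    if long_form:
        left, right = r'(?:(?!\w)|\b(?=\w))', r'(?:(?<=\w)\b|(?<!\w))'
    else:
        left, right = r'(?!\B\w)', r'(?<!\w\B)'
    return re.compile(left + re.escape(key) + right)

cases = [
    ('[test]', 'a[test]b'),    # non-word edges: no restriction applies
    ('test', 'a test b'),      # word edges: behaves like \btest\b
    ('test', 'atestb'),        # word edges glued to letters: rejected
    ('!test1', 'x !test1 y'),  # mixed: only the right edge needs a boundary
]

results = []
for key, text in cases:
    short = bool(adaptive(key).search(text))
    long_ = bool(adaptive(key, long_form=True).search(text))
    results.append((short, long_))
    print(key, repr(text), short, long_)
```

On these inputs both spellings agree, and only the 'atestb' case is rejected.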

You might also consider using unambiguous word boundaries based on negative lookarounds in these cases:

re.search(r'(?<!\w){}(?!\w)'.format(re.escape(r'Sortes\index[persons]{Sortes}')), r'test Sortes\index[persons]{Sortes} test')

Here, (?<!\w) negative lookbehind will fail the match if there is a word char immediately to the left of the current location, and (?!\w) negative lookahead will fail the match if there is a word char immediately to the right of the current location.

Which to choose? Adaptive word boundaries are more lenient than unambiguous word boundaries: the latter require that there be no word chars on either side of the match, while the former let leading and trailing non-word chars of the pattern match in any context.
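The difference shows up when the key starts and ends with non-word chars and is glued to surrounding letters. This is a small sketch with made-up sample strings:

```python
import re

key = re.escape('[test]')
adaptive_rx = re.compile(r'(?!\B\w)' + key + r'(?<!\w\B)')
unambiguous_rx = re.compile(r'(?<!\w)' + key + r'(?!\w)')

glued = 'version[test]build'     # letters touch both brackets
spaced = 'version [test] build'  # whitespace on both sides

glued_result = (bool(adaptive_rx.search(glued)), bool(unambiguous_rx.search(glued)))
spaced_result = (bool(adaptive_rx.search(spaced)), bool(unambiguous_rx.search(spaced)))

print(glued_result)   # (True, False)
print(spaced_result)  # (True, True)
```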

Note: These lookaround patterns are easy to customize further. For example, to fail the match only when letters surround the pattern, use [^\W\d_] instead of \w; to allow matches only between whitespace (or at the start/end of the string), use the whitespace boundaries (?<!\S) and (?!\S).
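A short sketch of both customizations, assuming a key of "[test]" and a few invented inputs:

```python
import re

key = re.escape('[test]')

# Letter-only boundaries: digits and underscores next to the key are tolerated.
letters_only = re.compile(r'(?<![^\W\d_])' + key + r'(?![^\W\d_])')

# Whitespace boundaries: the key must sit between whitespace or string edges.
whitespace = re.compile(r'(?<!\S)' + key + r'(?!\S)')

print(bool(letters_only.search('1[test]2')))       # True: digits do not block
print(bool(letters_only.search('a[test]b')))       # False: letters block
print(bool(whitespace.search('see [test] now')))   # True
print(bool(whitespace.search('see [test], now')))  # False: a comma follows
```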

Word boundaries not matching when the word starts or ends with special character like square brackets

You must account for two things here:

  • Special characters must be escaped with a literal \ symbol, which is best done with the Regex.Escape method when dynamic literal text is passed to the regex as a variable.
  • You cannot rely on the word boundary, \b, because the meaning of this construct depends on the immediate context.

You can use dynamic adaptive word boundaries (see my YT video about these word boundaries):

string input = "This is [test] version of application.";
string key = "[test]";
string stringtoFind = $@"(?!\B\w){Regex.Escape(key)}(?<!\w\B)";
Console.WriteLine(Regex.Replace(input, stringtoFind, "1.0"));

You may also use Regex.Escape with unambiguous word boundaries (?<!\w) and (?!\w):

string input = "This is [test] version of application.";
string key = "[test]";
string stringtoFind = $@"(?<!\w){Regex.Escape(key)}(?!\w)";
Console.WriteLine(Regex.Replace(input, stringtoFind, "1.0"));

Note that if you want to replace a key string only when it is enclosed with whitespace (or sits at the start/end of the string), use the (?<!\S) and (?!\S) boundaries:

string stringtoFind = $@"(?<!\S){Regex.Escape(key)}(?!\S)";

How can I use \b boundary around special characters

You can use the pattern:

(?<!\w)✅(?!\w) 

This uses negative lookarounds to match an emoji with no word characters on either side.

The reason for the matches you asked about is that \b is a zero-width boundary where one side of the boundary is \w (a word character; [0-9A-Za-z_] in ASCII-only matching) and the other is the beginning or end of the string or \W (a non-word character).

For example, consider the string "foo.":

start of string boundary (zero width)
|
|  non-word character
|  |
v  v
foo.
^ ^
| |
word characters

The \b boundary could be used in the regex \bfoo\b and find a match thanks to the boundary between o and . characters and the boundary between the beginning of the string and the character f.

"foobar" does not match \bfoo\b because the second o and b don't satisfy the boundary condition, that is, b isn't a non-word character or end of the string.

The pattern \b-\b does not match the string "-" because "-" isn't a word character, so there is no word character on either side to form a boundary. Likewise, emojis consist of non-word characters, so \b around them does not behave the way it does around the word characters in \bfoo\b.
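The cases above can be checked quickly in Python, which is used here only for illustration. The sample strings are assumptions, and the emoji is the one from the pattern shown earlier.

```python
import re

checks = {
    'foo_dot': bool(re.search(r'\bfoo\b', 'foo.')),      # boundary on both sides
    'foobar': bool(re.search(r'\bfoo\b', 'foobar')),     # no boundary between o and b
    'lone_dash': bool(re.search(r'\b-\b', '-')),         # no word char next to "-"
    'dash_in_word': bool(re.search(r'\b-\b', 'a-b')),    # word chars on both sides
    'emoji_b': bool(re.search(r'\b✅\b', 'ok ✅ ok')),     # space and emoji: no boundary
    'emoji_look': bool(re.search(r'(?<!\w)✅(?!\w)', 'ok ✅ ok')),
}

for name, matched in checks.items():
    print(name, matched)
```

Only 'foobar', 'lone_dash' and 'emoji_b' come out False. The lookaround version matches the emoji because it only forbids word characters next to it instead of demanding one.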

Match star * character at end of word boundary \b

The * is not a word character, so there is no match when it is followed by \b and then a non-word character.

Assuming the initial word boundary is fine, but you want to match sh*t but not sh*t*, or match f***! but not f***a, how about simulating your own word boundary with a negative lookahead?

\b(...)(?![\w*])

If needed, the opening word boundary \b can be replaced by a negative lookbehind: (?<![\w*])
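A minimal sketch of that idea in Python. The word list stands in for the elided (...) group and is taken from the examples in the text, so treat it as a placeholder rather than the asker's actual list.

```python
import re

words = ['sh*t', 'f***']  # placeholder list for the (...) group
alternation = '|'.join(re.escape(w) for w in words)

# Right-hand side: fail if a word char or another * follows the match.
rx = re.compile(r'\b(' + alternation + r')(?![\w*])')

outcome = {text: bool(rx.search(text)) for text in ['sh*t', 'sh*t*', 'f***!', 'f***a']}
print(outcome)
```

Here 'sh*t' and 'f***!' match, while 'sh*t*' and 'f***a' are rejected by the lookahead.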

How to test for word boundaries, if the pattern starts or ends with punctuation?

You can use adaptive dynamic word boundaries:

// Found in Mozilla's RegExp guide.
function escapeRegExp(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

let msg = "a b c !test1 d e f";
let cmd = "!test1";

// Note: the lookbehind (?<!...) requires an ES2018+ JavaScript engine.
let re = new RegExp("(?!\\B\\w)" + escapeRegExp(cmd) + "(?<!\\w\\B)");
// console.log(re.source); // => (?!\B\w)!test1(?<!\w\B)
console.log(`re: ${re.test(msg)}`);
// => re: true

MySQL 8.0.30 Regular Expression Word Matching with Special Characters

You need to escape special chars in the search phrase and use the construct that I call "adaptive dynamic word boundaries" instead of word boundaries:

var_text REGEXP CONCAT('(?!\\B\\w)',REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\$1'),'(?<!\\w\\B)')

The REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\$1') call matches the . ^ $ * + - ? ( ) [ ] { } \ | chars and adds a \ before each of them, and (?!\\B\\w) / (?<!\\w\\B) require word boundaries only when the search phrase starts/ends with a word char.
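To make the two steps easier to follow, here is a Python analogue, not MySQL code. It reuses the same character class for escaping and then wraps the result in the adaptive boundaries. The function name and the keyword are invented for the example.

```python
import re

def escape_keyword(keyword):
    # Same character class as in the SQL expression; every hit gets a leading backslash.
    return re.sub(r'([-.^$*+?()\[\]{}\\|])', r'\\\1', keyword)

escaped = escape_keyword('c++ (beta)')
print(escaped)  # c\+\+ \(beta\)

rx = re.compile(r'(?!\B\w)' + escaped + r'(?<!\w\B)')
print(bool(rx.search('built with c++ (beta) today')))    # True
print(bool(rx.search('built with abc++ (beta) today')))  # False: "c" is inside a word
```

In Python itself, re.escape would normally do the escaping step; the manual substitution is shown only to mirror the SQL expression.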

More details on adaptive dynamic word boundaries and demo in my YT video.

Regex to detect string including special characters

You may use this regex, which combines the two flavors of word boundary, \b and \B:

\bc\+\+\B

RegEx Details:

  • \b: Word boundary between a non-word and word character
  • c\+\+: Match c++
  • \B: Inverse of word boundary; it matches wherever \b doesn't, here after the final + when no word character follows

Python Code:

>>> import re
>>> s1 = 'I am using c++ programming'
>>> s2 = 'I am usingc++ programming'
>>> rx = re.compile(r'\bc\+\+\B')
>>> print(rx.findall(s1))
['c++']
>>> print(rx.findall(s2))
[]
