Word boundary with words starting or ending with special characters gives unexpected results
See what a word boundary matches:
A word boundary can occur in one of three positions:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
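The three positions listed above can be checked directly. The following is a minimal sketch; the sample string "foo. bar" is an assumption chosen for illustration, not taken from the question:

```python
import re

text = "foo. bar"

# Collect every index at which the zero-width \b assertion succeeds.
boundaries = [m.start() for m in re.finditer(r"\b", text)]

print(boundaries)  # [0, 3, 5, 8]
```

Index 4 (between "." and the space) is absent because neither neighbour is a word character.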
In your pattern, }\b only matches if there is a word char after } (a letter, digit or _).
When you use (\W|$), you require a non-word char or the end of string explicitly.
A solution is adaptive word boundaries:
re.search(r'(?:(?!\w)|\b(?=\w)){}(?:(?<=\w)\b|(?<!\w))'.format(re.escape(r'Sortes\index[persons]{Sortes}')), r'test Sortes\index[persons]{Sortes} test')
Or equivalent:
re.search(r'(?!\B\w){}(?<!\w\B)'.format(re.escape(r'Sortes\index[persons]{Sortes}')), r'test Sortes\index[persons]{Sortes} test')
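If the same wrapping is needed in several places, it may be convenient to put it in a small helper. This is only a sketch: the name adaptive_pattern is hypothetical, and the second test string is an assumption added to show a rejected context.

```python
import re

def adaptive_pattern(literal):
    """Hypothetical helper: escape a literal and wrap it in adaptive word boundaries."""
    return re.compile(r"(?!\B\w)" + re.escape(literal) + r"(?<!\w\B)")

rx = adaptive_pattern(r"Sortes\index[persons]{Sortes}")

# Preceded by a space: the left-hand boundary is satisfied.
hit = rx.search(r"test Sortes\index[persons]{Sortes} test")
# Glued to a preceding word char: the left-hand boundary rejects it.
miss = rx.search(r"testSortes\index[persons]{Sortes} test")

print(bool(hit), bool(miss))  # True False
```

String concatenation is used here instead of str.format so that braces in the pattern never need doubling.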
Here, adaptive dynamic word boundaries are used that mean the following:
- (?:(?!\w)|\b(?=\w)) (equal to (?!\B\w)) - a left-hand boundary, making sure the current position is at the word boundary if the next char is a word char, or no context restriction is applied if the next char is not a word char (note that you will need to use (?:\B(?!\w)|\b(?=\w)) if you want to disallow a word char immediately on the left if the next char is not a word char)
- (?:(?<=\w)\b|(?<!\w)) (equal to (?<!\w\B)) - a right-hand boundary, making sure the current position is at the word boundary if the previous char is a word char, or no context restriction is applied if the previous char is not a word char (note that you will need to use (?:(?<=\w)\b|\B(?<!\w)) if you want to disallow a word char immediately on the right if the preceding char is not a word char).
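The difference between the default adaptive boundaries and the stricter variants mentioned in the notes above can be seen with a key that starts and ends with a non-word char. A minimal sketch, assuming the key "[test]" and two made-up input strings:

```python
import re

key = re.escape("[test]")

# Default adaptive boundaries: no restriction next to non-word chars.
lenient = re.compile(r"(?!\B\w)" + key + r"(?<!\w\B)")
# Stricter variants: also disallow a word char next to a non-word edge.
strict = re.compile(r"(?:\B(?!\w)|\b(?=\w))" + key + r"(?:(?<=\w)\b|\B(?<!\w))")

for text in ("a [test] b", "a[test]b"):
    print(text, bool(lenient.search(text)), bool(strict.search(text)))
# a [test] b True True
# a[test]b True False
```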
You might also consider using unambiguous word boundaries based on negative lookarounds in these cases:
re.search(r'(?<!\w){}(?!\w)'.format(re.escape(r'Sortes\index[persons]{Sortes}')), r'test Sortes\index[persons]{Sortes} test')
Here, the (?<!\w) negative lookbehind will fail the match if there is a word char immediately to the left of the current location, and the (?!\w) negative lookahead will fail the match if there is a word char immediately to the right of the current location.
Which to choose? Adaptive word boundaries are more lenient compared to unambiguous word boundaries as the latter presume there must be no word chars on both ends of a match, while the former allow matching leading and trailing non-word chars in any context.
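To make the choice more concrete, here is a small comparison of the two approaches. It is a sketch only; the key "#tag" and the sample strings are assumptions picked to show where the two differ.

```python
import re

key = re.escape("#tag")
adaptive = re.compile(r"(?!\B\w)" + key + r"(?<!\w\B)")
unambiguous = re.compile(r"(?<!\w)" + key + r"(?!\w)")

# Map each sample to (adaptive matched?, unambiguous matched?).
results = {
    text: (bool(adaptive.search(text)), bool(unambiguous.search(text)))
    for text in ("word #tag", "word#tag", "word #tags")
}

for text, outcome in results.items():
    print(repr(text), outcome)
# 'word #tag' (True, True)
# 'word#tag' (True, False)
# 'word #tags' (False, False)
```

Only the middle case differs: the key starts with a non-word char, so the adaptive version does not care what precedes it, while the unambiguous version rejects the preceding "d".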
Note: It is easy to customize these lookaround patterns further: say, to only fail the match if there are letters around the pattern, use [^\W\d_] instead of \w, or, if you only allow matches around whitespace, use the whitespace boundaries (?<!\S) / (?!\S).
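The customizations mentioned in the note could look like this in Python. This is a sketch under assumptions: the helper name bounded and all sample strings are made up for illustration.

```python
import re

def bounded(literal, left, right):
    """Hypothetical helper: escape a literal and surround it with custom lookarounds."""
    return re.compile(left + re.escape(literal) + right)

# Letter boundaries: only letters next to the match make it fail.
letters = bounded("test", r"(?<![^\W\d_])", r"(?![^\W\d_])")
# Whitespace boundaries: the match must be enclosed with whitespace or string edges.
spaces = bounded("[test]", r"(?<!\S)", r"(?!\S)")

print(bool(letters.search("1test2")))        # True
print(bool(letters.search("atest")))         # False
print(bool(spaces.search("a [test] b")))     # True
print(bool(spaces.search("a [test], b")))    # False
```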
Word boundaries not matching when the word starts or ends with special character like square brackets
You must account for two things here:
- Special characters must be escaped with a literal \ symbol, which is best done using the Regex.Escape method when you have dynamic literal text passed as a variable to regex
- It is not possible to rely on word boundaries, \b, because the meaning of this construct depends on the immediate context.
You can use dynamic adaptive word boundaries (see my YT video about these word boundaries):
string input = "This is [test] version of application.";
string key = "[test]";
string stringtoFind = $@"(?!\B\w){Regex.Escape(key)}(?<!\w\B)";
Console.WriteLine(Regex.Replace(input, stringtoFind, "1.0"));
You may also use Regex.Escape with unambiguous word boundaries (?<!\w) and (?!\w):
string input = "This is [test] version of application.";
string key = "[test]";
string stringtoFind = $@"(?<!\w){Regex.Escape(key)}(?!\w)";
Console.WriteLine(Regex.Replace(input, stringtoFind, "1.0"));
Note that if you want to replace a key string only when it is enclosed with whitespace (or is at the start/end of the string), use
string stringtoFind = $@"(?<!\S){Regex.Escape(key)}(?!\S)";
                         ^^^^^^^                   ^^^^^^
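For readers working in Python rather than C#, roughly the same replacement can be sketched with re.escape and re.sub. The third call uses an extra input string that is an assumption, added to show a case the whitespace boundaries reject.

```python
import re

text = "This is [test] version of application."
key = "[test]"

adaptive = r"(?!\B\w)" + re.escape(key) + r"(?<!\w\B)"
whitespace = r"(?<!\S)" + re.escape(key) + r"(?!\S)"

replaced_adaptive = re.sub(adaptive, "1.0", text)
replaced_whitespace = re.sub(whitespace, "1.0", text)
# A comma right after the key violates the right-hand whitespace boundary.
untouched = re.sub(whitespace, "1.0", "See [test], please.")

print(replaced_adaptive)    # This is 1.0 version of application.
print(replaced_whitespace)  # This is 1.0 version of application.
print(untouched)            # See [test], please.
```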
How can I use \b boundary around special characters
You can use the pattern:
(?<!\w)✅(?!\w)
This uses negative lookarounds to match an emoji with no word characters on either side.
The reason for the matches you asked about is that \b is a zero-width boundary where one side of the boundary is \w (a word character, [0-9A-Za-z_] in ASCII-only matching) and the other is the beginning or end of the string or \W (a non-word character).
For example, consider the string "foo.":
start of string boundary (zero width)
|
|   non-word character
|   |
v   v
 foo.
 ^ ^
 | |
 word characters
The \b boundary could be used in the regex \bfoo\b and find a match thanks to the boundary between the o and . characters and the boundary between the beginning of the string and the character f.
"foobar" does not match \bfoo\b because the second o and b don't satisfy the boundary condition, that is, b isn't a non-word character or end of the string.
The pattern \b-\b does not match the string "-" because "-" isn't a word character. Likewise, emojis are built from non-word characters, so \b next to an emoji only matches when a word character is on the other side of the boundary, unlike the word characters in \bfoo\b.
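The cases discussed above can be verified with a few quick checks. This sketch uses Python's re module (an assumption, since the question's language is not stated here); the helper name found and the strings "a ✅ b" and "a✅b" are made up for illustration.

```python
import re

def found(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

print(found(r"\bfoo\b", "foo."))            # True
print(found(r"\bfoo\b", "foobar"))          # False
print(found(r"\b-\b", "-"))                 # False
# \b next to an emoji needs a word char on the other side.
print(found(r"\b✅\b", "a ✅ b"))            # False
print(found(r"\b✅\b", "a✅b"))              # True
# The lookaround version behaves the other way round.
print(found(r"(?<!\w)✅(?!\w)", "a ✅ b"))   # True
print(found(r"(?<!\w)✅(?!\w)", "a✅b"))     # False
```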
Match star * character at end of word boundary \b
The * is not a word character, so there is no match if it is followed by \b and then a non-word character (or the end of the string).
Assuming the initial word boundary is fine but you want to match sh*t but not sh*t*, or match f***! but not f***a, how about simulating your own word boundary with a negative lookahead:
\b(...)(?![\w*])
If needed, the opening word boundary \b can be replaced by a negative lookbehind: (?<![\w*])
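As a rough illustration of the lookahead approach in Python, the (...) placeholder can be filled with an escaped alternation. The word list and the sample sentence are assumptions, and a non-capturing group is used here so that findall returns whole matches.

```python
import re

words = ["sh*t", "f***"]  # hypothetical word list
alternation = "|".join(re.escape(w) for w in words)

# \b on the left, a simulated boundary on the right that also rejects '*'.
rx = re.compile(r"\b(?:" + alternation + r")(?![\w*])")

found = rx.findall("sh*t sh*t* f***! f***a")
print(found)  # ['sh*t', 'f***']
```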
How to test for word boundaries, if the pattern starts or ends with punctuation?
You can use adaptive dynamic word boundaries:
// found in Mozilla's RegExp guide.
function escapeRegExp(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}
let msg = "a b c !test1 d e f";
let cmd = "!test1";
let re = new RegExp("(?!\\B\\w)" + escapeRegExp(cmd) + "(?<!\\w\\B)");
// console.log(re.source);// => (?!\B\w)!test1(?<!\w\B)
console.log(`re: ${re.test(msg)}`);
// => re: true
MySQL 8.0.30 Regular Expression Word Matching with Special Characters
You need to escape special chars in the search phrase and use the construct that I call "adaptive dynamic word boundaries" instead of word boundaries:
var_text REGEXP CONCAT('(?!\\B\\w)',REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\\\$1'),'(?<!\\w\\B)')
The REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\\\$1') part matches the . ^ $ * + - ? ( ) [ ] { } \ | chars and adds a \ before each of them (the replacement is written as '\\\\$1' so that, after string-literal unescaping, the regex engine sees \\$1, a literal backslash followed by group 1), and (?!\\B\\w) / (?<!\\w\\B) require word boundaries only when the search phrase starts/ends with a word char.
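To see what the escaping step produces, the same character class can be tried in Python. This is only an illustration of the idea, not MySQL code; the function names and the sample keyword are assumptions.

```python
import re

# Same set of special chars as in the SQL expression above.
SPECIAL = r"([-.^$*+?()\[\]{}\\|])"

def escape_keyword(keyword):
    """Put a backslash before every special character (hypothetical helper)."""
    return re.sub(SPECIAL, r"\\\1", keyword)

def keyword_regex(keyword):
    """Wrap the escaped keyword in adaptive dynamic word boundaries."""
    return r"(?!\B\w)" + escape_keyword(keyword) + r"(?<!\w\B)"

print(escape_keyword("c++ (v1.0)"))   # c\+\+ \(v1\.0\)
print(keyword_regex("c++"))           # (?!\B\w)c\+\+(?<!\w\B)
print(bool(re.search(keyword_regex("c++"), "I use c++ daily")))  # True
print(bool(re.search(keyword_regex("c++"), "I use abc++")))      # False
```

In Python itself, re.escape would normally do the escaping; the manual substitution is shown only because it mirrors the REGEXP_REPLACE call.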
More details on adaptive dynamic word boundaries and demo in my YT video.
Regex to detect string including special characters
You may use this regex, which combines both kinds of word boundary assertions, \b and \B:
\bc\+\+\B
RegEx Details:
- \b: Word boundary between a non-word and a word character
- c\+\+: Match c++
- \B: Inverse of word boundary, matches where \b doesn't match
Python Code:
>>> import re
>>> s1 = 'I am using c++ programming'
>>> s2 = 'I am usingc++ programming'
>>> rx = re.compile(r'\bc\+\+\B')
>>> print (rx.findall(s1))
['c++']
>>> print (rx.findall(s2))
[]
>>>