JavaScript Regexp + Word Boundaries + Unicode Characters

utf-8 word boundary regex in javascript

The word boundary assertion \b matches only at a position where a word character is not preceded or not followed by another word character (so .\b. is equivalent to \W\w or \w\W). In JavaScript, \w is defined as [A-Za-z0-9_], so it does not match Greek characters, and \b therefore cannot be used in this case.
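As a quick check of the explanation above, the following sketch compares \w and \b on an ASCII string and a Greek string. The variable names are illustrative only.

```javascript
// \w covers only ASCII letters, digits, and the underscore.
const greekIsWordChar = /\w/.test("α"); // false
const latinIsWordChar = /\w/.test("a"); // true

// "αβ" contains no word characters, so \b finds no boundary anywhere in it.
const greekBoundary = /\bαβ\b/.test("αβ"); // false
const latinBoundary = /\bab\b/.test("ab"); // true

// The u flag on its own does not make \b Unicode-aware.
const greekBoundaryU = /\bαβ\b/u.test("αβ"); // false

console.log(greekIsWordChar, latinIsWordChar, greekBoundary, latinBoundary, greekBoundaryU);
```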

What you could do instead is to use this:

"αβ αβγ γαβ αβ αβ".replace(/(^|\s)αβ(?=\s|$)/g, "$1AB")

RegExp word boundary with special characters (.) javascript

You can check for the word boundary at the start (as you were doing). The tricky part is the end: \b does not work there because the pattern ends with a dot (.), which is not a word character. Instead, you can check for a whitespace character or the end of the string:

/\b(u\.s\.a\.)(?:\s|$)/gi
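One detail worth seeing in practice: the (?:\s|$) group consumes the trailing whitespace, so the full match includes it while the capture group does not. The sketch below uses a made-up sample sentence and also shows a lookahead variant, which is an alternative rather than part of the original answer.

```javascript
const text = "Made in the U.S.A. and shipped from the u.s.a.";

// Original pattern: the trailing whitespace is part of the full match.
const consuming = /\b(u\.s\.a\.)(?:\s|$)/gi;
const fullMatches = text.match(consuming);
const captured = [...text.matchAll(consuming)].map((m) => m[1]);

// Alternative: a lookahead checks for whitespace or end of string without consuming it.
const lookahead = /\bu\.s\.a\.(?=\s|$)/gi;
const lookaheadMatches = text.match(lookahead);

console.log(fullMatches);      // [ 'U.S.A. ', 'u.s.a.' ]
console.log(captured);         // [ 'U.S.A.', 'u.s.a.' ]
console.log(lookaheadMatches); // [ 'U.S.A.', 'u.s.a.' ]
```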


Javascript regex with word boundary includes word with special characters

\b only works for ASCII. To handle non-ASCII word boundaries, use Unicode property escapes (which require the u flag) inside lookarounds, for example:

const nodes = [{
  textContent: "Ford is the best"
}, {
  textContent: "Fordørgen is the best"
}];

const variable = 'Ford';
// Match only when not preceded or followed by an alphabetic character
const regex = new RegExp('(?<!\\p{Alpha})' + variable + '(?!\\p{Alpha})', 'u');

const matches = nodes.filter(function(node) {
  return regex.test(node.textContent);
});

console.log(matches);
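The same idea can be wrapped in a small helper. The sketch below is one possible approach, not the answer's own code: unicodeWordRegex is a hypothetical name, and treating letters, digits, and the underscore (\p{L}, \p{N}, _) as word characters is an assumption that mirrors \w.

```javascript
// Hypothetical helper: builds a regex that matches `word` only when it is not
// adjacent to a Unicode letter, a Unicode digit, or an underscore.
function unicodeWordRegex(word) {
  // Escape regex metacharacters so the word is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`(?<![\\p{L}\\p{N}_])${escaped}(?![\\p{L}\\p{N}_])`, "gu");
}

const greek = "αβ αβγ γαβ αβ αβ".replace(unicodeWordRegex("αβ"), "AB");
console.log(greek); // "AB αβγ γαβ AB AB"

console.log(unicodeWordRegex("Ford").test("Ford is the best"));      // true
console.log(unicodeWordRegex("Ford").test("Fordørgen is the best")); // false
```

Lookbehind assertions and property escapes both need an ES2018-capable engine.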

Regular Expression Word Boundary and Special Characters

\b is a zero-width assertion: it doesn't consume any characters, it just asserts that a certain condition holds at a given position. A word boundary asserts that the position is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one. (In JavaScript, a "word character" is an ASCII letter, a digit, or an underscore.) In your string:

add +

...there's a word boundary at the beginning because the a is not preceded by a word character, and there's one after the second d because it's not followed by a word character. The \b in your regex (/\b\+/) is trying to match between the space and the +, which doesn't work because neither of those is a word character.
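The behaviour described above can be checked directly. In this sketch the patterns other than /\b\+/ are illustrative alternatives and do not come from the original answer.

```javascript
const input = "add +";

// No boundary between the space and "+", so \b\+ fails here.
const withSpace = /\b\+/.test(input); // false

// When "+" directly follows a word character, the boundary exists.
const withoutSpace = /\b\+/.test("add+"); // true

// \B (non-boundary) matches between the space and "+".
const nonBoundary = /\B\+/.test(input); // true

// One alternative: require whitespace or a string edge on both sides.
const standalonePlus = /(?<!\S)\+(?!\S)/;
const standaloneInInput = standalonePlus.test(input);  // true
const standaloneAttached = standalonePlus.test("add+"); // false

console.log(withSpace, withoutSpace, nonBoundary, standaloneInInput, standaloneAttached);
```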

Javascript Regex Word Boundary with optional non-word character

You need to account for 3 things here:

  • The main point is that a \b word boundary is a context-dependent construct, and if your input is not always alphanumeric-only, you need unambiguous word boundaries.
  • You need to double the backslashes when the pattern is a string passed to the RegExp constructor.
  • Because you pass a variable into the regex, you need to make sure all special characters in it are properly escaped.

Use

let userStr = 'why hello there, or should I say #hello there?';
let keyword = '#hello';
// Escape regex metacharacters in the keyword before building the pattern
let re_pattern = `(?:^|\\W)(${keyword.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&')})(?!\\w)`;
let res = [], m;

// To find a single (first) match
console.log((m = new RegExp(re_pattern).exec(userStr)) ? m[1] : "");

// To find multiple matches:
let rx = new RegExp(re_pattern, "g");
while (m = rx.exec(userStr)) {
  res.push(m[1]);
}
console.log(res);
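To see why \b is called context-dependent here, the sketch below compares a naive \b-wrapped pattern with the unambiguous boundaries used above. The extra test strings ("x#hello", "#hellothere") are made up for illustration.

```javascript
const sample = "why hello there, or should I say #hello there?";

// Before "#", \b requires a word character on the left, so " #hello" is rejected.
const naive = /\b#hello\b/;
const naiveOnSample = naive.test(sample);      // false
const naiveOnGlued = naive.test("x#hello");    // true

// Unambiguous boundaries: start or non-word char before, no word char after.
const unambiguous = /(?:^|\W)(#hello)(?!\w)/;
const found = unambiguous.exec(sample);
const foundKeyword = found ? found[1] : "";    // "#hello"
const unambiguousOnGlued = unambiguous.test("x#hello");      // false
const unambiguousOnLonger = unambiguous.test("#hellothere"); // false

console.log(naiveOnSample, naiveOnGlued, foundKeyword, unambiguousOnGlued, unambiguousOnLonger);
```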

