Utf-8 Word Boundary Regex in JavaScript

utf-8 word boundary regex in javascript

The word boundary assertion \b matches only at a position where a word character is not preceded or not followed by another word character (so .\b. is equivalent to \W\w or \w\W). In JavaScript, \w is defined as [A-Za-z0-9_], so \w doesn’t match Greek characters, and you cannot use \b for this case.

What you could do instead is to use this:

"αβ αβγ γαβ αβ αβ".replace(/(^|\s)αβ(?=\s|$)/g, "$1AB")

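The difference between \b and the group-based workaround can be checked with a short sketch. The variable names are illustrative only, and the third variant is an assumption that goes beyond the original answer: it needs an ES2018+ engine that supports lookbehind and \p{...} property escapes with the u flag.

```javascript
// Sample text from the answer above: only the standalone "αβ" words should change.
const text = "αβ αβγ γαβ αβ αβ";

// \b relies on \w ([A-Za-z0-9_]), so no boundary is ever found next to Greek letters.
const withBoundary = text.replace(/\bαβ\b/g, "AB");

// Workaround from the answer: whitespace or the string edges act as the boundary.
const withGroups = text.replace(/(^|\s)αβ(?=\s|$)/g, "$1AB");

// Assumption: ES2018+ engine. Any Unicode letter, digit, or underscore counts as a word character.
const withUnicode = text.replace(/(?<![\p{L}\p{N}_])αβ(?![\p{L}\p{N}_])/gu, "AB");

console.log(withBoundary); // αβ αβγ γαβ αβ αβ (unchanged)
console.log(withGroups);   // AB αβγ γαβ AB AB
console.log(withUnicode);  // AB αβγ γαβ AB AB
```

The lookahead in the second pattern does not consume the trailing space, which is why the two adjacent "αβ" words at the end are both replaced.
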
Javascript RegEx UTF-8

Again: I don't know if this is the answer you are looking for. This will also recapitalize the first letter of the name. So if I'm writing "My name is Salvador Dalí" the answer is: "Hello, Salvador Dalí! Nice to meet you!"

var myInput = document.getElementById("myInput");

function myFunction() {
  // Lowercase the input so the "my name is " prefix matches regardless of case.
  var text,
    answer = myInput.value.toLowerCase();
  answer = answer.replace("my name is ", "");
  switch (answer) {
    case "":
      text = "Please type something.";
      break;
    default:
      text = "Hello, " + CapitalizeName(answer) + "! Nice to meet you!";
  }
  // textContent avoids interpreting the user's input as HTML.
  document.getElementById("reply").textContent = text;
}

// Uppercase the first character of each space-separated word.
function CapitalizeName(name) {
  return name
    .split(" ")
    .map(w => w.charAt(0).toUpperCase() + w.slice(1))
    .join(" ");
}

<p>What is your name?</p>
<input id="myInput" type="text">
<button onclick="myFunction()">Go</button>
<p id="reply"></p>

Javascript - regex - word boundary (\b) issue

Since JavaScript did not have lookbehind before ES2018, and since word boundaries work only with members of the \w character class, the portable way is to use groups (and a capturing group if you want to make a replacement):

(?m)(^|[^a-zA-ZΆΈ-ώἀ-ῼ\n])([a-zA-ZΆΈ-ώἀ-ῼ]{2})(?![a-zA-ZΆΈ-ώἀ-ῼ])

Example that removes two-letter words:

txt = txt.replace(/(^|[^a-zA-ZΆΈ-ώἀ-ῼ\n])([a-zA-ZΆΈ-ώἀ-ῼ]{2})(?![a-zA-ZΆΈ-ώἀ-ῼ])/gm, '$1');
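
A minimal sketch of how this replacement behaves on mixed Greek and ASCII text. The helper name removeTwoLetterWords and the LETTER constant are hypothetical, introduced here only for illustration. Note that JavaScript replacement strings refer to capture groups as $1, not \1.

```javascript
// Letters the pattern treats as part of a word: ASCII plus the Greek ranges from the answer.
const LETTER = "a-zA-ZΆΈ-ώἀ-ῼ";

// Hypothetical helper (not part of the original answer).
function removeTwoLetterWords(txt) {
  const re = new RegExp(`(^|[^${LETTER}\\n])([${LETTER}]{2})(?![${LETTER}])`, "gm");
  // $1 puts back the character captured before the word, so only the word itself is dropped.
  return txt.replace(re, "$1");
}

console.log(JSON.stringify(removeTwoLetterWords("το φως is big")));
// " φως  big"
```

The surrounding spaces are kept, so the result may contain leading or doubled spaces that need a separate cleanup pass.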

match hebrew character at word boundary via regex in javascript?

I can't read Hebrew... does this regex do what you want?

/(\S*[\u05D0]+\S*)/g

Your first regex, /(\u05D0+)/g matches on only the character you are interested in.

Your second regex, /(\u05D0)\b/g, matches only when the character you are interested in is the last (or last repeated) character before a word boundary, so it won't match that character at the beginning or in the middle of a word. Note that because Hebrew letters are not part of \w, a \b after \u05D0 is found only when an ASCII word character follows it.

EDIT:

Look at this answer:

utf-8 word boundary regex in javascript

Using the info from that answer, I came up with this regex. Is this correct?

/([\u05D0])(?=\s|$)/g
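
One way to check this is to list the match positions on a small sample, as in the sketch below. The sample string and the helper matchIndexes are hypothetical and written with \u escapes so that the right-to-left text does not get reordered in the code; matchAll assumes an ES2020+ engine.

```javascript
const ALEF = "\u05D0";
const BET = "\u05D1";

// Three words: bet+alef (alef last), alef+bet (alef first), and a lone alef.
const sample = `${BET}${ALEF} ${ALEF}${BET} ${ALEF}`;

// Hypothetical helper: returns the index of every match of re in str.
function matchIndexes(str, re) {
  return [...str.matchAll(re)].map((m) => m.index);
}

// \b needs a \w character on one side, and Hebrew letters are not \w, so nothing matches.
console.log(matchIndexes(sample, /(\u05D0)\b/g)); // []

// Whitespace or end of input as the boundary finds each alef that ends a word.
console.log(matchIndexes(sample, /([\u05D0])(?=\s|$)/g)); // [ 1, 6 ]
```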

Regex wordwrap with UTF8 characters in JS

The problem is that JavaScript recognizes word boundaries only before/after ASCII letters (and numbers/underscore). Just drop the \b anchors and it should work.

result = subject.replace(/[a-zA-Z0-9ßÄÖÜäöüÑñÉéÈèÁáÀàÂâŶĈĉĜĝŷÊêÔôÛûŴŵ-]+/g, "<span>$&</span>");
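
A short sketch of the explicit character class in use, next to a broader alternative. The sample strings and the wrap helper are hypothetical, and the second pattern is an assumption beyond the original answer: it relies on \p{...} property escapes, which need the u flag and an ES2018+ engine.

```javascript
const subject = "Grüße an Émile";

// Explicit character class from the answer: only the listed letters count as word characters.
const listed = /[a-zA-Z0-9ßÄÖÜäöüÑñÉéÈèÁáÀàÂâŶĈĉĜĝŷÊêÔôÛûŴŵ-]+/g;

// Assumption: ES2018+ engine. Matches runs of any Unicode letter, digit, or hyphen.
const anyLetter = /[\p{L}\p{N}\-]+/gu;

// Hypothetical helper: wraps every match of re in a span element.
const wrap = (str, re) => str.replace(re, "<span>$&</span>");

console.log(wrap(subject, listed));
// <span>Grüße</span> <span>an</span> <span>Émile</span>
console.log(wrap("αβγ δε", listed));
// αβγ δε (Greek letters are not in the explicit list)
console.log(wrap("αβγ δε", anyLetter));
// <span>αβγ</span> <span>δε</span>
```

The explicit list is easy to audit but has to be extended for every new alphabet, while the property-based class covers all scripts at the cost of requiring a newer engine.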
