Concrete JavaScript Regular Expression for Accented Characters (Diacritics)

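An easy way to accept the common accented Latin letters is a character class built from code-point ranges:

[A-zÀ-ú] // lowercase and uppercase letters plus accented letters up to ú (A-z also matches [ \ ] ^ _ `, and À-ú also matches × ÷)
[A-zÀ-ÿ] // as above, but extended through ÿ, which adds û ü ý þ ÿ (still matches [ \ ] ^ _ ` × ÷)
[A-Za-zÀ-ÿ] // as above, but not including [ \ ] ^ _ `
[A-Za-zÀ-ÖØ-öø-ÿ] // as above, but also not including × ÷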

See Unicode Character Table for characters listed in numeric order.
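To see how the widest and the strictest class from the list above differ in practice, here is a small sketch. The variable names and sample strings are illustrative and not part of the original answer.

```javascript
// Anchored with ^...$ so the whole string must consist of allowed characters.
const loose = /^[A-zÀ-ÿ]+$/;               // widest class: punctuation slips through
const strict = /^[A-Za-zÀ-ÖØ-öø-ÿ]+$/;     // strictest class: letters only

console.log(loose.test("[^_^]"));     // true: [ ^ _ ] lie between Z and a
console.log(loose.test("a×b"));       // true: × lies between Ö and Ø
console.log(strict.test("[^_^]"));    // false
console.log(strict.test("a×b"));      // false
console.log(strict.test("Éléonore")); // true
```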

What's a good regex to include accented characters in a simple way?

Accented Characters: DIY Character Range Subtraction

If your regex engine allows it (and many will), this will work:

(?i)^(?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ])+$

Explanation

  • (?i) sets case-insensitive mode
  • The ^ anchor asserts that we are at the beginning of the string
  • (?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ]) matches one character...
  • The lookahead (?![×Þß÷þø]) asserts that the char is not one of those in the brackets
  • [-'0-9a-zÀ-ÿ] allows dash, apostrophe, digits, letters, and chars in a wide accented range, from which we need to subtract
  • The + matches that one or more times
  • The $ anchor asserts that we are at the end of the string
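As far as I know, JavaScript does not accept a bare leading (?i), so a JavaScript version of the pattern explained above moves case-insensitivity to the i flag. A minimal sketch; the helper name isPlainName and the sample inputs are my own:

```javascript
// Same pattern as above; the "i" flag replaces the inline (?i) modifier.
const namePattern = /^(?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ])+$/i;

// Hypothetical helper: true when every character is in the allowed set.
function isPlainName(input) {
  return namePattern.test(input);
}

console.log(isPlainName("Jean-François")); // true
console.log(isPlainName("O'Brien"));       // true
console.log(isPlainName("3×4"));           // false: × is subtracted by the lookahead
console.log(isPlainName("Straße"));        // false: ß is in the excluded set
```

Note that the excluded set treats ø and ß as non-letters, which may be too strict for Scandinavian or German names; adjust the lookahead to taste.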

Reference

Extended ASCII Table

Make an accent-insensitive RegExp in JavaScript

There is no RegExp parameter that you can pass to alter the way accents are treated. What you would need to do is build up a matrix of characters that should substitute each other and then construct a RegExp pattern from these substitute characters.

// Accented variants that should be treated as equivalent to each base letter
const e = ['È', 'É', 'Ê', 'Ë', 'è', 'é', 'ê', 'ë'];
const a = ['à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ'];
const substitutions = { e, E: e, a, A: a };

const str = 'Cesar';
// Replace each substitutable letter with a character class of its variants
const pattern = Array.from(str)
  .map(c => substitutions[c] ? `[${c}${substitutions[c].join('')}]` : c)
  .join('');
console.log(pattern);

const match = new RegExp(pattern, 'gi').exec('césar');
console.log(match);
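The snippet above can be wrapped in a reusable function. This is a sketch under two assumptions of mine: the substitution table is extended with a few more letters purely for illustration (it is deliberately incomplete), and regex metacharacters in the input are escaped so they match literally. The names accentGroups, escapeRegExp and buildAccentInsensitiveRegExp are hypothetical.

```javascript
// Hypothetical table: base letter -> accented variants (deliberately incomplete).
const accentGroups = {
  a: "àáâãäå",
  c: "ç",
  e: "èéêë",
  i: "ìíîï",
  o: "òóôõö",
  u: "ùúûü",
};

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function buildAccentInsensitiveRegExp(text) {
  const source = Array.from(text, (ch) => {
    const variants = accentGroups[ch.toLowerCase()];
    return variants ? `[${ch}${variants}]` : escapeRegExp(ch);
  }).join("");
  return new RegExp(source, "i"); // "i" covers upper- and lowercase variants
}

const cesar = buildAccentInsensitiveRegExp("Cesar");
console.log(cesar.test("césar")); // true
console.log(cesar.test("CÉSAR")); // true
console.log(buildAccentInsensitiveRegExp("a.b").test("axb")); // false: the dot is literal
```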

Regex for diacritics

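As Casimir et Hippolyte stated in the comments, JavaScript did not support the \p{L} Unicode character class when this was written. Since ES2018 it is available, but only when the regex uses the u flag.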

You can create your own character class:

[a-zA-Z0-9À-ž]

If you want to allow those characters but replace characters outside those ranges, negate the character classes:

[^a-zA-Z0-9À-ž]

Or as pointed out in comments:

[A-zÀ-ÖØ-öø-įĴ-őŔ-žǍ-ǰǴ-ǵǸ-țȞ-ȟȤ-ȳɃɆ-ɏḀ-ẞƀ-ƓƗ-ƚƝ-ơƤ-ƥƫ-ưƲ-ƶẠ-ỿ]
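Where Unicode property escapes are available (ES2018 and later, with the u flag), the hand-built ranges can be replaced by \p{L}. A short sketch; the variable names and sample words are my own:

```javascript
// \p{L} matches any Unicode letter; \p{M} adds combining marks so that
// decomposed input (base letter followed by a combining accent) is accepted too.
const unicodeLetters = /^[\p{L}\p{M}]+$/u;

console.log(unicodeLetters.test("Žluťoučký"));  // true
console.log(unicodeLetters.test("Cafe\u0301")); // true: "e" followed by U+0301
console.log(unicodeLetters.test("abc123"));     // false: digits are not letters

// Negated form, mirroring [^a-zA-Z0-9À-ž] above: strip everything that is
// not a letter, mark, or number (\p{N}).
const cleaned = "Señor_Núñez!".replace(/[^\p{L}\p{M}\p{N}]/gu, "");
console.log(cleaned); // "SeñorNúñez"
```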

Regex matching whitespace and accented characters

You can match by Unicode range (see a Unicode character table for the code points). Try something like this:

[a-zA-Z\u00C0-\u017F\s]+

Explanation:

  1. a-zA-Z matches that range of lower and uppercase characters.
  2. \u00C0-\u017F matches the accented letters from À through the end of the Latin Extended-A block (it also includes the signs × and ÷).
  3. \s matches whitespace.

let nameToCheck = "Lómöwen Thrél";
let checkValue = /^[a-zA-Z\u00C0-\u017F\s]+$/.test(nameToCheck);
document.write(checkValue ? "valid name" : "invalid name");
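One caveat with this range: \u00C0-\u017F also contains the signs × (U+00D7) and ÷ (U+00F7). If those should be rejected, the range can be split around them, as in this sketch (the names originalPattern, strictPattern and isValidName are hypothetical):

```javascript
// The pattern from the answer above, kept for comparison.
const originalPattern = /^[a-zA-Z\u00C0-\u017F\s]+$/;

// Same idea, but with U+00D7 (×) and U+00F7 (÷) left out of the range.
const strictPattern = /^[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u017F\s]+$/;

function isValidName(name) {
  return strictPattern.test(name);
}

console.log(originalPattern.test("a × b")); // true: × is inside \u00C0-\u017F
console.log(isValidName("a × b"));          // false
console.log(isValidName("Lómöwen Thrél"));  // true
```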

How to replace all accented characters with English equivalents

function Convert(string){
return string.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}
console.log(Convert("Ë À Ì Â Í Ã Î Ä Ï Ç Ò È Ó É Ô Ê Õ Ö ê Ù ë Ú î Û ï Ü ô Ý õ â "))

Output:

"E A I A I A I A I C O E O E O E O O e U e U i U i U o Y o a "
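The same normalization trick also gives an accent-insensitive comparison. A minimal sketch with hypothetical helper names; note that letters without a canonical decomposition, such as ø or ß, pass through unchanged:

```javascript
// Decompose to NFD, then drop the combining marks in U+0300 to U+036F.
function stripDiacritics(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Hypothetical helper: compare two strings while ignoring accents and case.
function equalsIgnoringAccents(left, right) {
  return stripDiacritics(left).toLowerCase() === stripDiacritics(right).toLowerCase();
}

console.log(stripDiacritics("Crème Brûlée"));         // "Creme Brulee"
console.log(equalsIgnoringAccents("césar", "Cesar")); // true
console.log(stripDiacritics("Søren Straße"));         // "Søren Straße" (unchanged)
```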

Make exception when replacing diacritics with regular characters

The only simple way I see is not optimized, but it does the job properly:

const text = "Çééé éÇé àç" // test this string
  .replace(/\u00e7/g, '__minC__') // save wanted chars position
  .replace(/\u00c7/g, '__majC__')
  .normalize('NFD') // normalize to prepare diacritic edit
  .replace(/\p{Diacritic}/gu, '') // replace all diacritics
  .replace(/__minC__/g, 'ç') // restore wanted chars
  .replace(/__majC__/g, 'Ç')

console.log(text)
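If placeholders feel fragile (the text might already contain __minC__), an alternative is to decide per character. This sketch is my own variation, not the method above; stripDiacriticsExcept and its keep parameter are hypothetical names:

```javascript
function stripDiacriticsExcept(text, keep) {
  // Characters listed in `keep` are passed through untouched.
  const protectedChars = new Set(keep);
  // NFC first, so each accented letter is one precomposed character where possible.
  return Array.from(text.normalize("NFC"), (ch) =>
    protectedChars.has(ch)
      ? ch
      : ch.normalize("NFD").replace(/\p{Diacritic}/gu, "")
  ).join("");
}

console.log(stripDiacriticsExcept("Çééé éÇé àç", "çÇ")); // "Çeee eÇe aç"
```

Like the original, this relies on \p{Diacritic}, which also matches a few standalone characters such as ^ and `, so those would be removed from the text as well.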

