How to Use Unicode-Aware Regular Expressions in JavaScript

No \p{L} for JavaScript Regex? Use Unicode in JS regex

What you need to add is a subset of what you asked for. First you should define what set of characters you need. \pL means every letter from every language.

It's kind of ugly, but it doesn't affect performance, and it is arguably the best way around this kind of problem in JS. ECMAScript 2018 added support for \p{L} (with the u flag), but at the time of writing it was still far from being implemented by all major browsers.

As a matter of personal taste, you could reduce the ugliness a bit:

var characterSet = 'a-zA-ZáàâäãåçéèêëíìîïñóòôöõúùûüýÿæœÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝŸÆŒ';
var re = new RegExp('[' + characterSet + ']' + '[' + characterSet + '\' ,"-]*' + '[' + characterSet + '\'",]+');

Credit for this update goes to @Francesco:

var pCL = 'a-zA-ZáàâäãåçéèêëíìîïñóòôöõúùûüýÿæœÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝŸÆŒ';

var re = new RegExp(`[${pCL}][${pCL}' ,"-]*[${pCL}'",]+`);

console.log(re.source);
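
As a quick check of what an explicit set does and does not cover, the sketch below anchors the same pattern with ^ and $ (an assumption; the original pattern is unanchored) and wraps it in a hypothetical helper, isValidName, which is not part of the original answer:

```javascript
// Same character set as above; the ^ and $ anchors are an addition for whole-string checks.
const pCL = 'a-zA-ZáàâäãåçéèêëíìîïñóòôöõúùûüýÿæœÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝŸÆŒ';
const anchoredRe = new RegExp(`^[${pCL}][${pCL}' ,"-]*[${pCL}'",]+$`);

// Hypothetical helper for illustration only.
function isValidName(input) {
  return anchoredRe.test(input);
}

console.log(isValidName('Jean-François')); // true: every letter is in the set
console.log(isValidName('Łukasz'));        // false: "Ł" is not in the set
```

The second result is the trade-off of a hand-written set: any letter you did not list is rejected, whereas \p{L} would accept it.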

Can I use the unicode flag within a JSON schema pattern (regular expression)?

The 2020-12 version of JSON Schema (which you reference) has a separate, more detailed changelog (informative), which notes the following point that may not be obvious from the specification itself:

Regular expressions are now expected (but not strictly required) to
support unicode characters. Previously, this was unspecified and
implementations may or may not support this unicode in regular
expressions. - https://json-schema.org/draft/2020-12/release-notes.html

If you are using an implementation that supports JSON Schema draft 2020-12, you should be able to use Unicode in regular expressions, as that flag should be enabled.

You cannot specify flags with the regular expression, and the actual requirements for regular expression support are only SHOULD, not MUST. In the specification world, this means you cannot rely on this being interoperable. If you only plan to use the schemas internally, and you test it and it works (it should, given it sounds like you're working with js/node), then you'll probably be OK, but schemas shared with others may not work as expected.

Some implementations in other languages use a port of the ECMA-262 regular expression engine, but not all do, and sometimes there isn't a port available.
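
To see why the flag matters in a JavaScript-based validator, here is a rough sketch. The schema fragment and the helper name matchesPattern are made up for illustration, and compiling with the u flag is an assumption about how a 2020-12 implementation might behave, not a description of any particular library:

```javascript
// Hypothetical schema fragment using a Unicode property escape.
const schema = { type: 'string', pattern: '^\\p{L}+$' };

// Hypothetical helper: compile the pattern the way a validator might,
// with or without the "u" flag.
function matchesPattern(pattern, value, unicode) {
  const re = new RegExp(pattern, unicode ? 'u' : '');
  return re.test(value);
}

console.log(matchesPattern(schema.pattern, 'été', true));   // true
console.log(matchesPattern(schema.pattern, 'été', false));  // false: without "u", \p is just a literal "p"
console.log(matchesPattern(schema.pattern, 'p{L}', false)); // true: the pattern is read literally
```

This is the kind of silent difference that makes a schema behave differently across implementations, so it is worth testing against the validator you actually use.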

What's the correct regex range for javascript's regexes to match all the non word characters in any script?

Generic solution

Mathias Bynens suggests following the UTS18 recommendation, so a Unicode-aware \W will look like:

[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]

Please note the comment for the suggested Unicode property class combination:

This is only an approximation to Word Boundaries (see b below). The
Connector Punctuation is added in for programming language
identifiers, thus adding "_" and similar characters.
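
In an engine that supports ES2018 property escapes, the UTS18-style class can be used directly with the u flag. A small sketch comparing it with the built-in, ASCII-only \W (the sample string is arbitrary):

```javascript
// UTS18-style approximation of \W; requires the "u" flag (ES2018+).
const nonWordUnicode = /[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/gu;
const sample = 'naïve café_42!';

// The built-in \W only knows ASCII word characters, so accented letters are stripped too.
console.log(sample.replace(/\W/g, ''));          // "navecaf_42"
// The Unicode-aware class keeps letters, marks, digits and "_".
console.log(sample.replace(nonWordUnicode, '')); // "naïvecafé_42"
```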

More considerations

The \w construct (and thus its \W counterpart), when matching in a Unicode-aware context, matches a similar but somewhat different set of characters across regex engines.

For example, here is the .NET definition of the non-word character \W: [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Mn}\p{Pc}\p{Lm}], where \p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm} can be contracted to a plain \p{L}, and the pattern is thus equal to [^\p{L}\p{Nd}\p{Mn}\p{Pc}].

In Android (see documentation), [^\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}], where \p{gc=Mn}\p{gc=Me}\p{gc=Mc} can be just written as \p{M}.

In PHP PCRE, \W matches [^\p{L}\p{N}_].

Rexegg cheat sheet defines Python 3 \w as "Unicode letter, ideogram, digit, or underscore", i.e. [\p{L}\p{Mn}\p{Nd}_].

You may roughly decompose \W as [^\p{L}\p{N}\p{M}\p{Pc}]:

/[^\p{L}\p{N}\p{M}\p{Pc}]/gu

where

  • [^ - the start of the negated character class that matches a single char other than:
    • \p{L} - any Unicode letter
    • \p{N} - any Unicode number (decimal digits, letter numbers, and other numeric characters)
    • \p{M} - any combining mark (e.g. a diacritic)
    • \p{Pc} - a connector punctuation symbol
  • ] - the end of the character class.

Note that it is the \p{Pc} class that matches the underscore.

Note that \p{Alphabetic} (\p{Alpha}) includes all letters matched by \p{L}, plus letter numbers matched by \p{Nl} (e.g. Ⅻ, a character for the Roman numeral 12), plus some other symbols matched with \p{Other_Alphabetic} (\p{OAlpha}).
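
These relationships are easy to verify in an ES2018+ engine. A short sketch using the Roman numeral twelve (U+216B) and the underscore as probes:

```javascript
const romanTwelve = '\u216B'; // "Ⅻ", general category Nl (letter number)

console.log(/\p{L}/u.test(romanTwelve));          // false: not a letter
console.log(/\p{Nl}/u.test(romanTwelve));         // true
console.log(/\p{Alphabetic}/u.test(romanTwelve)); // true: Alphabetic includes Nl
console.log(/\p{Pc}/u.test('_'));                 // true: underscore is connector punctuation
```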

Other variations:

  • /[^\p{L}0-9_]/gu - a \W variant that is aware of Unicode letters only
  • /[^\p{L}\p{N}_]/gu - (PCRE \W style) a \W variant that is aware of Unicode letters and numbers only.
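
The practical difference between these variants shows up with non-ASCII digits and combining marks. A sketch, with probe characters chosen purely for illustration:

```javascript
const lettersOnly = /[^\p{L}0-9_]/u;            // Unicode letters, ASCII digits
const pcreStyle = /[^\p{L}\p{N}_]/u;            // Unicode letters and numbers
const decomposed = /[^\p{L}\p{N}\p{M}\p{Pc}]/u; // also marks and connector punctuation

const arabicThree = '\u0663';    // ARABIC-INDIC DIGIT THREE, category Nd
const combiningAcute = '\u0301'; // COMBINING ACUTE ACCENT, category Mn

console.log(lettersOnly.test(arabicThree));   // true: treated as a non-word character
console.log(pcreStyle.test(arabicThree));     // false
console.log(pcreStyle.test(combiningAcute));  // true
console.log(decomposed.test(combiningAcute)); // false
```

Which variant fits depends on whether decomposed text (a base letter followed by a combining mark) and non-ASCII digits should count as word characters in your data.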

Note that Java's (?U)\W will match a mix of what \W matches in PCRE, Python and .NET.

Match only unicode letters

Starting with ECMAScript 2018, JavaScript finally supports Unicode property escapes natively.

For older versions, you either need to define all the relevant Unicode ranges yourself or use Steven Levithan's XRegExp package with its Unicode add-ons and utilize its Unicode property shortcuts:

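// Assumes XRegExp and its Unicode add-ons are already loaded,
// e.g. via var XRegExp = require('xregexp');
var regex = XRegExp("^\\p{L}*$");
var a = "abcäöüéèê";
if (regex.test(a)) {
    // Match
} else {
    // No match
}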
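
On engines that implement ES2018, the same check needs no library. A minimal sketch, assuming the runtime supports property escapes with the u flag:

```javascript
// Native equivalent of the XRegExp check; \p{L} requires the "u" flag.
const onlyLetters = /^\p{L}*$/u;

console.log(onlyLetters.test('abcäöüéèê')); // true
console.log(onlyLetters.test('abc123'));    // false: digits are not letters
```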

