No \p{L} for JavaScript Regex? Use Unicode in JS regex
What you need to add is a subset of what you asked for. First you should define what set of characters you need. \pL means every letter from every language.
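It's kind of ugly, but it doesn't affect performance and is arguably the best way to work around this kind of problem in JS. ECMAScript 2018 adds support for \p{L} (the JavaScript spelling of \pL, which requires the u flag), but at the time of writing it was far from being implemented by all major browsers.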
As a matter of personal taste, you could reduce this ugliness a bit:
var characterSet = 'a-zA-ZáàâäãåçéèêëíìîïñóòôöõúùûüýÿæœÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝŸÆŒ';
var re = new RegExp('[' + characterSet + ']' + '[' + characterSet + '\' ,"-]*' + '[' + characterSet + '\'",]+');
Credit for this update goes to @Francesco:
var pCL = 'a-zA-ZáàâäãåçéèêëíìîïñóòôöõúùûüýÿæœÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝŸÆŒ';
var re = new RegExp(`[${pCL}][${pCL}' ,"-]*[${pCL}'",]+`);
console.log(re.source);
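To see where a hand-maintained set and a property escape diverge, here is a minimal sketch. It assumes an engine with ES2018 property escapes; the anchors, the names explicitRe and unicodeRe, and the sample strings are illustrative additions, not part of the original answer.

```javascript
// Hand-maintained letter set copied from the answer above.
const pCL = 'a-zA-ZáàâäãåçéèêëíìîïñóòôöõúùûüýÿæœÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝŸÆŒ';

// Anchored copy of the pattern above (anchors added for this comparison only).
const explicitRe = new RegExp(`^[${pCL}][${pCL}' ,"-]*[${pCL}'",]+$`);

// Same shape with \p{L}; property escapes only work with the u flag.
const unicodeRe = new RegExp(`^\\p{L}[\\p{L}' ,"-]*[\\p{L}'",]+$`, 'u');

// The explicit set has no "Ł", so only the \p{L} version accepts "Łukasz".
for (const name of ["Anne-Marie", "O'Neil", "Łukasz"]) {
  console.log(name, explicitRe.test(name), unicodeRe.test(name));
}
```

The explicit set stays predictable but has to be extended by hand for every new alphabet, while \p{L} follows the Unicode data shipped with the engine.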
Can I use the unicode flag within a JSON schema pattern (regular expression)?
The 2020-12 version of JSON Schema (which you reference) has a separate, more detailed (informative) changelog, which spells out the following point that may not be obvious from the specification itself:
Regular expressions are now expected (but not strictly required) to
support unicode characters. Previously, this was unspecified and
implementations may or may not support this unicode in regular
expressions. - https://json-schema.org/draft/2020-12/release-notes.html
If you are using an implementation which supports JSON Schema draft 2020-12, you should be able to use unicode in regex, as that flag should be enabled.
You cannot specify flags with the regular expression because the actual requirements for regular expression support are only SHOULD and not MUST. In the specification world, this means you cannot rely on this to be interoperable. If you only plan to use the schemas internally, and you test it and it works (it should, given that it sounds like you're working with JS/Node), then you'll probably be OK, but schemas shared with others may not work as expected.
Some implementations in other languages use a port of the ECMA-262 regular expression engine, but not all do, and sometimes there isn't a port available.
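The interoperability concern can be illustrated with plain RegExp objects. This is only a sketch: it assumes a JavaScript-based validator that compiles the pattern keyword with new RegExp(pattern, "u"), and the schema fragment and variable names are hypothetical.

```javascript
// Hypothetical schema fragment; the pattern is illustrative.
const schema = { type: "string", pattern: "^\\p{L}+$" };

// Assumption: a 2020-12-style validator compiles the pattern with the u flag.
const withUnicode = new RegExp(schema.pattern, "u");

// An implementation that omits the flag reads \p{L} as the literal text "p{L}".
const withoutUnicode = new RegExp(schema.pattern);

console.log(withUnicode.test("東京"));    // true
console.log(withoutUnicode.test("東京")); // false
console.log(withoutUnicode.test("p{L}")); // true
```

The same schema therefore accepts different strings depending on how the implementation builds its regular expression, which is why testing against the validators you actually target matters.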
What's the correct regex range for javascript's regexes to match all the non word characters in any script?
Generic solution
Mathias Bynens suggests following the UTS18 recommendation, so a Unicode-aware \W will look like:
[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]
Please note the comment for the suggested Unicode property class combination:
This is only an approximation to Word Boundaries (see b below). The
Connector Punctuation is added in for programming language
identifiers, thus adding "_" and similar characters.
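As a rough illustration, the class above can be used in JavaScript to split text into words. The sketch assumes an ES2018 engine; the sample string and variable names are made up for the example.

```javascript
// UTS18-style Unicode-aware \W, as given above (requires the u flag).
const nonWordRun = /[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]+/u;

const text = "snake_case, 東京 2024!";

// Split on runs of non-word characters and drop empty pieces.
const words = text.split(nonWordRun).filter(Boolean);
console.log(words); // [ 'snake_case', '東京', '2024' ]

// The built-in \W only knows ASCII word characters, so 東京 is lost.
const asciiWords = text.split(/\W+/).filter(Boolean);
console.log(asciiWords); // [ 'snake_case', '2024' ]
```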
More considerations
The \w construct (and thus its \W counterpart), when matching in a Unicode-aware context, matches a similar but somewhat different set of characters across regex engines.
For example, here is the .NET definition of the non-word character \W: [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Mn}\p{Pc}\p{Lm}], where \p{Ll}\p{Lu}\p{Lt}\p{Lo}, together with \p{Lm}, can be contracted to a plain \p{L}, and the pattern is thus equal to [^\p{L}\p{Nd}\p{Mn}\p{Pc}].
In Android (see documentation), it is [^\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}], where \p{gc=Mn}\p{gc=Me}\p{gc=Mc} can be written simply as \p{M}.
In PHP PCRE, \W matches [^\p{L}\p{N}_].
The Rexegg cheat sheet defines Python 3 \w as "Unicode letter, ideogram, digit, or underscore", i.e. [\p{L}\p{Mn}\p{Nd}_].
You may roughly decompose \W as [^\p{L}\p{N}\p{M}\p{Pc}]:
/[^\p{L}\p{N}\p{M}\p{Pc}]/gu
where
[^ - is the start of the negated character class that matches a single char other than:
\p{L} - any Unicode letter
\p{N} - any Unicode number (decimal digits, letter numbers and other numeric characters)
\p{M} - a diacritic mark
\p{Pc} - a connector punctuation symbol
] - end of the character class.
Note that it is the \p{Pc} class that matches an underscore.
NOTE that \p{Alphabetic} (\p{Alpha}) includes all letters matched by \p{L}, plus letter numbers matched by \p{Nl} (e.g. Ⅻ, the character for the Roman numeral 12), plus some other symbols matched with \p{Other_Alphabetic} (\p{OAlpha}).
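The difference between \p{L} and \p{Alphabetic} can be checked directly. This is a small sketch for an ES2018 engine; since \p{Other_Alphabetic} is, as far as I know, not exposed by JavaScript, it is only probed indirectly through \p{Alphabetic}. The chosen sample characters are my own picks.

```javascript
const romanTwelve = "\u216B"; // Ⅻ ROMAN NUMERAL TWELVE, a letter number (Nl)
const circledA = "\u24B6";    // Ⓐ CIRCLED LATIN CAPITAL LETTER A, Other_Alphabetic

// Collect, per character, whether \p{L}, \p{Nl} and \p{Alphabetic} match it.
const report = [romanTwelve, circledA, "a"].map((ch) => ({
  ch,
  letter: /\p{L}/u.test(ch),
  letterNumber: /\p{Nl}/u.test(ch),
  alphabetic: /\p{Alphabetic}/u.test(ch),
}));

console.log(report);
// Ⅻ: letter false, letterNumber true,  alphabetic true
// Ⓐ: letter false, letterNumber false, alphabetic true
// a: letter true,  letterNumber false, alphabetic true
```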
Other variations:
/[^\p{L}0-9_]/gu - to just use a \W that is aware of Unicode letters only
/[^\p{L}\p{N}_]/gu - (PCRE \W style) to just use a \W that is aware of Unicode letters and digits only.
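The two variations differ only in how they treat non-ASCII numbers. A minimal sketch, assuming an ES2018 engine; the sample string with a fullwidth digit is an invented example.

```javascript
const lettersOnly = /[^\p{L}0-9_]/gu;         // Unicode letters, ASCII digits, underscore
const lettersAndNumbers = /[^\p{L}\p{N}_]/gu; // Unicode letters and numbers, underscore

// "\uFF13" is ３ (FULLWIDTH DIGIT THREE), a decimal digit outside ASCII.
const sample = "id_\uFF13 x2!";

const keptAsciiDigits = sample.replace(lettersOnly, "");
const keptAllNumbers = sample.replace(lettersAndNumbers, "");

console.log(keptAsciiDigits); // "id_x2"   (the fullwidth digit is treated as non-word)
console.log(keptAllNumbers);  // "id_３x2" (the fullwidth digit is kept)
```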
Note that Java's (?U)\W will match a mix of what \W matches in PCRE, Python and .NET.
Match only unicode letters
Starting with ECMAScript 2018, JavaScript finally supports Unicode property escapes natively.
For older versions, you either need to define all the relevant Unicode ranges yourself, or you can use Steven Levithan's XRegExp package with Unicode add-ons and utilize its Unicode property shortcuts:
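// Assumes XRegExp and its Unicode add-ons are already loaded
// (for example via a script tag, or require("xregexp") in Node).
var regex = new XRegExp("^\\p{L}*$");
var a = "abcäöüéèê";
if (regex.test(a)) {
  // Match: the string consists only of Unicode letters (or is empty)
} else {
  // No match
}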
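For comparison, the same check can be written with native property escapes. This sketch assumes an engine that implements ES2018; the variable names are illustrative, and the test string is written with escapes only to make sure it is in precomposed form.

```javascript
// Native counterpart of the XRegExp pattern above; the u flag is required.
const onlyLetters = /^\p{L}*$/u;

const a = "abc\u00E4\u00F6\u00FC\u00E9\u00E8\u00EA"; // "abcäöüéèê", precomposed

console.log(onlyLetters.test(a));        // true
console.log(onlyLetters.test("abc123")); // false
console.log(onlyLetters.test(""));       // true, because * also accepts zero letters
```

If empty input should be rejected, using + instead of * is the usual adjustment.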