Match Only Unicode Letters

Match only unicode letters

Starting with ECMAScript 2018, JavaScript finally supports Unicode property escapes natively, for example /^\p{L}*$/u (the u flag is required).

For older versions, you either need to define all the relevant Unicode ranges yourself, or you can use Steven Levithan's XRegExp package with its Unicode add-ons and its Unicode property shortcuts:

var regex = new XRegExp("^\\p{L}*$");
var a = "abcäöüéèê";
if (regex.test(a)) {
    // Match
} else {
    // No match
}

Matching only a unicode letter in Python re

You can construct a new character class:

[^\W\d_]

instead of \w. Translated into English, it means "Any character that is not a non-alphanumeric character ([^\W] is the same as \w), but that is also not a digit and not an underscore".

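Therefore, it will match Unicode letters, along with a handful of non-decimal numeric characters (such as ²) that \w accepts but \d does not exclude.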
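
As a minimal sketch of using this class to validate a whole string in Python 3, re.fullmatch can be combined with a + quantifier. The helper name is_letters_only is made up for illustration and is not part of the original answer:

```python
import re

# [^\W\d_] = word characters minus decimal digits and the underscore.
LETTERS_ONLY = re.compile(r'[^\W\d_]+')

def is_letters_only(text):
    """Return True if the whole string consists of characters in the class above."""
    return LETTERS_ONLY.fullmatch(text) is not None

print(is_letters_only('abcäöüéèê'))   # letters only
print(is_letters_only('abc123'))      # contains decimal digits
print(is_letters_only('snake_case'))  # contains an underscore
```

Because of the + quantifier, an empty string is rejected; swap it for * if empty input should pass.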

Match any unicode letter?

Python's re module doesn't support Unicode properties yet. But you can compile your regex using the re.UNICODE flag (in Python 3 this is already the default for str patterns), and then the character class shorthand \w will match Unicode letters, too.

Since \w will also match digits, you need to then subtract those from your character class, along with the underscore:

[^\W\d_]

will match any Unicode letter.

>>> import re
>>> r = re.compile(r'[^\W\d_]', re.U)
>>> r.match('x')
<_sre.SRE_Match object at 0x0000000001DBCF38>
>>> r.match(u'é')
<_sre.SRE_Match object at 0x0000000002253030>
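
One caveat worth checking: \w accepts everything str.isalnum() does, so it also covers a few numeric characters that \d (decimal digits only) does not subtract. The sketch below assumes Python 3 and uses a hypothetical helper, is_letter, to compare the character class with a stricter check on the Unicode general category:

```python
import re
import unicodedata

LETTER_CLASS = re.compile(r'[^\W\d_]')

def is_letter(ch):
    """Stricter check: True only for general categories Lu, Ll, Lt, Lm, Lo."""
    return unicodedata.category(ch).startswith('L')

# Print, for each sample character, what the regex class says
# next to what the Unicode category says.
for ch in ['x', 'é', '5', '_', '²', '½']:
    print(repr(ch), bool(LETTER_CLASS.fullmatch(ch)), is_letter(ch))
```

If the two columns disagree for characters that matter in your data, the category check is the safer of the two.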

Regex - Match only unicode alphabet not numbers

The regex engine needs to know that the target string is a Unicode string (to avoid interpretation errors). To do that, you can use the u modifier, which has two functions:

  • it expands classical shorthand character classes like \w and \d to Unicode characters (and not only ASCII characters)
  • it forces the string to be seen as a Unicode string

So you can use: /\pL+/u

Note that in your particular case the first behavior is not needed; you can switch on only the second behavior with /(*UTF8)\pL+/ (the (*UTF8) must be placed at the very beginning of the pattern).

JavaScript regex pattern for any visible unicode letter characters

Use the XRegExp library to parse your current regular expression:

var pattern = new XRegExp("^[0-9\\p{L} _.]+$");
var s = "123 Московская Street.";
if (XRegExp.test(s, pattern)) {
    console.log("Valid");
}
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.2.0/xregexp-all.min.js"></script>

Matching every Unicode letter only in HTML5 Input form

If you're using a browser that does support \p{}, and doesn't require the u switch to enable it, your code works, but you should remove the brackets because they're unnecessary:

<input type="text" pattern="\p{L}+\s\p{L}+">

It worked when I tested it in Chrome.

Older JavaScript versions (before ES2018) do not support \p{} at all, and some versions may need the u switch to enable it, which won't work here. If you really need it, I suggest that you try the solutions here: How can I use Unicode-aware regular expressions in JavaScript?.

If you just don't like digits, then you can use \D as tamas rev said in the comments. Or maybe [^\d\s] to enforce that your input isn't just spaces.

Note that only matching letters is a bad way to validate names, since it excludes names like "O'Henry". Note that forcing exactly one space to be present excludes languages where the names are not separated with a space (like in the name "蔡英文"), people who only have one name, and people whose names have more than one space ("Mary Jane", "van der Waals"). And some names do have numbers. See Falsehoods Programmers Believe About Names.

Regex: Match everything except unicode letters

[\W\d_] is a regex that matches any non-word char (any char not matched by \w), any digit matched by \d, and _. Note that \d in a Unicode-aware Python 3 regex only matches \p{Nd} (Number, decimal):

Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd]).

The chars this pattern does not remove in your string belong to the \p{No} Unicode category (numbers, other).

So, if you plan to also remove all those chars from \p{No}, you need to add them to the pattern:

r'[\u00B2\u00B3\u00B9\u00BC-\u00BE\u09F4-\u09F9\u0B72-\u0B77\u0BF0-\u0BF2\u0C78-\u0C7E\u0D58-\u0D5E\u0D70-\u0D78\u0F2A-\u0F33\u1369-\u137C\u17F0-\u17F9\u19DA\u2070\u2074-\u2079\u2080-\u2089\u2150-\u215F\u2189\u2460-\u249B\u24EA-\u24FF\u2776-\u2793\u2CFD\u3192-\u3195\u3220-\u3229\u3248-\u324F\u3251-\u325F\u3280-\u3289\u32B1-\u32BF\uA830-\uA835\U00010107-\U00010133\U00010175-\U00010178\U0001018A\U0001018B\U000102E1-\U000102FB\U00010320-\U00010323\U00010858-\U0001085F\U00010879-\U0001087F\U000108A7-\U000108AF\U000108FB-\U000108FF\U00010916-\U0001091B\U000109BC\U000109BD\U000109C0-\U000109CF\U000109D2-\U000109FF\U00010A40-\U00010A47\U00010A7D\U00010A7E\U00010A9D-\U00010A9F\U00010AEB-\U00010AEF\U00010B58-\U00010B5F\U00010B78-\U00010B7F\U00010BA9-\U00010BAF\U00010CFA-\U00010CFF\U00010E60-\U00010E7E\U00011052-\U00011065\U000111E1-\U000111F4\U0001173A\U0001173B\U000118EA-\U000118F2\U00011C5A-\U00011C6C\U00016B5B-\U00016B61\U0001D360-\U0001D371\U0001E8C7-\U0001E8CF\U0001F100-\U0001F10C\W\d_]+'

See the regex demo.

You may see the chars listed on this page.

Also, be aware of the Number, letter category; see the \p{Nl} char list here.
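
As an alternative to listing every \p{No} range by hand, a sketch like the following filters by Unicode general category using the standard unicodedata module. The function name strip_non_letters is hypothetical, and this is a different technique from the regex above rather than a drop-in replacement:

```python
import unicodedata

def strip_non_letters(text):
    """Keep only characters whose general category starts with 'L' (letters)."""
    return ''.join(
        ch for ch in text
        if unicodedata.category(ch).startswith('L')
    )

# Superscripts and fractions (No), Roman numerals (Nl), decimal digits (Nd),
# the underscore, and spaces are all dropped.
print(strip_non_letters('abc²³ ½ Ⅷ _42 é'))
```

Since the check relies on the category table shipped with the interpreter, it also picks up numeric characters added in newer Unicode versions without the pattern having to be edited.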
