Match Any Unicode Letter

Match any unicode letter?

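Python's re module doesn't support Unicode properties (\p{...}) yet. But you can compile your regex using the re.UNICODE flag, and then the character class shorthand \w will match Unicode letters, too. (In Python 3, str patterns are Unicode-aware by default, so the flag is only needed in Python 2.)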

Since \w will also match digits, you need to then subtract those from your character class, along with the underscore:

[^\W\d_]

will match any Unicode letter.

>>> import re
>>> r = re.compile(r'[^\W\d_]', re.U)
>>> r.match('x')
<_sre.SRE_Match object at 0x0000000001DBCF38>
>>> r.match(u'é')
<_sre.SRE_Match object at 0x0000000002253030>

Is There a Way to Match Any Unicode Alphabetic Character?

Check out Unicode character properties: http://www.regular-expressions.info/unicode.html#prop. I think what you are looking for is probably

\p{L}

which will match any letters or ideographs. You may also want to include letters with marks on them, so you could do

\p{L}\p{M}*

In any case, all the different types of character properties are detailed in the first link.

Edit: You may also want to look at this Stack Overflow answer discussing whether \w matches unicode characters. They suggest that you could also use \p{Word} or \p{Alnum}: Does \w match all alphanumeric characters defined in the Unicode standard?
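
Python's built-in re module does not understand \p{L} or \p{M}, so here is a minimal sketch of the same idea (a letter followed by any number of combining marks) using the standard unicodedata module instead. The function name letter_clusters is made up for this illustration; it is not part of any library.

```python
import unicodedata


def letter_clusters(text):
    """Group each letter with the combining marks that directly follow it.

    Rough stand-in for finding all matches of \\p{L}\\p{M}* in text.
    """
    clusters = []
    in_cluster = False  # True while the previous char belonged to a cluster
    for ch in text:
        major = unicodedata.category(ch)[0]  # "L" for letters, "M" for marks
        if major == "L":
            clusters.append(ch)
            in_cluster = True
        elif major == "M" and in_cluster:
            clusters[-1] += ch
        else:
            in_cluster = False
    return clusters


# "e" followed by U+0301 (combining acute accent), then "a", a space and a digit
print(letter_clusters("e\u0301a 1"))
```

A mark that is not preceded by a letter is skipped, which mirrors how \p{L}\p{M}* needs a letter to start a match.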

JavaScript regex pattern for any visible unicode letter characters

Use XRegExp library to parse your current regular expression:

var pattern = new XRegExp("^[0-9\\p{L} _.]+$");
var s = "123 Московская Street.";
if (XRegExp.test(s, pattern)) {
    console.log("Valid");
}
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.2.0/xregexp-all.min.js"></script>
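
For comparison, a rough Python counterpart of the same check might look like the sketch below. It assumes that [^\W\d_] is an acceptable stand-in for \p{L} (the two are close but not identical, since \w also accepts a few numeric characters that \d does not), and the helper name is_valid is made up for this example.

```python
import re

# Hypothetical counterpart of ^[0-9\p{L} _.]+$ for Python's re module:
# [^\W\d_] stands in for \p{L}; the other branch lists the extra allowed chars.
pattern = re.compile(r"(?:[0-9 _.]|[^\W\d_])+")


def is_valid(s):
    # fullmatch plays the role of the ^ and $ anchors
    return pattern.fullmatch(s) is not None


if is_valid("123 Московская Street."):
    print("Valid")
```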

Regex: Match everything except unicode letters

The pattern [\W\d_] matches any non-word char (\W, that is, any char not matched by \w), any digit (\d), and the underscore. Note that \d in a Unicode-aware Python 3 regex only matches \p{Nd} (Number, decimal digit):

Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd]).

The chars this pattern does not remove in your string belong to the \p{No} Unicode category (numbers, other).

So, if you plan to also remove all those chars from \p{No}, you need to add them to the pattern:

r'[\u00B2\u00B3\u00B9\u00BC-\u00BE\u09F4-\u09F9\u0B72-\u0B77\u0BF0-\u0BF2\u0C78-\u0C7E\u0D58-\u0D5E\u0D70-\u0D78\u0F2A-\u0F33\u1369-\u137C\u17F0-\u17F9\u19DA\u2070\u2074-\u2079\u2080-\u2089\u2150-\u215F\u2189\u2460-\u249B\u24EA-\u24FF\u2776-\u2793\u2CFD\u3192-\u3195\u3220-\u3229\u3248-\u324F\u3251-\u325F\u3280-\u3289\u32B1-\u32BF\uA830-\uA835\U00010107-\U00010133\U00010175-\U00010178\U0001018A\U0001018B\U000102E1-\U000102FB\U00010320-\U00010323\U00010858-\U0001085F\U00010879-\U0001087F\U000108A7-\U000108AF\U000108FB-\U000108FF\U00010916-\U0001091B\U000109BC\U000109BD\U000109C0-\U000109CF\U000109D2-\U000109FF\U00010A40-\U00010A47\U00010A7D\U00010A7E\U00010A9D-\U00010A9F\U00010AEB-\U00010AEF\U00010B58-\U00010B5F\U00010B78-\U00010B7F\U00010BA9-\U00010BAF\U00010CFA-\U00010CFF\U00010E60-\U00010E7E\U00011052-\U00011065\U000111E1-\U000111F4\U0001173A\U0001173B\U000118EA-\U000118F2\U00011C5A-\U00011C6C\U00016B5B-\U00016B61\U0001D360-\U0001D371\U0001E8C7-\U0001E8CF\U0001F100-\U0001F10C\W\d_]+'

See the regex demo.

You may see the chars listed on this page.

Also, be aware of the Number, letter category; see the \p{Nl} char list here.
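
Instead of hard-coding the \p{No} ranges, the class can also be built at run time from the standard unicodedata module. This is only a sketch: the names no_chars and not_letter are made up here, and the exact set of characters depends on the Unicode database version bundled with your Python build.

```python
import re
import sys
import unicodedata

# Collect every code point whose general category is "No" (Number, other).
# None of these chars is special inside a character class, so no escaping is needed.
no_chars = "".join(
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "No"
)

# Same shape as the pattern above: the \p{No} chars plus \W, \d and the underscore.
not_letter = re.compile("[" + no_chars + r"\W\d_]+")

# U+00B2 (superscript two) and U+00BD (one half) are both in category No.
cleaned = not_letter.sub("", "abc\u00b2\u00bd123_ d\u00e9f")
print(cleaned)
```

To cover \p{Nl} as well, the same loop could test for the category "Nl" in addition to "No".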

How to match a complete string containing unicode characters?

You may use

^[\p{L}\p{M}]+$

See Go demo.

Details

  • ^ - start of string
  • [ - start of a character class that matches

    • \p{L} - any Unicode letter
    • \p{M} - any combining mark (diacritic)
  • ]+ - end of the character class, repeat 1+ times
  • $ - end of string.

If you plan to also match digits and _ as \w does, add them to the character class, ^[\p{L}\p{M}0-9_]+$ or ^[\p{L}\p{M}\p{N}_]+$.
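
If you need the same whole-string check in Python, where the built-in re module has no \p{L} or \p{M}, one possible sketch uses the standard unicodedata module. The function name is_letters_and_marks is made up for this illustration.

```python
import unicodedata


def is_letters_and_marks(text):
    """Approximate ^[\\p{L}\\p{M}]+$: non-empty, and every char is a letter or a mark."""
    return bool(text) and all(
        unicodedata.category(ch)[0] in "LM" for ch in text
    )


print(is_letters_and_marks("Московская"))  # letters only
print(is_letters_and_marks("abc1"))        # contains a digit
```

The bool(text) check plays the role of the + quantifier, so an empty string is rejected.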

Matching every Unicode letter only in HTML5 Input form

If you're using a browser that does support \p{}, and doesn't require the u switch to enable it, your code works, but you should remove the brackets because they're unnecessary:

<input type="text" pattern="\p{L}+\s\p{L}+">

It worked when I tested it in Chrome.

Older JavaScript versions (before ES2018) do not support \p{} at all, and some versions may need the u switch to enable it, which won't work here. If you really need it, I suggest that you try the solutions here: How can I use Unicode-aware regular expressions in JavaScript?.

If you just don't like digits, then you can use \D as tamas rev said in the comments. Or maybe [^\d\s] to enforce that your input isn't just spaces.

Note that only matching letters is a bad way to validate names, since it excludes names like "O'Henry". Note that forcing exactly one space to be present excludes languages where the names are not separated with a space (like in the name "蔡英文"), people who only have one name, and people whose names have more than one space ("Mary Jane", "van der Waals"). And some names do have numbers. See Falsehoods Programmers Believe About Names.

How to write regular expression matching all unicode characters in Python?

You can combine a negative lookahead with \w to match "word characters" excluding digits and underscores:

re.compile(r"(?:(?![\d_])\w)+", re.UNICODE)
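
As a quick illustration of how that pattern behaves in Python 3 (where str patterns are already Unicode-aware, so the flag is omitted), here is a small sketch. The sample string and the variable names are made up for this example.

```python
import re

# Each \w is accepted only if the lookahead confirms it is not a digit or "_".
letters = re.compile(r"(?:(?![\d_])\w)+")

# \u00e9 is "é" and \u00f6 is "ö"
words = letters.findall("h\u00e9llo_w\u00f6rld 42 ok")
print(words)
```

The underscore and the digits act as separators here, so the sample splits into three runs of letters. Keep in mind that \w also accepts a few numeric characters that \d does not (such as "²"), and those would still be matched.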

Matching only a unicode letter in Python re

You can construct a new character class:

[^\W\d_]

instead of \w. Translated into English, it means "Any character that is not a non-alphanumeric character ([^\W] is the same as \w), but that is also not a digit and not an underscore".

Therefore, it will only allow Unicode letters.
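
A short sketch of the class in use under Python 3 (the sample string and the variable names are made up for this example):

```python
import re

# [^\W\d_] = word chars minus digits minus the underscore
letter_run = re.compile(r"[^\W\d_]+")

# \u00e9 is "é"; \u8521\u82f1\u6587 is "蔡英文"
found = letter_run.findall("abc_123 \u00e9t\u00e9 \u8521\u82f1\u6587")
print(found)
```

Latin, accented and CJK letters are all picked up, while the underscore, digits and spaces are left out.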

How can I match unicode characters and non digits using regex?

You seem to want to match ASCII letters or any chars from \x80 up to \uFFFF.

In this case, you may use

[a-zA-Z\u0080-\uFFFF]+

Note that you should no longer rely on word boundaries, as the pattern can match non-word chars now (your previous one only matched "word" chars).

See the regex demo.
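
To see why word boundaries are no longer reliable with this range, here is a small Python sketch (the sample string is made up). It shows that the class also accepts non-letter symbols above \x7F, such as the multiplication sign.

```python
import re

# ASCII letters plus everything from U+0080 to U+FFFF, letters or not
pattern = re.compile(r"[a-zA-Z\u0080-\uFFFF]+")

# \u00e9 is "é" (a letter), \u00d7 is "×" (a symbol, not a word char)
matches = pattern.findall("caf\u00e9 42 \u00d7 ok")
print(matches)
```

Note that the range stops at \uFFFF, so characters outside the Basic Multilingual Plane are not matched by this pattern.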

Note that you should only test your regex pattern in those online testers that are compatible with your target regex library. regex101.com has proved to be a good tester for PCRE, JS, Python and Go patterns. Regexr currently only supports JS and PCRE flavors.


