Match any unicode letter?
Python's re module doesn't support Unicode properties yet. But you can compile your regex using the re.UNICODE flag, and then the character class shorthand \w will match Unicode letters, too.
Since \w will also match digits, you need to then subtract those from your character class, along with the underscore:
[^\W\d_]
will match any Unicode letter.
>>> import re
>>> r = re.compile(r'[^\W\d_]', re.U)
>>> r.match('x')
<_sre.SRE_Match object at 0x0000000001DBCF38>
>>> r.match(u'é')
<_sre.SRE_Match object at 0x0000000002253030>
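For Python 3, the same idea might be sketched as follows. There, str patterns are Unicode-aware by default, so the re.U flag is redundant. The name LETTER and the sample string are made up for illustration.

```python
import re

# [^\W\d_] reads as "a word character that is neither a digit nor an underscore".
# Python 3 str patterns are Unicode-aware by default, so no flag is passed.
LETTER = re.compile(r'[^\W\d_]')

# Illustrative input: Latin, accented and Cyrillic letters mixed with an
# ASCII digit, an underscore and an Arabic-Indic digit (U+0663).
sample = 'x1_\u00e9\u0663\u0436'

letters = LETTER.findall(sample)
print(letters)
```

Only the three letters should survive; both kinds of digit and the underscore are skipped.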
Is There a Way to Match Any Unicode Alphabetic Character?
Check out Unicode character properties: http://www.regular-expressions.info/unicode.html#prop. I think what you are looking for is probably
\p{L}
which will match any letters or ideographs. You may also want to include letters with marks on them, so you could do
\p{L}\p{M}*
In any case, all the different types of character properties are detailed in the first link.
Edit: You may also want to look at this Stack Overflow answer discussing whether \w matches Unicode characters. They suggest that you could also use \p{Word} or \p{Alnum}: Does \w match all alphanumeric characters defined in the Unicode standard?
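Python's standard re module has no \p{...} syntax, so as a rough illustration of what \p{L}\p{M}* captures, here is a sketch that emulates it with the unicodedata module. The helper name letter_clusters is hypothetical, and this is an approximation of the regex, not a replacement for a property-aware engine.

```python
import unicodedata

def letter_clusters(text):
    # Emulates finding every match of "one letter (category L*) followed by
    # any number of combining marks (category M*)".
    clusters = []
    in_cluster = False  # True while the previous char belongs to a cluster
    for ch in text:
        major = unicodedata.category(ch)[0]  # 'L', 'M', 'N', 'Z', ...
        if major == 'L':
            clusters.append(ch)      # a letter starts a new cluster
            in_cluster = True
        elif major == 'M' and in_cluster:
            clusters[-1] += ch       # a mark attaches to the preceding letter
        else:
            in_cluster = False       # anything else ends the cluster
    return clusters

# NFD splits the accented letter into a base letter plus U+0301 (combining acute).
decomposed = unicodedata.normalize('NFD', 'caf\u00e9 1')
result = letter_clusters(decomposed)
print(result)
```

With decomposed input, the base letter and its combining accent stay together in one cluster, which is the reason for appending \p{M}* to \p{L}.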
JavaScript regex pattern for any visible unicode letter characters
Use the XRegExp library to parse your current regular expression:
var pattern = new XRegExp("^[0-9\\p{L} _.]+$");
var s = "123 Московская Street.";
if (XRegExp.test(s, pattern)) {
  console.log("Valid");
}
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.2.0/xregexp-all.min.js"></script>
Regex: Match everything except unicode letters
[\W\d_] is a regex that matches any non-word char (any char not matched with \w), as well as digits (\d) and the underscore (_). Note that \d in a Unicode-aware Python 3 regex only matches \p{Nd} (Number, decimal):
"Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd])."
The chars this pattern does not remove in your string belong to the \p{No} Unicode category (numbers, other).
So, if you plan to also remove all those chars from \p{No}, you need to add them to the pattern:
r'[\u00B2\u00B3\u00B9\u00BC-\u00BE\u09F4-\u09F9\u0B72-\u0B77\u0BF0-\u0BF2\u0C78-\u0C7E\u0D58-\u0D5E\u0D70-\u0D78\u0F2A-\u0F33\u1369-\u137C\u17F0-\u17F9\u19DA\u2070\u2074-\u2079\u2080-\u2089\u2150-\u215F\u2189\u2460-\u249B\u24EA-\u24FF\u2776-\u2793\u2CFD\u3192-\u3195\u3220-\u3229\u3248-\u324F\u3251-\u325F\u3280-\u3289\u32B1-\u32BF\uA830-\uA835\U00010107-\U00010133\U00010175-\U00010178\U0001018A\U0001018B\U000102E1-\U000102FB\U00010320-\U00010323\U00010858-\U0001085F\U00010879-\U0001087F\U000108A7-\U000108AF\U000108FB-\U000108FF\U00010916-\U0001091B\U000109BC\U000109BD\U000109C0-\U000109CF\U000109D2-\U000109FF\U00010A40-\U00010A47\U00010A7D\U00010A7E\U00010A9D-\U00010A9F\U00010AEB-\U00010AEF\U00010B58-\U00010B5F\U00010B78-\U00010B7F\U00010BA9-\U00010BAF\U00010CFA-\U00010CFF\U00010E60-\U00010E7E\U00011052-\U00011065\U000111E1-\U000111F4\U0001173A\U0001173B\U000118EA-\U000118F2\U00011C5A-\U00011C6C\U00016B5B-\U00016B61\U0001D360-\U0001D371\U0001E8C7-\U0001E8CF\U0001F100-\U0001F10C\W\d_]+'
See the regex demo. You may see the chars listed on this page.
Also, be aware of the Number, letter category; see the \p{Nl} char list here.
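Instead of hard-coding the ranges, the extra characters could be collected from Python's own Unicode database. The following is only a sketch under that assumption: the helper chars_in_categories is hypothetical, and the resulting set depends on the Unicode version bundled with the interpreter, so it may differ slightly from the literal list above.

```python
import re
import sys
import unicodedata

def chars_in_categories(categories):
    # Collect every code point whose general category is in `categories`.
    return ''.join(
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) in categories
    )

# 'No' = Number, other. Add 'Nl' (Number, letter) to the tuple if those
# characters should be removed as well.
extra = chars_in_categories(('No',))

# Same shape as the pattern above: the collected chars plus \W, \d and "_".
NON_LETTERS = re.compile('[' + re.escape(extra) + r'\W\d_]+')

# Illustrative input containing U+00B2 and U+00BD, both in category No.
cleaned = NON_LETTERS.sub('', 'abc\u00b2\u00bd123_ d\u00e9f')
print(cleaned)
```

Scanning the full code point range takes a moment, so it makes sense to build the pattern once and reuse it.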
How to match a complete string containing unicode characters?
You may use
^[\p{L}\p{M}]+$
See Go demo.
Details
^ - start of string
[ - start of a character class that matches
  \p{L} - any Unicode letter
  \p{M} - any diacritic
]+ - end of the character class, repeat 1+ times
$ - end of string.
To also match digits and _ as \w does, add them to the character class: ^[\p{L}\p{M}0-9_]+$ or ^[\p{L}\p{M}\p{N}_]+$.
Matching every Unicode letter only in HTML5 Input form
If you're using a browser that does support \p{}, and doesn't require the u switch to enable it, your code works, but you should remove the brackets because they're unnecessary:
<input type="text" pattern="\p{L}+\s\p{L}+">
It worked when I tested it in Chrome.
Older JavaScript versions (before ES2018) do not support \p{} at all, and some versions may need the u switch to enable it, which won't work here. If you really need it, I suggest that you try the solutions here: How can I use Unicode-aware regular expressions in JavaScript?
If you just don't like digits, then you can use \D as tamas rev said in the comments. Or maybe [^\d\s] to enforce that your input isn't just spaces.
Note that only matching letters is a bad way to validate names, since it excludes names like "O'Henry". Note that forcing exactly one space to be present excludes languages where the names are not separated with a space (like in the name "蔡英文"), people who only have one name, and people whose names have more than one space ("Mary Jane", "van der Waals"). And some names do have numbers. See Falsehoods Programmers Believe About Names.
How to write regular expression matching all unicode characters in Python?
You can combine a negative lookahead with \w to match "word characters" excluding digits and underscores:
re.compile(r"(?:(?![\d_])\w)+", re.UNICODE)
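As a quick sanity check, the lookahead form can be compared with the negated class [^\W\d_]+ on a sample string. This is a sketch for Python 3, where the re.UNICODE flag can be omitted for str patterns. The names LOOKAHEAD and NEGATED and the sample text are illustrative.

```python
import re

# The lookahead pattern from the answer: a word char not preceded by a
# successful look at a digit or underscore, repeated.
LOOKAHEAD = re.compile(r'(?:(?![\d_])\w)+')

# The negated-class form: not a non-word char, not a digit, not "_".
NEGATED = re.compile(r'[^\W\d_]+')

# Illustrative input: ASCII, accented and Cyrillic letters, an underscore,
# ASCII digits and an Arabic-Indic digit (U+0663).
sample = 'abc_123 d\u00e9f \u0436\u0663'

lookahead_hits = LOOKAHEAD.findall(sample)
negated_hits = NEGATED.findall(sample)
print(lookahead_hits)
print(negated_hits)
```

Both patterns should return the same runs of letters for this input. The negated class is shorter, while the lookahead form keeps \w visible in the pattern.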
Matching only a unicode letter in Python re
You can construct a new character class:
[^\W\d_]
instead of \w. Translated into English, it means "Any character that is not a non-alphanumeric character ([^\W] is the same as \w), but that is also not a digit and not an underscore".
Therefore, it will only allow Unicode letters.
How can I match unicode characters and non digits using regex?
You seem to want to match ASCII letters or any chars from \x80 upward.
In this case, you may use
[a-zA-Z\u0080-\uFFFF]+
Note that you should no longer rely on word boundaries, as the pattern can match non-word chars now (your previous one only matched "word" chars).
See the regex demo.
Note that you should only test your regex pattern in those online testers that are compatible with your target regex library. regex101.com has proved to be a good tester for PCRE, JS, Python and Go patterns. Regexr currently only supports JS and PCRE flavors.
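To illustrate what such a range does and does not match, here is a small sketch using Python's re module, which accepts \uXXXX escapes in patterns. The regex flavor of the original question may differ, and the sample strings are made up.

```python
import re

# ASCII letters plus everything from U+0080 to U+FFFF.
RANGE = re.compile(r'[a-zA-Z\u0080-\uFFFF]+')

# ASCII digits and the underscore fall outside the class and split the matches.
words = RANGE.findall('abc123 d\u00e9f_ghi')

# U+00D7 (multiplication sign) is not a letter, yet it lies inside the range.
symbols = RANGE.findall('2\u00d73')

# Characters beyond U+FFFF, such as U+1F600, are not covered at all.
astral = RANGE.findall('\U0001F600')

print(words, symbols, astral)
```

The second and third calls show the trade-off: a raw code point range is simple, but it admits non-letter symbols and misses characters outside the Basic Multilingual Plane.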