Python and regular expression with Unicode
Are you using Python 2.x or 3.x?
If you're using 2.x, try making the regex string a unicode-escape string, with 'u'. Since it's regex it's good practice to make your regex string a raw string, with 'r'. Also, putting your entire pattern in parentheses is superfluous.
re.sub(ur'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+', '', ...)
http://docs.python.org/tutorial/introduction.html#unicode-strings
Edit:
It's also good practice to use the re.UNICODE/re.U/(?u) flag for unicode regexes, but it only affects shorthand classes and assertions such as \w or \b. This pattern uses none of them, so the flag would make no difference here.
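For readers on Python 3, the same substitution can be sketched as follows. This is only an illustration: the helper name strip_marks and the sample word are assumptions, not part of the original question. In Python 3 every str is Unicode, so no u prefix is needed, and the re module itself understands \uXXXX escapes inside a raw string.

```python
import re

# Same character class as in the answer above, written for Python 3.
ARABIC_MARKS = re.compile(r'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+')

def strip_marks(text):
    """Remove the code points covered by the class above (hypothetical helper)."""
    return ARABIC_MARKS.sub('', text)

# Illustrative sample: four Arabic letters with vowel marks between them.
sample = '\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F'
stripped = strip_marks(sample)
print(stripped)
```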
Python regular expression with unicode
Method 1 does not work because \u#### doesn't mean anything in an encoded byte string. Instead, you need the correct sequence of bytes. If you do this, then method 1 will produce the same results as method 2. I modified your code as follows:
# -*- coding: utf-8 -*-
import sys
import re
text = """
saú$_ß$¤×÷asd县阴őasdCharacters: \"县阴 asdsadsasd县阴
"""
text = unicode(text, "utf-8")
print("\nMethod 1\n")
reg = "Characters: \"[\xe4\xb8\x80-\xe9\xbf\xbf]+.*?"
reg = unicode(reg, "utf-8")
pattern = re.compile(reg, re.UNICODE | re.MULTILINE)
for m in re.findall(pattern, text): # Number of occurrences in the 'k' line.
    print("Results: %s" % m.encode(sys.stdout.encoding, errors='replace'))
print("\nMethod 2\n")
reg = u"Characters: \"[\u4e00-\u9fff]+.*?"
pattern = re.compile(reg, re.UNICODE | re.MULTILINE)
for m in re.findall(pattern, text): # Number of occurrences in the 'k' line.
    print("Results: %s" % m.encode(sys.stdout.encoding, errors='replace'))
It produces the following results on my machine:
Method 1
Results: Characters: "县阴
Method 2
Results: Characters: "县阴
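The script above targets Python 2. As a rough Python 3 sketch of method 2 (an adaptation, not the code from the answer), no decoding step is needed because str is already Unicode. The trailing .*? is dropped here since a lazy quantifier at the end of a pattern matches nothing anyway.

```python
import re

text = 'saú$_ß$¤×÷asd县阴őasdCharacters: "县阴 asdsadsasd县阴'

# \u4e00-\u9fff is the same CJK range used in method 2 above.
pattern = re.compile(r'Characters: "[\u4e00-\u9fff]+')

results = pattern.findall(text)
for m in results:
    print("Results: %s" % m)
```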
Python regex for unicode capitalized words
If you need to use a regex, you have 2 options:
- Install the PyPI regex module and use the \p{Lu} class or the [[:upper:]] class (the latter contains more uppercase chars); make sure you have the latest version installed.
- Use re with a character class containing all uppercase letter ranges, either using Python utilities (and then the amount of the Unicode letters matched will depend on the Python version, the latest having up-to-date data) or by manually creating/updating the range from the Unicode Utilities CLDR page.
Here is a solution with a regex containing all uppercase letter ranges taken from Unicode Utilities CLDR reference page:
import re
pLu = "[A-Z\u00C0-\u00D6\u00D8-\u00DE\u0100\u0102\u0104\u0106\u0108\u010A\u010C\u010E\u0110\u0112\u0114\u0116\u0118\u011A\u011C\u011E\u0120\u0122\u0124\u0126\u0128\u012A\u012C\u012E\u0130\u0132\u0134\u0136\u0139\u013B\u013D\u013F\u0141\u0143\u0145\u0147\u014A\u014C\u014E\u0150\u0152\u0154\u0156\u0158\u015A\u015C\u015E\u0160\u0162\u0164\u0166\u0168\u016A\u016C\u016E\u0170\u0172\u0174\u0176\u0178\u0179\u017B\u017D\u0181\u0182\u0184\u0186\u0187\u0189-\u018B\u018E-\u0191\u0193\u0194\u0196-\u0198\u019C\u019D\u019F\u01A0\u01A2\u01A4\u01A6\u01A7\u01A9\u01AC\u01AE\u01AF\u01B1-\u01B3\u01B5\u01B7\u01B8\u01BC\u01C4\u01C7\u01CA\u01CD\u01CF\u01D1\u01D3\u01D5\u01D7\u01D9\u01DB\u01DE\u01E0\u01E2\u01E4\u01E6\u01E8\u01EA\u01EC\u01EE\u01F1\u01F4\u01F6-\u01F8\u01FA\u01FC\u01FE\u0200\u0202\u0204\u0206\u0208\u020A\u020C\u020E\u0210\u0212\u0214\u0216\u0218\u021A\u021C\u021E\u0220\u0222\u0224\u0226\u0228\u022A\u022C\u022E\u0230\u0232\u023A\u023B\u023D\u023E\u0241\u0243-\u0246\u0248\u024A\u024C\u024E\u0370\u0372\u0376\u037F\u0386\u0388-\u038A\u038C\u038E\u038F\u0391-\u03A1\u03A3-\u03AB\u03CF\u03D2-\u03D4\u03D8\u03DA\u03DC\u03DE\u03E0\u03E2\u03E4\u03E6\u03E8\u03EA\u03EC\u03EE\u03F4\u03F7\u03F9\u03FA\u03FD-\u042F\u0460\u0462\u0464\u0466\u0468\u046A\u046C\u046E\u0470\u0472\u0474\u0476\u0478\u047A\u047C\u047E\u0480\u048A\u048C\u048E\u0490\u0492\u0494\u0496\u0498\u049A\u049C\u049E\u04A0\u04A2\u04A4\u04A6\u04A8\u04AA\u04AC\u04AE\u04B0\u04B2\u04B4\u04B6\u04B8\u04BA\u04BC\u04BE\u04C0\u04C1\u04C3\u04C5\u04C7\u04C9\u04CB\u04CD\u04D0\u04D2\u04D4\u04D6\u04D8\u04DA\u04DC\u04DE\u04E0\u04E2\u04E4\u04E6\u04E8\u04EA\u04EC\u04EE\u04F0\u04F2\u04F4\u04F6\u04F8\u04FA\u04FC\u04FE\u0500\u0502\u0504\u0506\u0508\u050A\u050C\u050E\u0510\u0512\u0514\u0516\u0518\u051A\u051C\u051E\u0520\u0522\u0524\u0526\u0528\u052A\u052C\u052E\u0531-\u0556\u10A0-\u10C5\u10C7\u10CD\u13A0-\u13F5\u1E00\u1E02\u1E04\u1E06\u1E08\u1E0A\u1E0C\u1E0E\u1E10\u1E12\u1E14\u1E16\u1E18\u1E1A\u1E1C\u1E1E\u1E20\u1E22\u1E24\u1E26\u1E28\u1E2A\u1E2C\u1E2
E\u1E30\u1E32\u1E34\u1E36\u1E38\u1E3A\u1E3C\u1E3E\u1E40\u1E42\u1E44\u1E46\u1E48\u1E4A\u1E4C\u1E4E\u1E50\u1E52\u1E54\u1E56\u1E58\u1E5A\u1E5C\u1E5E\u1E60\u1E62\u1E64\u1E66\u1E68\u1E6A\u1E6C\u1E6E\u1E70\u1E72\u1E74\u1E76\u1E78\u1E7A\u1E7C\u1E7E\u1E80\u1E82\u1E84\u1E86\u1E88\u1E8A\u1E8C\u1E8E\u1E90\u1E92\u1E94\u1E9E\u1EA0\u1EA2\u1EA4\u1EA6\u1EA8\u1EAA\u1EAC\u1EAE\u1EB0\u1EB2\u1EB4\u1EB6\u1EB8\u1EBA\u1EBC\u1EBE\u1EC0\u1EC2\u1EC4\u1EC6\u1EC8\u1ECA\u1ECC\u1ECE\u1ED0\u1ED2\u1ED4\u1ED6\u1ED8\u1EDA\u1EDC\u1EDE\u1EE0\u1EE2\u1EE4\u1EE6\u1EE8\u1EEA\u1EEC\u1EEE\u1EF0\u1EF2\u1EF4\u1EF6\u1EF8\u1EFA\u1EFC\u1EFE\u1F08-\u1F0F\u1F18-\u1F1D\u1F28-\u1F2F\u1F38-\u1F3F\u1F48-\u1F4D\u1F59\u1F5B\u1F5D\u1F5F\u1F68-\u1F6F\u1FB8-\u1FBB\u1FC8-\u1FCB\u1FD8-\u1FDB\u1FE8-\u1FEC\u1FF8-\u1FFB\u2102\u2107\u210B-\u210D\u2110-\u2112\u2115\u2119-\u211D\u2124\u2126\u2128\u212A-\u212D\u2130-\u2133\u213E\u213F\u2145\u2160-\u216F\u2183\u24B6-\u24CF\u2C00-\u2C2E\u2C60\u2C62-\u2C64\u2C67\u2C69\u2C6B\u2C6D-\u2C70\u2C72\u2C75\u2C7E-\u2C80\u2C82\u2C84\u2C86\u2C88\u2C8A\u2C8C\u2C8E\u2C90\u2C92\u2C94\u2C96\u2C98\u2C9A\u2C9C\u2C9E\u2CA0\u2CA2\u2CA4\u2CA6\u2CA8\u2CAA\u2CAC\u2CAE\u2CB0\u2CB2\u2CB4\u2CB6\u2CB8\u2CBA\u2CBC\u2CBE\u2CC0\u2CC2\u2CC4\u2CC6\u2CC8\u2CCA\u2CCC\u2CCE\u2CD0\u2CD2\u2CD4\u2CD6\u2CD8\u2CDA\u2CDC\u2CDE\u2CE0\u2CE2\u2CEB\u2CED\u2CF2\uA640\uA642\uA644\uA646\uA648\uA64A\uA64C\uA64E\uA650\uA652\uA654\uA656\uA658\uA65A\uA65C\uA65E\uA660\uA662\uA664\uA666\uA668\uA66A\uA66C\uA680\uA682\uA684\uA686\uA688\uA68A\uA68C\uA68E\uA690\uA692\uA694\uA696\uA698\uA69A\uA722\uA724\uA726\uA728\uA72A\uA72C\uA72E\uA732\uA734\uA736\uA738\uA73A\uA73C\uA73E\uA740\uA742\uA744\uA746\uA748\uA74A\uA74C\uA74E\uA750\uA752\uA754\uA756\uA758\uA75A\uA75C\uA75E\uA760\uA762\uA764\uA766\uA768\uA76A\uA76C\uA76E\uA779\uA77B\uA77D\uA77E\uA780\uA782\uA784\uA786\uA78B\uA78D\uA790\uA792\uA796\uA798\uA79A\uA79C\uA79E\uA7A0\uA7A2\uA7A4\uA7A6\uA7A8\uA7AA-\uA7AE\uA7B0-\uA7B4\uA7B6\uFF21-\uFF3A\U00010400-\U00010427\U000104B0-\U000104D3\U00010C80-\
U00010CB2\U000118A0-\U000118BF\U0001D400-\U0001D419\U0001D434-\U0001D44D\U0001D468-\U0001D481\U0001D49C\U0001D49E\U0001D49F\U0001D4A2\U0001D4A5\U0001D4A6\U0001D4A9-\U0001D4AC\U0001D4AE-\U0001D4B5\U0001D4D0-\U0001D4E9\U0001D504\U0001D505\U0001D507-\U0001D50A\U0001D50D-\U0001D514\U0001D516-\U0001D51C\U0001D538\U0001D539\U0001D53B-\U0001D53E\U0001D540-\U0001D544\U0001D546\U0001D54A-\U0001D550\U0001D56C-\U0001D585\U0001D5A0-\U0001D5B9\U0001D5D4-\U0001D5ED\U0001D608-\U0001D621\U0001D63C-\U0001D655\U0001D670-\U0001D689\U0001D6A8-\U0001D6C0\U0001D6E2-\U0001D6FA\U0001D71C-\U0001D734\U0001D756-\U0001D76E\U0001D790-\U0001D7A8\U0001D7CA\U0001E900-\U0001E921\U0001F130-\U0001F149\U0001F150-\U0001F169\U0001F170-\U0001F189]"
p = re.compile(pLu)
if p.match("Żółw"):
    print("Capitalized!")
To make it work in Python 2.x, make sure you add the u prefix to the string literals.
There are other ways to build the Unicode upper-case letter character class in Python from the standard library (the sys and unicodedata modules are useful here), for example:
import sys
# Python 3
pLu = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]))
# Python 2
pLu = u'[{}]'.format(u"".join([unichr(i) for i in xrange(sys.maxunicode) if unichr(i).isupper()]))
However, this range does not match all uppercase letters displayed at the Unicode Utilities: UnicodeSet page for the [:upper:] POSIX character class.
Cf.:
- Python 2.7: len([unichr(i) for i in xrange(sys.maxunicode) if unichr(i).isupper()]) displays 1427
- Python 3.5: len([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]) shows 1751
- Python 3.6: len([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]) shows 1822
- The current Unicode Utilities CLDR page displays 1,822 uppercase letters for the [:upper:] class, and 1,702 for \p{Lu}.
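The gap between the two counts can be reproduced with the standard library alone. The sketch below (Python 3, variable names are illustrative) builds one list from the Lu general category via unicodedata.category and another from str.isupper(), which also accepts characters that carry the Uppercase property without being in Lu. The exact counts depend on the Unicode data bundled with the interpreter, so none are hard-coded.

```python
import re
import sys
import unicodedata

all_chars = [chr(i) for i in range(sys.maxunicode + 1)]

# Characters whose general category is exactly Lu (what \p{Lu} refers to).
lu_chars = [c for c in all_chars if unicodedata.category(c) == 'Lu']

# Characters str.isupper() accepts (closer to [:upper:]).
upper_chars = [c for c in all_chars if c.isupper()]

print(len(lu_chars), len(upper_chars))  # version-dependent counts

# re.escape keeps the class safe even if a special character ever slipped in.
p_lu = re.compile('[{}]'.format(''.join(re.escape(c) for c in lu_chars)))
if p_lu.match("Żółw"):
    print("Capitalized!")
```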
With the PyPI regex module, it is simpler:
import regex
p = regex.compile(r"\p{Lu}") # To support (currently) 1702 uppercase letters
# p = regex.compile(r"[[:upper:]]") # To support (currently) 1822 uppercase letters
if p.match("Żółw"):
    print("Capitalized!")
In Python 2.x you should use:
p = regex.compile(ur"\p{Lu}")
p = regex.compile(ur"[[:upper:]]")
or
p = regex.compile(r"\p{Lu}", regex.U)
p = regex.compile(r"[[:upper:]]", regex.U)
How to write regular expression matching all unicode characters in Python?
You can combine a negative lookahead with \w to match "word characters" excluding digits and underscores:
re.compile(r"(?:(?![\d_])\w)+", re.UNICODE)
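A short usage sketch for that pattern in Python 3, where \w is Unicode-aware by default and the re.UNICODE flag can be omitted. The sample string is made up for illustration.

```python
import re

# \w minus digits and underscore: the lookahead rejects [\d_] before \w consumes.
letters = re.compile(r"(?:(?![\d_])\w)+")

sample = "Виктор_42 na\u00efve x1y"
runs = letters.findall(sample)
print(runs)
```

The underscore and the digits act as separators, so x1y yields two separate one-letter matches.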
Python regex: pattern with re.ASCII can still match unicode characters?
The re.A flag only affects what shorthand character classes match.
In Python 3.x, shorthand character classes are Unicode-aware: the behavior that Python 2.x enables with re.UNICODE/re.U is ON by default. That means:
- \d: Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd])
- \D: Matches any character which is not a decimal digit. (So, all characters other than those in the Nd Unicode category.)
- \w: Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. (So, \w+ matches each word in a "My name is Виктор" string.)
- \W: Matches any character which is not a word character. This is the opposite of \w. (So, it will not match any Unicode letter or digit.)
- \s: Matches Unicode whitespace characters (it will match NEL, hard spaces, etc.)
- \S: Matches any character which is not a whitespace character. (So, no match for NEL, hard space, etc.)
- \b: Word boundaries match locations between Unicode letters/digits and non-letters/digits or start/end of string.
- \B: Non-word boundaries match locations between two Unicode letters/digits, two non-letters/digits or between a Unicode non-letter/digit and start/end of string.
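The default Unicode-aware behavior listed above can be checked with a few calls in Python 3. The sample strings are illustrative; \u096a\u0968 are two Devanagari digits, \x85 is NEL and \xa0 is a hard (no-break) space.

```python
import re

# \w+ picks up Cyrillic as well as Latin words.
words = re.findall(r"\w+", "My name is Виктор")

# \d+ matches ASCII digits and digits from other scripts.
digits = re.findall(r"\d+", "42 \u096a\u0968")

# \s matches NEL and the hard space, not only ASCII whitespace.
spaces = re.findall(r"\s", "a\x85b\xa0c")

print(words, digits, spaces)
```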
If you want to disable this behavior, you use re.A or re.ASCII:
Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).
That means that:
- \d = [0-9]: it no longer matches Hindi, Bengali, etc. digits
- \D = [^0-9]: it matches any character other than an ASCII digit, so it now also matches the non-ASCII (Hindi, Bengali, etc.) digits
- \w = [A-Za-z0-9_]: it only matches ASCII word characters now; Wiktor is matched with \w+, but Виктор is not
- \W = [^A-Za-z0-9_]: it matches any char but ASCII letters/digits/_ (i.e. it matches 你好吗, Виктор, etc.)
- \s = [ \t\n\r\f\v]: matches a regular space, tab, linefeed, carriage return, form feed and a vertical tab
- \S = [^ \t\n\r\f\v]: matches any char other than a space, tab, linefeed, carriage return, form feed and a vertical tab, so it matches all Unicode letters, digits and punctuation and Unicode (non-ASCII) whitespace. E.g., re.sub(r'\S+', r'{\g<0>}', '\xA0 ', flags=re.A) will return '{\xa0} ' (a hard space between the braces); as you see, \S now matches hard spaces.
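A small Python 3 sketch of the points above, with an illustrative sample string. It also shows why a pattern compiled with re.ASCII can still match non-ASCII text: the flag changes only the shorthand classes, so an explicit range such as [\u0430-\u044f] (lowercase Cyrillic а to я) keeps matching.

```python
import re

text = "Wiktor Виктор 42 \u096a\u0968"

# Shorthand classes become ASCII-only under re.A (or the inline (?a)).
ascii_words = re.findall(r"\w+", text, flags=re.A)
inline_words = re.findall(r"(?a)\w+", text)
ascii_digits = re.findall(r"\d+", text, flags=re.A)

# An explicit non-ASCII range is unaffected by the flag.
explicit = re.findall(r"[\u0430-\u044f]+", text, flags=re.A)

# \S now treats the hard space as a non-whitespace character.
wrapped = re.sub(r"\S+", r"{\g<0>}", "\xa0 ", flags=re.A)

print(ascii_words, ascii_digits, explicit, repr(wrapped))
```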
How do I match all unicode lowercase characters in Python with a regular expression?
You can use the regex package if using a third party package is acceptable.
>>> import regex
>>> s = 'ABCabcÆæ'
>>> m = regex.findall(r'[[:lower:]]', s)
>>> m
['a', 'b', 'c', 'æ']
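If a third-party package is not an option, one possible fallback (a sketch, assuming Python 3) is to build the character class for re from str.islower(). The set of characters it covers depends on the Unicode data shipped with the interpreter and may not coincide exactly with what regex reports for [[:lower:]].

```python
import re
import sys

# Collect every code point that str.islower() accepts and put it in one class.
lower_class = '[{}]'.format(''.join(
    re.escape(chr(i)) for i in range(sys.maxunicode + 1) if chr(i).islower()
))
p_lower = re.compile(lower_class)

s = 'ABCabc\u00c6\u00e6'  # 'ABCabcÆæ'
found = p_lower.findall(s)
print(found)
```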
Python - regex - special characters and ñ
Regex with accented characters (diacritics) in Python
The re.UNICODE flag allows you to use word characters \w and word boundaries \b with diacritics (accents and tildes). This is extremely useful to match words in different languages.
- Decode your text from UTF-8 to unicode
- Make sure the pattern and the subject text are passed as unicode to the regex functions.
- The result is a list of unicode strings that can be looped over/mapped to encode back again to UTF-8
- Printing the whole list shows non-ASCII characters escaped, but it's safe to print each string independently.
Code:
# -*- coding: utf-8 -*-
# http://stackoverflow.com/q/32872917/5290909
#python 2.7.9
import re
text = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"
# Decode to unicode
unicode_text = text.decode('utf8')
matches = re.findall(ur'\b\w+\b', unicode_text, re.UNICODE)
# Encode back again to UTF-8
utf8_matches = [ match.encode('utf-8') for match in matches ]
# Print every word
for utf8_word in utf8_matches:
    print utf8_word
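The listing above is Python 2.7. In Python 3 the decode/encode steps and the re.UNICODE flag should no longer be necessary, since str is Unicode and \w and \b are Unicode-aware by default. A minimal sketch under that assumption:

```python
import re

text = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"

# No decoding and no flag needed: the pattern and the text are both str.
words = re.findall(r'\b\w+\b', text)

# Print every word
for word in words:
    print(word)
```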
matching unicode characters in python regular expressions
You need to specify the re.UNICODE flag, and input your string as a Unicode string by using the u prefix:
>>> re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', u'/by_tag/påske/øyfjell.jpg', re.UNICODE).groupdict()
{'tag': u'p\xe5ske', 'filename': u'\xf8yfjell.jpg'}
This is in Python 2; in Python 3 you can leave out the u prefix because all strings are Unicode, and you can leave off the re.UNICODE flag.
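For comparison, a Python 3 sketch of the same match, with no u prefix and no flag:

```python
import re

pattern = r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$'

# \w is Unicode-aware by default in Python 3, so å and ø are word characters.
match = re.match(pattern, '/by_tag/påske/øyfjell.jpg')
groups = match.groupdict()
print(groups)
```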