Python regex matching Unicode properties
Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say \p{Armenian} to match Armenian characters. \p{Ll} or \p{Zs} work too.
Unicode Regex with regex not working in Python
As you can see, Unicode character classes like \p{L} are not available in the re module. However, that doesn't mean you can't do it with the re module, since \p{L} can be replaced with [^\W\d_] together with the UNICODE flag (there are small differences between these two character classes, see the link in comments).
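The "small differences" can be checked with the standard library. The following is a minimal sketch, assuming Python 3 (where str patterns are Unicode-aware by default); the names LETTER_RE and is_letter_category are illustrative only. It compares [^\W\d_] with the Unicode general category L*, which is what \p{L} denotes.

```python
import re
import unicodedata

# Letters-only approximation built from \w (assumes Python 3 str patterns).
LETTER_RE = re.compile(r'[^\W\d_]')

def is_letter_category(ch):
    # True if the code point has a general category starting with "L".
    return unicodedata.category(ch).startswith('L')

for ch in ['x', '\u00e9', '\u0416', '5', '_', '\u00b2', '\u00bd']:
    by_regex = LETTER_RE.fullmatch(ch) is not None
    by_category = is_letter_category(ch)
    print('U+%04X' % ord(ch), by_regex, by_category)
```

Characters such as U+00B2 (superscript two) and U+00BD (one half) are accepted by the re class, because \w in re follows str.isalnum() and \d only removes decimal digits, yet their general category is not a letter category.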
Second point: your approach is not a good one. If I understand correctly, you are trying to extract the last word of each line, yet you chose to remove everything that is not the last word (except the newline) with a replacement. About 52000 steps to extract 10 words from 10 lines of text is not acceptable (and it will crash with more characters). A more efficient way is to find all the last words directly; see this example:
import re
s = '''Ik heb nog nooit een kat gezien zo lélijk!
Het is een minder lelijk dan uw hond.'''
p = re.compile(r'^.*\b(?<!-)(\w+(?:-\w+)*)', re.M | re.U)
words = p.findall(s)
print('\n'.join(words))
Notices:
To obtain the same result with Python 2.7, you only need to add a u before the opening quotes of the string: s = u'''...
If you absolutely want to limit results to letters, avoiding digits and underscores, replace \w with [^\W\d_] in the pattern.
If you use the regex module, the character class \p{IsLatin} may be more appropriate for your use. Whatever module you choose, you can also use a more explicit class with only the needed characters, something like: [A-Za-záéóú...
You can achieve the same with the regex module with this pattern:
p = regex.compile(r'^.*\m(?<!-)(\pL+(?:-\pL+)*)', regex.M | regex.U)
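The letters-only variant mentioned in the notices can be sketched with the re module alone. This assumes Python 3; the third line of the sample text is added here purely for illustration, and the variable names are not from the original answer.

```python
import re

text = '''Ik heb nog nooit een kat gezien zo l\u00e9lijk!
Het is een minder lelijk dan uw hond.
Kamer nummer 42'''

# Original pattern: \w also accepts digits and underscores.
last_word = re.compile(r'^.*\b(?<!-)(\w+(?:-\w+)*)', re.M)
# Letters-only variant: every \w replaced with [^\W\d_].
last_letters = re.compile(r'^.*\b(?<!-)([^\W\d_]+(?:-[^\W\d_]+)*)', re.M)

print(last_word.findall(text))
print(last_letters.findall(text))
```

On the illustrative third line the first pattern returns the number, while the letters-only variant falls back to the last run of letters before it.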
Other ways:
By line with the re module:
p = re.compile(r'[^\w-]+', re.U)
for line in s.split('\n'):
    print(p.split(line+' ')[-2])
With the regex module you can take advantage of the reversed search:
p = regex.compile(r'(?r)\w+(?:-\w+)*\M', regex.U)
for line in s.split('\n'):
    print(p.search(line).group(0))
Match any unicode letter?
Python's re module doesn't support Unicode properties yet. But you can compile your regex using the re.UNICODE flag, and then the character class shorthand \w will match Unicode letters, too.
Since \w will also match digits, you need to then subtract those from your character class, along with the underscore:
[^\W\d_] will match any Unicode letter.
>>> import re
>>> r = re.compile(r'[^\W\d_]', re.U)
>>> r.match('x')
<_sre.SRE_Match object at 0x0000000001DBCF38>
>>> r.match(u'é')
<_sre.SRE_Match object at 0x0000000002253030>
Python and regular expression with Unicode
Are you using Python 2.x or 3.0?
If you're using 2.x, try making the regex string a Unicode string with the 'u' prefix. Since it's a regex, it's also good practice to make it a raw string with the 'r' prefix. Also, putting your entire pattern in parentheses is superfluous.
re.sub(ur'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+', '', ...)
http://docs.python.org/tutorial/introduction.html#unicode-strings
Edit:
It's also good practice to use the re.UNICODE/re.U/(?u) flag for Unicode regexes, but it only affects shorthands like \w or \b. This pattern uses none of them, so it would not be affected.
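For Python 3, the same substitution might look like the sketch below. It assumes Python 3.3 or later, where re itself understands \uXXXX escapes inside a raw pattern; the names ARABIC_MARKS and strip_arabic_marks and the sample string are illustrative, not part of the original question.

```python
import re

# Same character ranges as in the re.sub call above; no u prefix is needed
# in Python 3 because every str is already Unicode.
ARABIC_MARKS = re.compile(r'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+')

def strip_arabic_marks(text):
    # Remove every run of characters in the listed ranges.
    return ARABIC_MARKS.sub('', text)

# Illustrative input: five Arabic letters with four diacritics between them.
vocalized = '\u0645\u064E\u0631\u0652\u062D\u064E\u0628\u064B\u0627'
print(strip_arabic_marks(vocalized))
```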
Matching only a unicode letter in Python re
You can construct a new character class:
[^\W\d_]
instead of \w. Translated into English, it means "Any character that is not a non-alphanumeric character ([^\W] is the same as \w), but that is also not a digit and not an underscore".
Therefore, it will only allow Unicode letters.
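A quick way to see the class in action is to extract runs of letters from a mixed string. This is a small sketch assuming Python 3; the helper name letter_runs and the sample input are made up for the example.

```python
import re

LETTER_RUN = re.compile(r'[^\W\d_]+')

def letter_runs(text):
    # Return every maximal run of letters; digits and underscores split runs.
    return LETTER_RUN.findall(text)

print(letter_runs('na\u00efve_caf\u00e9 42 x9'))
```

Because the underscore and the digits are excluded from the class, they act as separators in the same way that spaces do.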
How do I match all unicode lowercase characters in Python with a regular expression?
You can use the regex package if using a third party package is acceptable.
>>> import regex
>>> s = 'ABCabcÆæ'
>>> m = regex.findall(r'[[:lower:]]', s)
>>> m
['a', 'b', 'c', 'æ']
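If a third-party package is not an option, a similar result can be approximated with the standard library by filtering on the Unicode general category. This swaps the regex for a plain loop; the function name lowercase_letters is illustrative, and the result is not guaranteed to be identical to [[:lower:]] for every code point.

```python
import unicodedata

def lowercase_letters(text):
    # Keep code points whose general category is Ll (Letter, lowercase).
    return [ch for ch in text if unicodedata.category(ch) == 'Ll']

print(lowercase_letters('ABCabc\u00c6\u00e6'))
```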
Python unicode regular expression matching failing with some unicode characters -bug or mistake?
It is a bug in the re module and it is fixed in the regex module:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import unicodedata
import re
import regex # $ pip install regex
word = "किशोरी"
def test(re_):
    assert re_.search("^\\w+$", word, flags=re_.UNICODE)
print([unicodedata.category(cp) for cp in word])
print(" ".join(ch for ch in regex.findall("\\X", word)))
assert all(regex.match("\\w$", c) for c in ["a", "\u093f", "\u0915"])
test(regex)
test(re) # fails
The output shows that there are 6 codepoints in "किशोरी", but only 3 user-perceived characters (extended grapheme clusters). It would be wrong to break a word inside a character. Unicode Text Segmentation says:
Word boundaries, line boundaries, and sentence boundaries should not
occur within a grapheme cluster: in other words, a grapheme cluster
should be an atomic unit with respect to the process of determining
these other boundaries.
(Emphasis here and below is mine.)
A word boundary \b is defined as a transition from \w to \W (or in reverse) in the docs:
Note that formally, \b is defined as the boundary between a \w and a
\W character (or vice versa), or between \w and the beginning/end of
the string, ...
Therefore either all codepoints that form a single character are \w or they are all \W.
In this case "किशोरी" matches ^\w{6}$.
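The code point breakdown behind this can be inspected with the standard library alone. The sketch below assumes Python 3 and spells the word with escapes so the six code points are explicit; the variable names are illustrative.

```python
import re
import unicodedata

# The word from the example above, one escape per code point.
word = '\u0915\u093f\u0936\u094b\u0930\u0940'

categories = [unicodedata.category(cp) for cp in word]
print(len(word), categories)

# Code points that the standard re module does not treat as \w.
rejected = ['U+%04X' % ord(cp) for cp in word if re.fullmatch(r'\w', cp) is None]
print(rejected)
```

The three vowel signs have category Mc (spacing combining mark), and those are the code points that re leaves out of \w, which is why the whole-word match fails there.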
From the docs for \w in Python 2:
If UNICODE is set, this will match the characters [0-9_] plus
whatever is classified as alphanumeric in the Unicode character
properties database.
in Python 3:
Matches Unicode word characters; this includes most characters that
can be part of a word in any language, as well as numbers and the
underscore.
From the regex docs:
Definition of 'word' character (issue #1693050):
The definition of a 'word' character has been expanded for Unicode. It now conforms to the Unicode specification at
http://www.unicode.org/reports/tr29/. This applies to \w, \W, \b and
\B.
According to unicode.org, U+093F (DEVANAGARI VOWEL SIGN I) is alnum and alphabetic, so regex is also correct to consider it \w even if we follow definitions that are not based on word boundaries.
Python regex uppercase unicode word
You need to use a Unicode character property in order to match them. re does not support character properties, but regex does.
>>> regex.findall(ur'\p{Lu}', u'ÜìÑ')
[u'\xdc', u'\xd1']
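When the goal is whole uppercase words rather than single characters, and only the standard library is available, one possible workaround is to tokenize with [^\W\d_]+ and filter with str.isupper(). This is a different technique from \p{Lu}, sketched here under the assumption of Python 3; the function name and sample sentence are invented for the example.

```python
import re

def uppercase_words(text):
    # Split into runs of letters, then keep the runs that are entirely uppercase.
    words = re.findall(r'[^\W\d_]+', text)
    return [w for w in words if w.isupper()]

print(uppercase_words('\u00dcBER den FLUSS, \u00d1and\u00fa und \u00c9COLE'))
```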
matching unicode characters in python regular expressions
You need to specify the re.UNICODE flag, and input your string as a Unicode string by using the u prefix:
>>> re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', u'/by_tag/påske/øyfjell.jpg', re.UNICODE).groupdict()
{'tag': u'p\xe5ske', 'filename': u'\xf8yfjell.jpg'}
This is in Python 2; in Python 3 you must leave out the u prefix because all strings are Unicode, and you can leave off the re.UNICODE flag.
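A Python 3 version of the same match might look like this. The pattern is copied from the example above; the name PATH_RE is illustrative.

```python
import re

# Same pattern as above; in Python 3 str patterns are Unicode-aware by default,
# so neither the u prefix nor re.UNICODE is needed.
PATH_RE = re.compile(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$')

match = PATH_RE.match('/by_tag/p\u00e5ske/\u00f8yfjell.jpg')
print(match.groupdict())
```

Only the named groups appear in groupdict(), so the inner unnamed alternation group does not add an entry.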