Python Regex Matching Unicode Properties

Python regex matching Unicode properties

Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say \p{Armenian} to match Armenian characters. \p{Ll} or \p{Zs} work too.

Unicode Regex with regex not working in Python

As you can see, Unicode character classes like \p{L} are not available in the re module. However, that doesn't mean you can't do it with re: \p{L} can be replaced with [^\W\d_] together with the UNICODE flag (there are small differences between these two character classes).

Second, your approach is not a good one. If I understand correctly, you are trying to extract the last word of each line, yet you chose to remove everything that is not the last word (except the newline) with a replacement. Taking ~52000 steps to extract 10 words from 10 lines of text is not acceptable (and it will crash with more characters). A more efficient way is to find all the last words directly, as in this example:

import re

s = '''Ik heb nog nooit een kat gezien zo lélijk!
Het is een minder lelijk dan uw hond.'''

p = re.compile(r'^.*\b(?<!-)(\w+(?:-\w+)*)', re.M | re.U) 

words = p.findall(s)

print('\n'.join(words))

Notes:

To obtain the same result with Python 2.7, you only need to add a u before the quotes of the string: s = u'''...
If you absolutely want to limit results to letters, excluding digits and underscores, replace \w with [^\W\d_] in the pattern.
If you use the regex module, the character class \p{IsLatin} may be more appropriate for your use. Alternatively, whichever module you choose, use a more explicit class with only the needed characters, something like: [A-Za-záéóú...
You can achieve the same with the regex module with this pattern:

import regex  # third-party module: pip install regex
p = regex.compile(r'^.*\m(?<!-)(\pL+(?:-\pL+)*)', regex.M | regex.U)

Other ways:

By line with the re module:

p = re.compile(r'[^\w-]+', re.U)
for line in s.split('\n'):
    print(p.split(line+' ')[-2])

With the regex module you can take advantage of the reversed search:

p = regex.compile(r'(?r)\w+(?:-\w+)*\M', regex.U)
for line in s.split('\n'):
    print(p.search(line).group(0))

Match any unicode letter?

Python's re module doesn't support Unicode properties yet. But you can compile your regex using the re.UNICODE flag, and then the character class shorthand \w will match Unicode letters, too.

Since \w will also match digits, you need to then subtract those from your character class, along with the underscore:

[^\W\d_]

will match any Unicode letter.

>>> import re
>>> r = re.compile(r'[^\W\d_]', re.U)
>>> r.match('x')
<_sre.SRE_Match object at 0x0000000001DBCF38>
>>> r.match(u'é')
<_sre.SRE_Match object at 0x0000000002253030>
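On Python 3, where str patterns are Unicode-aware by default, the same class can pull whole runs of letters out of mixed text. This is a minimal sketch; the sample string and the names LETTER_RUN and letter_runs are illustrative assumptions, not part of the original question:

```python
import re

# [^\W\d_] = word characters minus digits and the underscore.
# re.UNICODE is already the default for str patterns on Python 3;
# it is passed explicitly only to mirror the example above.
LETTER_RUN = re.compile(r'[^\W\d_]+', re.UNICODE)

sample = 'naïve café_42 Ελλάδα'  # illustrative sample text
letter_runs = LETTER_RUN.findall(sample)
print(letter_runs)
```

The underscore and the digits split the runs, so "café" is returned without the trailing "_42".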

Python and regular expression with Unicode

Are you using python 2.x or 3.0?

If you're using 2.x, try making the regex string a Unicode string, with the 'u' prefix. Since it's a regex, it's also good practice to make it a raw string, with 'r'. Also, putting your entire pattern in parentheses is superfluous.

re.sub(ur'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+', '', ...)

http://docs.python.org/tutorial/introduction.html#unicode-strings

Edit:

It's also good practice to use the re.UNICODE/re.U/(?u) flag for Unicode regexes, but it only affects character class aliases like \w or \b. This pattern uses none of them, so it would not be affected.
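For Python 3, the same substitution might look like the sketch below. The ur'' prefix does not exist there, but a plain raw string works because the re module interprets \uXXXX escapes itself. The input text and the names DIACRITICS, stripped and stripped_ascii are assumptions made for illustration:

```python
import re

# Same code point ranges as the pattern above, written as a raw str pattern.
DIACRITICS = re.compile(r'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+')

# Illustrative input, spelled with escapes so the marks are visible:
# MEEM + DAMMA, HAH + FATHA, MEEM + SHADDA + FATHA, DAL
text = '\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F'

stripped = DIACRITICS.sub('', text)

# Explicit ranges do not rely on aliases such as \w, so switching
# to re.ASCII leaves the result unchanged.
stripped_ascii = re.sub(DIACRITICS.pattern, '', text, flags=re.ASCII)

print(stripped == '\u0645\u062D\u0645\u062F')
print(stripped == stripped_ascii)
```

Both comparisons print True, which illustrates the point about the flag: it only matters when the pattern uses the shorthand classes.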

Matching only a unicode letter in Python re

You can construct a new character class:

[^\W\d_]

instead of \w. Translated into English, it means "Any character that is not a non-alphanumeric character ([^\W] is the same as \w), but that is also not a digit and not an underscore".

Therefore, it will only allow Unicode letters.
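One caveat: in CPython's re, \w also accepts some numeric characters that \d does not, so [^\W\d_] is slightly broader than the Unicode "letter" categories. If an exact match on the general category is needed and only the standard library is available, a check along these lines could be used. The helper name is_letter and the probe characters are illustrative assumptions:

```python
import re
import unicodedata

LETTER_LIKE = re.compile(r'[^\W\d_]')

def is_letter(ch):
    """Return True if ch has a Unicode general category starting with 'L'."""
    return unicodedata.category(ch).startswith('L')

# Illustrative probes: a letter, a decimal digit, and SUPERSCRIPT TWO (category No).
probes = ['é', '5', '\u00b2']
results = {ch: (bool(LETTER_LIKE.fullmatch(ch)), is_letter(ch)) for ch in probes}

for ch, (matches_class, strict_letter) in results.items():
    print(repr(ch), matches_class, strict_letter)
```

The two checks agree on 'é' and '5' but disagree on SUPERSCRIPT TWO, which the character class accepts even though its category is No rather than a letter category.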

How do I match all unicode lowercase characters in Python with a regular expression?

You can use the regex package if using a third party package is acceptable.

>>> import regex
>>> s = 'ABCabcÆæ'
>>> m = regex.findall(r'[[:lower:]]', s)
>>> m
['a', 'b', 'c', 'æ']
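If a third-party package is not acceptable, a rough standard-library alternative is to filter on the Unicode general category instead of using a regular expression. This sketch assumes that "lowercase" means category Ll, which may not coincide exactly with what [[:lower:]] matches in the regex module:

```python
import unicodedata

s = 'ABCabcÆæ'

# Keep only characters whose general category is Ll (Letter, lowercase).
lowercase = [ch for ch in s if unicodedata.category(ch) == 'Ll']
print(lowercase)
```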

Python unicode regular expression matching failing with some unicode characters -bug or mistake?

It is a bug in the re module and it is fixed in the regex module:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import unicodedata
import re
import regex  # $ pip install regex

word = "किशोरी"

def test(re_):
    assert re_.search("^\\w+$", word, flags=re_.UNICODE)

print([unicodedata.category(cp) for cp in word])
print(" ".join(ch for ch in regex.findall("\\X", word)))
assert all(regex.match("\\w$", c) for c in ["a", "\u093f", "\u0915"])

test(regex)
test(re)  # fails

The output shows that there are 6 codepoints in "किशोरी", but only 3 user-perceived characters (extended grapheme clusters). It would be wrong to break a word inside a character. Unicode Text Segmentation says:

Word boundaries, line boundaries, and sentence boundaries should not
occur within a grapheme cluster: in other words, a grapheme cluster
should be an atomic unit with respect to the process of determining
these other boundaries.

(Emphasis here and below is mine.)

A word boundary \b is defined as a transition from \w to \W (or in reverse) in the docs:

Note that formally, \b is defined as the boundary between a \w and a
\W character (or vice versa), or between \w and the beginning/end of
the string, ...

Therefore either all codepoints that form a single character are \w or they are all \W.
In this case "किशोरी" matches ^\w{6}$.
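To see why the vowel signs matter, the general categories of the six code points can be inspected with the standard library alone. The helper is_word_with_marks below is a hypothetical approximation for cases where the regex module is unavailable; it is not how either module defines \w:

```python
import unicodedata

# The word from the example above, spelled out as code points.
word = '\u0915\u093f\u0936\u094b\u0930\u0940'

categories = [unicodedata.category(cp) for cp in word]
print(categories)

def is_word_with_marks(text):
    """Hypothetical helper: letters, marks, numbers and '_' count as word characters."""
    return bool(text) and all(
        ch == '_' or unicodedata.category(ch)[0] in 'LMN' for ch in text
    )

print(is_word_with_marks(word))
```

Every second code point is a spacing combining mark (Mc) rather than a letter (Lo), so a definition of \w that admits only letters and digits will stop inside the word.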

From the docs for \w in Python 2:

If UNICODE is set, this will match the characters [0-9_] plus
whatever is classified as alphanumeric in the Unicode character
properties database.

in Python 3:

Matches Unicode word characters; this includes most characters that
can be part of a word in any language, as well as numbers and the
underscore.

From regex docs:

Definition of 'word' character (issue #1693050):

The definition of a 'word' character has been expanded for Unicode. It now conforms to the Unicode specification at
http://www.unicode.org/reports/tr29/. This applies to \w, \W, \b and
\B.

According to unicode.org, U+093F (DEVANAGARI VOWEL SIGN I) is alnum and alphabetic, so regex is also correct to consider it \w, even if we follow definitions that are not based on word boundaries.

Python regex uppercase unicode word

You need to use a Unicode character property in order to match them. re does not support character properties, but regex does.

>>> regex.findall(ur'\p{Lu}', u'ÜìÑ')
[u'\xdc', u'\xd1']
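Where installing regex is not an option, Python 3's str.isupper() can approximate this with the standard library. The sketch assumes the goal is either single uppercase letters or whole words written in uppercase, which the question title suggests but does not confirm; the sample text is illustrative:

```python
import re

# Single uppercase letters, mirroring the \p{Lu} example above.
upper_letters = [ch for ch in 'ÜìÑ' if ch.isupper()]
print(upper_letters)

# Whole words in which every cased character is uppercase.
text = 'ÜBER größe ÑANDÚ abc'  # illustrative sample
upper_words = [w for w in re.findall(r'\w+', text) if w.isupper()]
print(upper_words)
```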

matching unicode characters in python regular expressions

You need to specify the re.UNICODE flag, and input your string as a Unicode string by using the u prefix:

>>> re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', u'/by_tag/påske/øyfjell.jpg', re.UNICODE).groupdict()
{'tag': u'p\xe5ske', 'filename': u'\xf8yfjell.jpg'}

This is in Python 2; in Python 3 you can leave out the u prefix because all strings are Unicode, and you can leave off the re.UNICODE flag because it is the default for str patterns.
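A Python 3 version of the same match might look like this. The names PATTERN, path, groups and ascii_match are illustrative; the re.ASCII call is included only to show what happens when Unicode matching is switched off:

```python
import re

PATTERN = r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$'
path = '/by_tag/påske/øyfjell.jpg'

# No u prefix and no re.UNICODE flag needed on Python 3.
groups = re.match(PATTERN, path).groupdict()
print(groups)

# With re.ASCII, \w shrinks to [a-zA-Z0-9_], so the same path no longer matches.
ascii_match = re.match(PATTERN, path, re.ASCII)
print(ascii_match)
```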
