How to Include Ё in [А-Я] Regexp Char Interval

Why does a regular expression for Cyrillic letters miss a letter?

Ё (U+0401) and ё (U+0451) lie outside the contiguous range А-Яа-я (U+0410 to U+044F), so [А-Яа-я] does not match them.
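
A quick check of the code points makes this concrete. The sketch below uses Python's built-in re module; the sample string and the variable names are illustrative assumptions, not part of the original answer.

```python
import re

# Ё and ё sit outside the contiguous run А..я (U+0410..U+044F).
print(hex(ord("А")), hex(ord("я")))  # 0x410 0x44f
print(hex(ord("Ё")), hex(ord("ё")))  # 0x401 0x451

sample = "Ёлка и ёж"  # illustrative sample text

# The plain range silently drops both forms of the letter.
without_yo = re.findall(r"[А-Яа-я]+", sample)

# Listing Ё and ё explicitly restores the whole words.
with_yo = re.findall(r"[А-Яа-яЁё]+", sample)

print(without_yo)  # ['лка', 'и', 'ж']
print(with_yo)     # ['Ёлка', 'и', 'ёж']
```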

How to match Cyrillic characters with a regular expression

It depends on your regex flavor. If it supports Unicode character classes (like .NET, for instance), \p{L} matches a letter character (in any character set).
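
Python's built-in re module is one flavor that does not support \p{L} (the third-party regex package does). A commonly used stand-in there is [^\W\d_], read as "a word character that is neither a decimal digit nor an underscore". It approximates \p{L} rather than reproducing it exactly, and the sample text below is only an illustration.

```python
import re

text = "Привет, world 42_ёж"  # illustrative sample text

# [^\W\d_] = word characters minus decimal digits and the underscore.
letters_only = re.findall(r"[^\W\d_]+", text)
print(letters_only)  # ['Привет', 'world', 'ёж']
```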

Regular expression with the cyrillic alphabet

JavaScript (at least the versions most widely used) does not fully support Unicode. That is to say, \w matches only Latin letters, decimal digits, and underscores ([a-zA-Z0-9_]), and \b matches the boundary between a word character and a non-word character.

To find all words in an input string using Latin or Cyrillic, you'd have to do something like this:

.match(/[\wа-я]+/ig); // where а is the Cyrillic а.

Or if you prefer:

.match(/[\w\u0430-\u044f]+/ig);

Of course this will probably mean you need to tweak your code a little bit, since here it will match all words rather than word boundaries. Note that [а-я] matches any letter in the 'basic Cyrillic alphabet' as described here. To match letters outside of this range, you can modify the character set as necessary to include those letters, e.g. to also match the Russian Ё/ё, use [а-яё].

Also note that your triple-bracket pattern can be simplified to:

.replace(/\[{3}[^\]]*\]{3}/g, '') // escape ] inside the class: in JavaScript a bare [^] matches any character

Alternatively, you might want to look at the XRegExp project—which is an open-source project to add new features to the base JavaScript regular expression engine—and its Unicode addon.

Russian symbols in re (Python)

To use \w+ to match alphanumeric unicode characters you should pass both a unicode pattern and unicode text to re.findall.

  • In Python2:

    Assuming that you are reading bytes (not text) from the file, you should decode the bytes to obtain a unicode:

    uni = 'Привет, как дела?'.decode('utf-8')

    ur'(?u)\w+' is a raw unicode literal.
    Even though it is not necessary here, using raw unicode/string literals for
    regex patterns is generally a good practice -- it allows you to avoid the
    need for double backslashes before certain characters such as \s.

    The regex pattern ur'(?u)\w+' bakes-in the Unicode flag which tells re.findall to make \w dependent on the Unicode character properties database.

    import re
    uni = 'Привет, как дела?'.decode('utf-8')
    print(re.findall(ur'(?u)\w+', uni))

    yields a list containing the 3 unicode "words":

    [u'\u041f\u0440\u0438\u0432\u0435\u0442',
    u'\u043a\u0430\u043a',
    u'\u0434\u0435\u043b\u0430']
  • In Python3:

    The general principle is the same, except that what were unicodes in
    Python2 are now strs in Python3, and there is no longer any attempt at
    automatic conversion between the two. So, again assuming that you are
    reading bytes (not text) from the file, you should decode the bytes to
    obtain a str, and use a str regex pattern:

    import re
    uni = b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82, \xd0\xba\xd0\xb0\xd0\xba \xd0\xb4\xd0\xb5\xd0\xbb\xd0\xb0?'.decode('utf-8')
    print(re.findall(r'(?u)\w+', uni))

    yields

    ['Привет', 'как', 'дела']

Python - Regex cyrillic mixed with latin

If you need to get all the Russian letters from your string, you need to use (?i)[А-ЯЁ] regex (do not forget about Ё as [А-Я] range does not include it) and use it with re.findall.

Tested in Python 3:

>>> import re
>>> input = "я я я я я w w w w w w\nф ф ф ф ф v v v v v v"
>>> output = re.findall(r'(?i)[А-ЯЁ]', input)
>>> print(output)
['я', 'я', 'я', 'я', 'я', 'ф', 'ф', 'ф', 'ф', 'ф']

To also extract Ukrainian letters, you need to add ЇІЄҐ to the character class:

 output = re.findall(r"(?i)[А-ЯЁЇІЄҐ]", input)

An apostrophe is also considered a Ukrainian letter; I don't know whether you want to include it in the pattern.

RegEx for ukrainian letters. How to separate cyrillic words by capital letter?

[А-Я] is not the Cyrillic alphabet, it's just Russian (and even then without Ё)!

Cyrillic is a writing system. It is used in the alphabets of many languages.
(Like Latin: the character set for West European languages, East European, etc.)

To have both Russian and Ukrainian, you'd use [А-ЯЁҐЄІЇ].

To add Belarusian: [А-ЯЁҐЄІЇЎ]

And for all Cyrillic characters (including Balkan languages and Old Cyrillic), you can use a Unicode block class such as \p{IsCyrillic}, where the regex flavor supports it.
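
\p{IsCyrillic} is named-block syntax as found in .NET. Where it is unavailable, as in Python's built-in re, the block can be spelled out as an explicit range. The sketch below assumes that the basic Cyrillic block, U+0400 to U+04FF, is what is wanted; the supplementary and extended Cyrillic blocks are not covered, and the sample words are illustrative.

```python
import re

# Unicode "Cyrillic" block: U+0400..U+04FF.
# It contains Ё, ё, Ґ, Є, І, Ї, Ў and many other letters.
CYRILLIC_BLOCK = re.compile(r"[\u0400-\u04FF]+")

text = "Ёж, їжак, вожык and hedgehog"  # illustrative sample text
words = CYRILLIC_BLOCK.findall(text)
print(words)  # ['Ёж', 'їжак', 'вожык']
```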


To deal with Ukrainian separately:

[А-ЩЬЮЯҐЄІЇ] or [А-ЩЬЮЯҐЄІЇа-щьюяґєії] seems to be the full Ukrainian alphabet, 33 letters in each case.

The apostrophe is not a letter, but it is occasionally included in the alphabet because it affects the following vowel.
The apostrophe is part of the word, not a divider. It may be written in several ways:


U+0027 "'" APOSTROPHE
U+0060 "`" GRAVE ACCENT
U+2019 "’" RIGHT SINGLE QUOTATION MARK
U+02BC "ʼ" MODIFIER LETTER APOSTROPHE

and maybe some more.

Yes, the apostrophe makes things a bit complicated. There is no common standard for it.
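
Putting the pieces together, here is a minimal Python sketch that matches Ukrainian words with any of the four apostrophe variants listed above and splits a run-together string at capital letters. The names APOSTROPHES, UPPER, LOWER, WORD and CAPITALISED, as well as the sample strings, are hypothetical and chosen only for illustration.

```python
import re

# Apostrophe variants: U+0027, U+0060, U+2019, U+02BC.
APOSTROPHES = "'`\u2019\u02bc"

UPPER = "А-ЩЬЮЯҐЄІЇ"   # 33 uppercase Ukrainian letters
LOWER = "а-щьюяґєії"   # 33 lowercase Ukrainian letters

# A word: runs of letters, optionally joined by a single apostrophe.
WORD = re.compile(rf"[{UPPER}{LOWER}]+(?:[{APOSTROPHES}][{UPPER}{LOWER}]+)*")

# A capitalised chunk: one uppercase letter, then lowercase letters or apostrophes.
CAPITALISED = re.compile(rf"[{UPPER}][{LOWER}{APOSTROPHES}]*")

words = WORD.findall("м'ята, п’ять і комп`ютер")
chunks = CAPITALISED.findall("Комп'ютерЇжакҐанок")

print(words)   # ["м'ята", 'п’ять', 'і', 'комп`ютер']
print(chunks)  # ["Комп'ютер", 'Їжак', 'Ґанок']
```

Keeping the apostrophe out of the leading position means a stray quotation mark before a word is not swallowed into the match.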

My regular expression code not working

You need to use

$('.form-control').textcomplete([{
    words: ["россия", "сша", "англия", "германия", "google", "git", "github", "php", "microsoft", "jquery"],
    match: /(^|[^\wа-яё])([\wа-яё]{2,})$/i,
    search: function (term, callback) {
        callback($.map(this.words, function (word) {
            return word.indexOf(term) === 0 ? word : null;
        }));
    },
    index: 2, // THIS IS A DEFAULT VALUE
    replace: function (word) {
        return '$1' + word + ' ';
    }
}]);

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<link href="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/3.3.2/css/bootstrap.min.css" rel="stylesheet"/>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery.textcomplete/0.2.2/jquery.textcomplete.min.js"></script>
<textarea class="form-control" rows=5 cols=50></textarea>

Regex to check user login doesn't work

To match all Russian letters, the [А-Яа-я] range alone is not enough. You also need to add ё and Ё to the character class, since they lie outside that range.

Besides, an unescaped hyphen between literal symbols inside a character class creates a range, so a literal hyphen is best placed at the start or end of the character class.

To add restrictions such as "there must be at least N of something", you need to use anchored lookaheads.

var nameRegex = /^(?=[^A-ZА-ЯЁ]*[A-ZА-ЯЁ])(?=[^0-9]*[0-9])[-A-Z0-9А-ЯЁ.+~_!?*]+$/i;

Here, ^ anchors the pattern at the start of the string, $ anchors it at the end, (?=[^A-ZА-ЯЁ]*[A-ZА-ЯЁ]) requires at least one letter, and (?=[^0-9]*[0-9]) requires at least one digit.

Note I removed all lowercase letters since there is a case-insensitive modifier /i.

To only match the symbols from the list, use a plain + quantifier:

var nameRegex = /^[-A-Z0-9А-ЯЁ.+~_!?*]+$/i;

If you allow an empty string, use * instead of +.
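
The same lookahead approach carries over to other flavors. As a sketch in Python, assuming a helper named is_valid_login (a hypothetical name) and a few made-up logins: the pattern is unchanged except that \Z replaces $, because Python's $ also matches just before a trailing newline.

```python
import re

LOGIN_RE = re.compile(
    r"^(?=[^A-ZА-ЯЁ]*[A-ZА-ЯЁ])"   # at least one Latin or Russian letter
    r"(?=[^0-9]*[0-9])"            # at least one digit
    r"[-A-Z0-9А-ЯЁ.+~_!?*]+\Z",    # only characters from the allowed list
    re.IGNORECASE,
)


def is_valid_login(login: str) -> bool:
    """Return True if the login has a letter, a digit, and only allowed characters."""
    return LOGIN_RE.match(login) is not None


print(is_valid_login("Ёжик_2024"))   # True
print(is_valid_login("ёжик"))        # False: no digit
print(is_valid_login("12345"))       # False: no letter
print(is_valid_login("user name1"))  # False: space is not in the allowed list
```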


