Why regular expression for cyrillic letters misses a letter?
You will find ё and Ё among the Cyrillic extension characters (Ё is U+0401, ё is U+0451), not inside the А-Яа-я range (U+0410 to U+044F).
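A quick way to see this is to compare code points. The sketch below assumes Python 3 and the standard re module; the sample word is made up for illustration.

```python
import re

# Code points of the range endpoints and of the letter Ё/ё.
for ch in "АяЁё":
    print(ch, hex(ord(ch)))

sample = "Ёлка"  # illustrative sample word
without_yo = re.findall(r"[А-Яа-я]", sample)
with_yo = re.findall(r"[А-Яа-яЁё]", sample)
print(without_yo)  # Ё is skipped
print(with_yo)     # Ё is included
```

Both Ё and ё fall outside the U+0410 to U+044F run, which is why they have to be listed explicitly.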
How to match Cyrillic characters with a regular expression
It depends on your regex flavor. If it supports Unicode character classes (like .NET, for instance), \p{L} matches a letter character (in any character set).
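Python's built-in re module does not support \p{L}. A common stand-in, sketched below for Python 3 str patterns, is [^\W\d_]. Treat it as an approximation rather than an exact equivalent; the sample text is invented for illustration.

```python
import re

# [^\W\d_] = "a word character that is neither a digit nor an underscore".
# For str patterns in Python 3 this is a common stand-in for \p{L}.
LETTERS = re.compile(r"[^\W\d_]+")

words = LETTERS.findall("Привет, world_42 и ёж")
print(words)  # ['Привет', 'world', 'и', 'ёж']
```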
Regular expression with the cyrillic alphabet
JavaScript (at least the versions most widely used) does not fully support Unicode. That is to say, \w matches only Latin letters, decimal digits, and underscores ([a-zA-Z0-9_]), and \b matches the boundary between a word character and a non-word character.
To find all words in an input string using Latin or Cyrillic, you'd have to do something like this:
.match(/[\wа-я]+/ig); // where а is the Cyrillic а.
Or if you prefer:
.match(/[\w\u0430-\u044f]+/ig);
Of course, this will probably mean you need to tweak your code a little bit, since here it will match all words rather than word boundaries. Note that [а-я] matches any letter in the 'basic Cyrillic alphabet' (U+0430 to U+044F). To match letters outside of this range, you can modify the character set as necessary to include those letters, e.g. to also match the Russian Ё/ё, use [а-яё].
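For comparison, here is a rough Python 3 port of the same idea. It is only a sketch: re.ASCII is used to imitate the Latin-only \w of JavaScript, and both letter cases are spelled out because case-insensitive matching of non-ASCII letters is switched off under re.ASCII. The sample sentence is made up.

```python
import re

# re.ASCII restricts \w to [a-zA-Z0-9_], mimicking the JavaScript behaviour
# described above; the Cyrillic ranges (plus Ё/ё) are then added by hand.
WORD = re.compile(r"[\wа-яёА-ЯЁ]+", re.ASCII)

found = WORD.findall("Hello, мир! Ёжик_1 идёт")
print(found)  # ['Hello', 'мир', 'Ёжик_1', 'идёт']
```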
Also note that your triple-bracket pattern can be simplified to:
.replace(/\[{3}[^\]]*\]{3}/g, '') // ] is escaped inside the class: in JavaScript [^] on its own matches any character
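The same triple-bracket removal can be sketched in Python, with the closing bracket escaped inside the character class. The input string is an invented example.

```python
import re

# \[{3} opens, [^\]]* takes any run of characters other than "]", \]{3} closes.
TRIPLE = re.compile(r"\[{3}[^\]]*\]{3}")

cleaned = TRIPLE.sub("", "до [[[скрыто]]] после")
print(repr(cleaned))  # 'до  после'
```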
Alternatively, you might want to look at the XRegExp project (an open-source project that adds new features to the base JavaScript regular expression engine) and its Unicode addon.
Russian symbols in re (Python)
To use \w+ to match alphanumeric unicode characters you should pass both a unicode pattern and unicode text to re.findall.
In Python2:
Assuming that you are reading bytes (not text) from the file, you should decode the bytes to obtain a unicode:
uni = 'Привет, как дела?'.decode('utf-8')
ur'(?u)\w+' is a raw unicode literal. Even though it is not necessary here, using raw unicode/string literals for regex patterns is generally a good practice -- it allows you to avoid the need for double backslashes before certain characters such as \s.
The regex pattern ur'(?u)\w+' bakes in the Unicode flag, which tells re.findall to make \w dependent on the Unicode character properties database.
import re
uni = 'Привет, как дела?'.decode('utf-8')
print(re.findall(ur'(?u)\w+', uni))
yields a list containing the 3 unicode "words":
[u'\u041f\u0440\u0438\u0432\u0435\u0442',
u'\u043a\u0430\u043a',
u'\u0434\u0435\u043b\u0430']
In Python3:
The general principle is the same, except that what were unicodes in Python2 are now strs in Python3, and there is no longer any attempt at automatic conversion between the two. So, again assuming that you are reading bytes (not text) from the file, you should decode the bytes to obtain a str, and use a str regex pattern. In Python3 a str pattern is Unicode-aware by default, so the (?u) flag is redundant here, though harmless.
import re
uni = b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82, \xd0\xba\xd0\xb0\xd0\xba \xd0\xb4\xd0\xb5\xd0\xbb\xd0\xb0?'.decode('utf-8')
print(re.findall(r'(?u)\w+', uni))
yields
['Привет', 'как', 'дела']
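To illustrate why the decode step matters in Python 3, the sketch below runs \w+ once as a bytes pattern on undecoded data and once as a str pattern on decoded text. The variable names are illustrative, and the encoded string stands in for bytes read from a file.

```python
import re

raw = "Привет, как дела?".encode("utf-8")  # stands in for bytes read from a file

# A bytes pattern only knows ASCII word characters, so nothing matches here.
from_bytes = re.findall(rb"\w+", raw)

# Decoding first gives a str, and a str pattern uses Unicode-aware \w.
from_text = re.findall(r"\w+", raw.decode("utf-8"))

print(from_bytes)  # []
print(from_text)   # ['Привет', 'как', 'дела']
```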
Python - Regex cyrillic mixed with latin
If you need to get all the Russian letters from your string, you need to use the (?i)[А-ЯЁ] regex (do not forget about Ё, as the [А-Я] range does not include it) and use it with re.findall.
Tested in Python 3:
>>> import re
>>> input = "я я я я я w w w w w w\nф ф ф ф ф v v v v v v"
>>> output = re.findall(r'(?i)[А-ЯЁ]', input)
>>> print(output)
['я', 'я', 'я', 'я', 'я', 'ф', 'ф', 'ф', 'ф', 'ф']
To also extract Ukrainian letters, you need to add ЇІЄҐ to the character class:
output = re.findall(r"(?i)[А-ЯЁЇІЄҐ]", input)
An apostrophe is also considered a Ukrainian letter; whether to include it in the pattern depends on your use case.
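As a small variation, adding a + quantifier to the same class pulls out whole runs of Russian and Ukrainian letters instead of single characters. This is only a sketch; the sample words are chosen for illustration.

```python
import re

# (?i) makes the uppercase-only class match lowercase letters as well.
UK_RU_LETTERS = re.compile(r"(?i)[А-ЯЁЇІЄҐ]+")

runs = UK_RU_LETTERS.findall("Їжак, ґанок, єнот and hello")
print(runs)  # ['Їжак', 'ґанок', 'єнот']
```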
RegEx for ukrainian letters. How to separate cyrillic words by capital letter?
[А-Я] is not the Cyrillic alphabet, it's just Russian!
Cyrillic is a writing system. It is used in the alphabets of many languages.
(Like Latin: the charset for West European languages, East European, etc.)
To have both Russian and Ukrainian you'd get [А-ЯҐЄІЇ].
To add Belarusian: [А-ЯҐЄІЇЎ]
And for all Cyrillic chars (including Balkan languages and Old Cyrillic), you can get it through a Unicode subset class, like: \p{IsCyrillic}
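Python's built-in re module has no \p{...} classes, so one way to approximate this there is an explicit code point range. The sketch assumes the class is meant as the U+0400 to U+04FF Cyrillic block; later extension blocks are not covered.

```python
import re

# U+0400..U+04FF is the basic Cyrillic block; extended blocks are not covered.
CYRILLIC_BLOCK = re.compile(r"[\u0400-\u04FF]+")

hits = CYRILLIC_BLOCK.findall("Привет world Ђорђе")
print(hits)  # ['Привет', 'Ђорђе']
```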
To deal with Ukrainian separately: [А-ЩЬЮЯҐЄІЇ] or [А-ЩЬЮЯҐЄІЇа-щьюяґєії] seems to be the full Ukrainian alphabet of 33 letters in each case.
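The letter count can be checked mechanically. The sketch below scans the U+0400 to U+04FF block in Python 3 and collects every character the uppercase class accepts.

```python
import re

UK_UPPER = re.compile(r"[А-ЩЬЮЯҐЄІЇ]")

# Scan the Cyrillic block and keep every character the class accepts.
matched = [chr(cp) for cp in range(0x0400, 0x0500) if UK_UPPER.fullmatch(chr(cp))]
print(len(matched))  # 33
print("".join(matched))
```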
The apostrophe is not a letter, but it is occasionally included in the alphabet because it affects the following vowel.
The apostrophe is part of the word, not a divider. It may be displayed in a few ways:
U+0027 "'" APOSTROPHE
U+0060 "`" GRAVE ACCENT
U+2019 "’" RIGHT SINGLE QUOTATION MARK
U+02BC "ʼ" MODIFIER LETTER APOSTROPHE
and maybe some more.
Yes, it's a bit complicated with apostrophe. There is no common standard for it.
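One possible way to handle this, sketched in Python 3, is to accept an apostrophe only when it sits between letters. The names LETTER, APOSTROPHE and UK_WORD are illustrative, and only the four variants listed above are covered.

```python
import re

LETTER = "А-ЩЬЮЯҐЄІЇа-щьюяґєії"
# The four apostrophe variants: U+0027, U+0060, U+2019, U+02BC.
APOSTROPHE = "\u0027\u0060\u2019\u02bc"

# One or more letters, optionally followed by "apostrophe + letters" groups,
# so an apostrophe counts only when it sits between letters.
UK_WORD = re.compile(f"[{LETTER}]+(?:[{APOSTROPHE}][{LETTER}]+)*")

tokens = UK_WORD.findall("п'ять, м\u2019яч, комп`ютер")
print(tokens)
```

A leading or trailing apostrophe (for example a quotation mark around a word) is left out of the match with this pattern.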
My regular expression code not working
You need to use
$('.form-control').textcomplete([{
  words: ["россия", "сша", "англия", "германия", "google", "git", "github", "php", "microsoft", "jquery"],
  match: /(^|[^\wа-яё])([\wа-яё]{2,})$/i,
  search: function (term, callback) {
    callback($.map(this.words, function (word) {
      return word.indexOf(term) === 0 ? word : null;
    }));
  },
  index: 2, // THIS IS A DEFAULT VALUE
  replace: function (word) {
    return '$1' + word + ' ';
  }
}]);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<link href="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/3.3.2/css/bootstrap.min.css" rel="stylesheet"/>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery.textcomplete/0.2.2/jquery.textcomplete.min.js"></script>
<textarea class="form-control" rows=5 cols=50></textarea>
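To see what the match pattern captures, here is a small Python 3 sketch of the same logic outside the browser. It is not part of the plugin: suggest is a hypothetical helper, re.ASCII imitates the Latin-only \w of JavaScript, and both letter cases are listed explicitly instead of relying on the /i flag.

```python
import re

WORDS = ["россия", "сша", "англия", "германия", "google", "git", "github",
         "php", "microsoft", "jquery"]

# Group 2 is the term being typed: at least two word characters at the very
# end of the input, preceded by the start of the text or a non-word character.
TERM = re.compile(r"(^|[^\wа-яёА-ЯЁ])([\wа-яёА-ЯЁ]{2,})$", re.ASCII)


def suggest(text):
    """Return the words that start with the term at the end of `text`."""
    found = TERM.search(text)
    if found is None:
        return []
    term = found.group(2)
    return [word for word in WORDS if word.startswith(term)]


print(suggest("привет ро"))  # ['россия']
print(suggest("I use gi"))   # ['git', 'github']
print(suggest("x"))          # []
```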
Regex to check user login doesn't work
To match all Russian letters, the [А-Яа-я] range alone is not enough. You also need to add the letters [ёЁ] to the character class, since they are not inside that range.
Besides, an unescaped hyphen between literal symbols inside a character class creates a range, so a literal hyphen is best put at the start or end of the character class.
To add restrictions like there must be at least N of something, you need to use anchored lookaheads.
var nameRegex = /^(?=[^A-ZА-ЯЁ]*[A-ZА-ЯЁ])(?=[^0-9]*[0-9])[-A-Z0-9А-ЯЁ.+~_!?*]+$/i;
Here, ^ anchors the pattern at the start of the string, $ anchors it at the end, (?=[^A-ZА-ЯЁ]*[A-ZА-ЯЁ]) requires at least one letter, and (?=[^0-9]*[0-9]) requires at least one digit.
Note that I removed all lowercase letters since there is a case-insensitive modifier, /i.
To only match the symbols from the list, use a plain + quantifier:
var nameRegex = /^[-A-Z0-9А-ЯЁ.+~_!?*]+$/i;
If you allow an empty string, use * instead of +.
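For readers who want to try the lookahead version outside the browser, here is a rough Python 3 equivalent. It is a sketch under a few assumptions: is_valid_name is a hypothetical helper, and fullmatch() replaces the ^ and $ anchors because $ in Python also matches just before a trailing newline.

```python
import re

# Same three parts as the JavaScript pattern: a letter lookahead, a digit
# lookahead, and the allowed-character class. fullmatch() takes the place of
# the ^ and $ anchors.
NAME = re.compile(
    r"(?=[^A-ZА-ЯЁ]*[A-ZА-ЯЁ])(?=[^0-9]*[0-9])[-A-Z0-9А-ЯЁ.+~_!?*]+",
    re.IGNORECASE,
)


def is_valid_name(name):
    """True if `name` has a letter, a digit, and only allowed characters."""
    return NAME.fullmatch(name) is not None


for candidate in ["Имя1", "ёж9", "abc", "123", "a 1"]:
    print(candidate, is_valid_name(candidate))
```

The first two candidates pass; "abc" lacks a digit, "123" lacks a letter, and "a 1" contains a space that is not in the allowed class.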