Regular Expression to Check If the String Has Chinese Chars

PHP - regular expression to check if the string has Chinese chars

You could use a Unicode character class (see http://www.regular-expressions.info/unicode.html):

preg_match("/\p{Han}+/u", $utf8_str);

This just checks for the presence of at least one Chinese character. If you want to match the complete string, anchor the pattern, e.g. /^\p{Han}+$/u.
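The difference between "contains at least one" and "consists entirely of" can be sketched in Python. This is only an illustration under assumptions: the standard re module has no \p{...} classes, so \p{Han} is approximated here by two ideograph blocks, and the names HAN, contains_han, and is_all_han are made up for the example.

```python
import re

# Assumption: approximate \p{Han} with CJK Unified Ideographs Extension A
# (U+3400-U+4DBF) and CJK Unified Ideographs (U+4E00-U+9FFF).
HAN = r'[\u3400-\u4DBF\u4E00-\u9FFF]'

def contains_han(text):
    """True if at least one character falls in the ranges above."""
    return re.search(HAN, text) is not None

def is_all_han(text):
    """True only if the whole, non-empty string is made of such characters."""
    return re.fullmatch(HAN + '+', text) is not None

print(contains_han('abc中文'))  # True
print(is_all_han('abc中文'))    # False
print(is_all_han('中文'))       # True
```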

How to detect Chinese character with punctuation in regex?

I suggest taking a look at Zhon, a Python library that provides constants commonly used in Chinese text processing.

Luckily, hanzi.py contains a definition of a regex that should pretty much suit your needs:

#: A regular expression pattern for a Chinese sentence. A sentence is defined
#: as a series of characters and non-stop punctuation marks followed by a stop
#: and zero or more container-closing punctuation marks (e.g. apostrophe or brackets).

sent = sentence = '[{characters}{radicals}{non_stops}]*{sentence_end}'.format(
    characters=characters, radicals=radicals, non_stops=non_stops,
    sentence_end=_sentence_end)

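The definition above results in the following regex (written here with \x{...} code point escapes for the ideograph ranges and the full-width punctuation):

[\x{3007}\x{4E00}-\x{9FFF}\x{3400}-\x{4DBF}\x{F900}-\x{FAFF}\x{20000}-\x{2A6DF}\x{2A700}-\x{2B73F}\x{2B740}-\x{2B81F}\x{2F800}-\x{2FA1F}\x{2F00}-\x{2FD5}\x{2E80}-\x{2EF3}\x{FF02}\x{FF03}\x{FF04}\x{FF05}\x{FF06}\x{FF07}\x{FF08}\x{FF09}\x{FF0A}\x{FF0B}\x{FF0C}\x{FF0D}\x{FF0F}\x{FF1A}\x{FF1B}\x{FF1C}\x{FF1D}\x{FF1E}\x{FF20}\x{FF3B}\x{FF3C}\x{FF3D}\x{FF3E}\x{FF3F}\x{FF40}\x{FF5B}\x{FF5C}\x{FF5D}\x{FF5E}\x{FF5F}\x{FF60}\x{FF62}\x{FF63}\x{FF64}\x{3000}\x{3001}\x{3003}\x{3008}\x{3009}\x{300A}\x{300B}\x{300C}\x{300D}\x{300E}\x{300F}\x{3010}\x{3011}\x{3014}\x{3015}\x{3016}\x{3017}\x{3018}\x{3019}\x{301A}\x{301B}\x{301C}\x{301D}\x{301E}\x{301F}\x{3030}\x{303E}\x{303F}\x{2013}\x{2014}\x{2018}\x{2019}\x{201B}\x{201C}\x{201D}\x{201E}\x{201F}\x{2026}\x{2027}\x{FE4F}\x{FE51}\x{FE54}\x{00B7}]*[\x{FF01}\x{FF1F}\x{FF61}\x{3002}][」﹂”』’》)\]}〕〗〙〛〉】]*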

Code Example:

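// The /u modifier makes PCRE treat the pattern and the subject as UTF-8.
preg_match_all('/[\x{3007}\x{4E00}-\x{9FFF}\x{3400}-\x{4DBF}\x{F900}-\x{FAFF}\x{20000}-\x{2A6DF}\x{2A700}-\x{2B73F}\x{2B740}-\x{2B81F}\x{2F800}-\x{2FA1F}\x{2F00}-\x{2FD5}\x{2E80}-\x{2EF3}\x{FF02}\x{FF03}\x{FF04}\x{FF05}\x{FF06}\x{FF07}\x{FF08}\x{FF09}\x{FF0A}\x{FF0B}\x{FF0C}\x{FF0D}\x{FF0F}\x{FF1A}\x{FF1B}\x{FF1C}\x{FF1D}\x{FF1E}\x{FF20}\x{FF3B}\x{FF3C}\x{FF3D}\x{FF3E}\x{FF3F}\x{FF40}\x{FF5B}\x{FF5C}\x{FF5D}\x{FF5E}\x{FF5F}\x{FF60}\x{FF62}\x{FF63}\x{FF64}\x{3000}\x{3001}\x{3003}\x{3008}\x{3009}\x{300A}\x{300B}\x{300C}\x{300D}\x{300E}\x{300F}\x{3010}\x{3011}\x{3014}\x{3015}\x{3016}\x{3017}\x{3018}\x{3019}\x{301A}\x{301B}\x{301C}\x{301D}\x{301E}\x{301F}\x{3030}\x{303E}\x{303F}\x{2013}\x{2014}\x{2018}\x{2019}\x{201B}\x{201C}\x{201D}\x{201E}\x{201F}\x{2026}\x{2027}\x{FE4F}\x{FE51}\x{FE54}\x{00B7}]*[\x{FF01}\x{FF1F}\x{FF61}\x{3002}][」﹂”』’》)\]}〕〗〙〛〉】]*/u', "我的中文不好。我是意大利人。你知道吗?", $matches, PREG_SET_ORDER, 0);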
var_dump($matches);

If you prefer to use character code ranges for the pertinent CJK ideograph Unicode blocks, refer to the Python source I have linked, or take them from the JavaScript sample below:

// The "u" flag is required for the \u{...} escapes of code points above U+FFFF.
const regex = /[\u3007\u4E00-\u9FFF\u3400-\u4DBF\uF900-\uFAFF\u{20000}-\u{2A6DF}\u{2A700}-\u{2B73F}\u{2B740}-\u{2B81F}\u{2F800}-\u{2FA1F}\u2F00-\u2FD5\u2E80-\u2EF3\uFF02\uFF03\uFF04\uFF05\uFF06\uFF07\uFF08\uFF09\uFF0A\uFF0B\uFF0C\uFF0D\uFF0F\uFF1A\uFF1B\uFF1C\uFF1D\uFF1E\uFF20\uFF3B\uFF3C\uFF3D\uFF3E\uFF3F\uFF40\uFF5B\uFF5C\uFF5D\uFF5E\uFF5F\uFF60\uFF62\uFF63\uFF64\u3000\u3001\u3003\u3008\u3009\u300A\u300B\u300C\u300D\u300E\u300F\u3010\u3011\u3014\u3015\u3016\u3017\u3018\u3019\u301A\u301B\u301C\u301D\u301E\u301F\u3030\u303E\u303F\u2013\u2014\u2018\u2019\u201B\u201C\u201D\u201E\u201F\u2026\u2027\uFE4F\uFE51\uFE54\u00B7]*[\uFF01\uFF1F\uFF61\u3002][」﹂”』’》)\]}〕〗〙〛〉】]*/gmu;
const str = `我的中文不好。我是意大利人。你知道吗?`;
let m;

while ((m = regex.exec(str)) !== null) {
  // This is necessary to avoid infinite loops with zero-width matches
  if (m.index === regex.lastIndex) {
    regex.lastIndex++;
  }

  // The result can be accessed through the `m`-variable.
  m.forEach((match, groupIndex) => {
    console.log(`Found match, group ${groupIndex}: ${match}`);
  });
}

Check whether a string contains Japanese/Chinese characters

The ranges of Unicode characters which are routinely used for Chinese and Japanese text are:

  • U+3040 - U+30FF: hiragana and katakana (Japanese only)
  • U+3400 - U+4DBF: CJK unified ideographs extension A (Chinese, Japanese, and Korean)
  • U+4E00 - U+9FFF: CJK unified ideographs (Chinese, Japanese, and Korean)
  • U+F900 - U+FAFF: CJK compatibility ideographs (Chinese, Japanese, and Korean)
  • U+FF66 - U+FF9F: half-width katakana (Japanese only)

As a regular expression, this would be expressed as:

/[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\uff66-\uff9f]/

This does not include every character which will appear in Chinese and Japanese text, but any significant piece of typical Chinese or Japanese text will be mostly made up of characters from these ranges.

Note that this regular expression will also match on Korean text that contains hanja. This is an unavoidable result of Han unification.
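As a rough illustration, the same five ranges can be put into a Python 3 pattern. The names CJ_PATTERN and contains_cj are invented for this sketch; the last two calls show the Han unification caveat: Hangul is not matched, but Korean text written in hanja is.

```python
import re

# The five ranges listed above, written as one character class.
CJ_PATTERN = re.compile(
    '['
    '\u3040-\u30ff'  # hiragana and katakana
    '\u3400-\u4dbf'  # CJK unified ideographs extension A
    '\u4e00-\u9fff'  # CJK unified ideographs
    '\uf900-\ufaff'  # CJK compatibility ideographs
    '\uff66-\uff9f'  # half-width katakana
    ']'
)

def contains_cj(text):
    """True if text has at least one character from the ranges above."""
    return CJ_PATTERN.search(text) is not None

print(contains_cj('こんにちは'))  # True
print(contains_cj('hello'))       # False
print(contains_cj('한글'))        # False (Hangul syllables are outside the ranges)
print(contains_cj('大韓民國'))    # True (hanja share the unified ideograph block)
```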

Python: Check if a string contains chinese character?

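The pattern should be a unicode string as well, hence the ur prefix (this is a Python 2 session):

>>> import re
>>> ipath = u"./data/NCDC/上海/虹桥/9705626661750dat.txt"
>>> re.findall(ur'[\u4e00-\u9fff]+', ipath)
[u'\u4e0a\u6d77', u'\u8679\u6865']

In Python 3 every str is Unicode, so a plain r'[\u4e00-\u9fff]+' works and the result prints as ['上海', '虹桥'].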

Check if string contains CJK (chinese) characters

As discussed here, in Java 7 (i.e. regex compiler meets requirement RL1.2 Properties from UTS#18 Unicode Regular Expressions), you can use the following regex to match a Chinese (well, CJK) character:

\p{script=Han}

which can be abbreviated to simply

\p{Han}

How to compare string with chinese characters using regex expression in sql server

I believe this should work to catch all Chinese characters, all basic Latin characters with no accents, and all western numerals. However, if I missed some, you can adjust it yourself - look at any Unicode table, and find the borders of the character blocks you want. Just like you can write A-Z to catch all characters from 65 to 90, you can also write ⺀-⿕ to catch all characters from 11904 to 12245. Ultimately it is up to you to decide what is a "special character". E.g. is Ⓐ ('CIRCLED LATIN CAPITAL LETTER A' (U+24B6)) a letter or a special character?

LIKE '%[a-zA-Z0-9⺀-⿕㐀-䶵一-俿]%';
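To find or double-check the borders of a range, it can help to print the code points of its boundary characters. The following Python sketch does that for the ranges in the LIKE pattern above and mimics the range test. RANGES and in_ranges are hypothetical names, and the comparison is by plain code point, which is an assumption here; SQL Server may compare according to the column's collation instead.

```python
# Boundary characters copied from the LIKE pattern above.
RANGES = [('a', 'z'), ('A', 'Z'), ('0', '9'),
          ('⺀', '⿕'), ('㐀', '䶵'), ('一', '俿')]

def in_ranges(ch):
    """True if ch falls inside any of the (first, last) ranges by code point."""
    return any(first <= ch <= last for first, last in RANGES)

# Print the numeric borders of the three CJK ranges.
for first, last in RANGES[3:]:
    print(f'{first}-{last}: {ord(first)} to {ord(last)}')

print(in_ranges('中'))  # True
print(in_ranges('Ⓐ'))  # False
```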

How can I check if a string contains Chinese in Swift?

This answer to How to determine if a character is a Chinese character can also easily be translated from Ruby to Swift (now updated for Swift 3):

import Foundation

extension String {
    var containsChineseCharacters: Bool {
        return self.range(of: "\\p{Han}", options: .regularExpression) != nil
    }
}

if myString.containsChineseCharacters {
    print("Contains Chinese")
}

In a regular expression, "\p{Han}" matches all characters with the
"Han" Unicode property, which – as I understand it – are the characters
from the CJK languages.

Javascript unicode string, chinese character but no punctuation

You can see the relevant blocks at http://www.unicode.org/reports/tr38/#BlockListing or http://www.unicode.org/charts/ .

If you are excluding compatibility characters (ones which should no longer be used), as well as strokes, radicals, and Enclosed CJK Letters and Months, the following ought to cover it (I've added the individual JavaScript equivalent expressions afterward):

  • CJK Unified Ideographs (4E00-9FCC) [\u4E00-\u9FCC]
  • CJK Unified Ideographs Extension A (3400-4DB5) [\u3400-\u4DB5]
  • CJK Unified Ideographs Extension B (20000-2A6D6) [\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6]
  • CJK Unified Ideographs Extension C (2A700-2B734) \ud869[\udf00-\udfff]|[\ud86a-\ud86c][\udc00-\udfff]|\ud86d[\udc00-\udf34]
  • CJK Unified Ideographs Extension D (2B740-2B81D) \ud86d[\udf40-\udfff]|\ud86e[\udc00-\udc1d]
  • 12 characters within the CJK Compatibility Ideographs (F900-FA6D/FA70-FAD9) but which are actually CJK unified ideographs [\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]

...so, a regex to grab the Chinese characters would be:

/[\u4E00-\u9FCC\u3400-\u4DB5\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]|[\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6\udf00-\udfff]|[\ud86a-\ud86c][\udc00-\udfff]|\ud86d[\udc00-\udf34\udf40-\udfff]|\ud86e[\udc00-\udc1d]/

Because there are so many CJK (Chinese-Japanese-Korean) characters, Unicode was expanded beyond the "Basic Multilingual Plane"; characters outside it are called "astral" characters. The CJK Unified Ideographs extensions B-D are astral, so their ranges are more complicated: in UTF-16 systems like JavaScript they have to be encoded using surrogate pairs. A surrogate pair consists of a high surrogate and a low surrogate, neither of which is valid by itself, but which together form a single character (despite having a string length of 2).
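The surrogate values in the expressions above can be reproduced with the standard UTF-16 arithmetic. A small Python sketch follows; the helper name to_surrogates is made up for this illustration.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError('not a supplementary-plane code point')
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

# First and last code points of Extension B as listed above.
for cp in (0x20000, 0x2A6D6):
    high, low = to_surrogates(cp)
    print('U+%X -> \\u%04x \\u%04x' % (cp, high, low))
# U+20000 -> \ud840 \udc00
# U+2A6D6 -> \ud869 \uded6
```

These are the endpoints used in [\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6].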

While it would probably be easier for replacement purposes to express this as the non-Chinese characters (to replace them with the empty string), I provided the expression for the Chinese characters instead so that it would be easier to track in case you needed to add or remove from the blocks.

Update September 2017

As of ES6, one may express the regular expressions without resorting to surrogates by using the "u" flag along with the code point inside of the new escape sequence with brackets, e.g., /^[\u{20000}-\u{2A6D6}]*$/u for "CJK Unified Ideographs Extension B".

Note that Unicode too has progressed to include "CJK Unified Ideographs Extension E" ([\u{2B820}-\u{2CEAF}]) and "CJK Unified Ideographs Extension F" ([\u{2CEB0}-\u{2EBEF}]).
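For comparison, a Python 3 str is a sequence of code points, so the same ranges can be written without surrogates there as well, using \UXXXXXXXX escapes. This is a sketch only; EXT_B and EXT_E_F are names chosen for the example, and the ranges are the ones quoted above.

```python
import re

# Python's re module accepts \UXXXXXXXX escapes inside a pattern.
EXT_B = re.compile(r'[\U00020000-\U0002A6D6]')
EXT_E_F = re.compile(r'[\U0002B820-\U0002CEAF\U0002CEB0-\U0002EBEF]')

sample = '\U00020000'  # first code point of Extension B

print(len(sample))                    # 1 (one code point, no surrogate pair)
print(bool(EXT_B.fullmatch(sample)))  # True
print(bool(EXT_E_F.search(sample)))   # False
```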

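For ES2018, it appears that Unicode property escapes will be able to simplify things even further. Per http://2ality.com/2017/07/regexp-unicode-property-escapes.html , it looked like we would be able to write the expression below. Note, however, that ES2018 as finalized accepts only General_Category, Script, Script_Extensions, and binary properties inside \p{...}; the Block property is not supported, so engines that follow the specification reject the Block-based expressions: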

/^(\p{Block=CJK Unified Ideographs}|\p{Block=CJK Unified Ideographs Extension A}|\p{Block=CJK Unified Ideographs Extension B}|\p{Block=CJK Unified Ideographs Extension C}|\p{Block=CJK Unified Ideographs Extension D}|\p{Block=CJK Unified Ideographs Extension E}|\p{Block=CJK Unified Ideographs Extension F}|[\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29])+$/u

The shorter aliases from http://unicode.org/Public/UNIDATA/PropertyAliases.txt and http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt can also be used for these blocks (apparently underscores may be changed to spaces, and the casing changed, if desired), so you could shorten this to the following:

/^(\p{Blk=CJK}|\p{Blk=CJK_Ext_A}|\p{Blk=CJK_Ext_B}|\p{Blk=CJK_Ext_C}|\p{Blk=CJK_Ext_D}|\p{Blk=CJK_Ext_E}|\p{Blk=CJK_Ext_F}|[\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29])+$/u

And if we wanted to improve readability, we could document the falsely labeled compatibility characters using named capture groups (see http://2ality.com/2017/05/regexp-named-capture-groups.html ):

/^(\p{Blk=CJK}|\p{Blk=CJK_Ext_A}|\p{Blk=CJK_Ext_B}|\p{Blk=CJK_Ext_C}|\p{Blk=CJK_Ext_D}|\p{Blk=CJK_Ext_E}|\p{Blk=CJK_Ext_F}|(?<CJKFalseCompatibilityUnifieds>[\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]))+$/u

Per http://unicode.org/reports/tr44/#Unified_Ideograph , the "Unified_Ideograph" property (alias "UIdeo") appears to cover all of our unified ideographs while excluding symbols/punctuation and compatibility characters. So if you don't need to pick and choose out of the above, the following may be all you need (JavaScript writes binary properties by name alone, without "=yes"):

/^\p{Unified_Ideograph}*$/u

or in shorthand:

/^\p{UIdeo}*$/u
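Python's standard re module has no property escapes, so a rough stand-in for the Unified_Ideograph check is to look at character names via unicodedata. This is an approximation under stated assumptions, not an equivalent: it depends on the Unicode version bundled with the interpreter, and it misses the 12 unified ideographs that live in the compatibility block, because their names start with "CJK COMPATIBILITY IDEOGRAPH". The function names are hypothetical.

```python
import unicodedata

def is_unified_ideograph(ch):
    """Approximate Unified_Ideograph by the character's Unicode name."""
    return unicodedata.name(ch, '').startswith('CJK UNIFIED IDEOGRAPH-')

def all_unified_ideographs(text):
    """Like the anchored pattern with *, this is True for the empty string."""
    return all(is_unified_ideograph(ch) for ch in text)

print(all_unified_ideographs('中文'))      # True
print(all_unified_ideographs('中文。'))    # False (ideographic full stop)
print(is_unified_ideograph('\U00020000'))  # True (Extension B)
print(is_unified_ideograph('\uF900'))      # False (compatibility ideograph)
```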

How to use regular expression to validate Chinese input?

From What's the complete range for Chinese characters in Unicode?, the CJK unicode ranges are:

Block                                   Range       Comment
--------------------------------------- ----------- ----------------------------------------------------
CJK Unified Ideographs                  4E00-9FFF   Common
CJK Unified Ideographs Extension A      3400-4DBF   Rare
CJK Unified Ideographs Extension B      20000-2A6DF Rare, historic
CJK Unified Ideographs Extension C      2A700-2B73F Rare, historic
CJK Unified Ideographs Extension D      2B740-2B81F Uncommon, some in current use
CJK Unified Ideographs Extension E      2B820-2CEAF Rare, historic
CJK Compatibility Ideographs            F900-FAFF   Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement 2F800-2FA1F Unifiable variants
CJK Symbols and Punctuation             3000-303F

You probably want to allow code points from the Unicode blocks CJK Unified Ideographs and CJK Unified Ideographs Extension A.

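This regex matches a string of 0 to 9 characters, each of which is a space, a Latin letter (A-Z or a-z), a character from the CJK Symbols and Punctuation block (U+3000-U+303F, which includes the ideographic space U+3000), or a code point in those 2 CJK blocks.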

/^[ A-Za-z\u3000-\u303F\u3400-\u4DBF\u4E00-\u9FFF]{0,9}$/
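The same validation can be sketched in Python 3 for checking input on the server side as well. VALID and is_valid are names invented for this example; re.fullmatch plays the role of the ^ and $ anchors.

```python
import re

# Same character class and length limit as the JavaScript regex above.
VALID = re.compile(r'[ A-Za-z\u3000-\u303F\u3400-\u4DBF\u4E00-\u9FFF]{0,9}')

def is_valid(text):
    """True if text is 0 to 9 characters long and uses only allowed characters."""
    return VALID.fullmatch(text) is not None

print(is_valid('你的a你的a你的a'))    # True (9 characters)
print(is_valid('你的a你的a你的a你'))  # False (10 characters)
print(is_valid('你的1'))              # False (digits are not allowed)
```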

The ideographs are listed in:

  • part 1
  • part 2
  • part 3
  • part 4
  • Extension A

However, you may as well add more blocks.

Code:

function has10OrLessCJK(text) {
  return /^[ A-Za-z\u3000-\u303F\u3400-\u4DBF\u4E00-\u9FFF]{0,9}$/.test(text);
}

function checkValidation(value) {
  var valid = document.getElementById("valid");
  if (has10OrLessCJK(value)) {
    valid.innerText = "Valid";
  } else {
    valid.innerText = "Invalid";
  }
}

<input type="text"
       style="width:100%"
       oninput="checkValidation(this.value)"
       value="你的a你的a你的a">

<div id="valid">
  Valid
</div>

