How to Detect Chinese Character With Punctuation in Regex

I suggest informing yourself by taking a look at Zhon, a Python library that provides constants commonly used in Chinese text processing.

Luckily, hanzi.py contains a definition of a regex that should pretty much suit your needs:

#: A regular expression pattern for a Chinese sentence. A sentence is defined
#: as a series of characters and non-stop punctuation marks followed by a stop
#: and zero or more container-closing punctuation marks (e.g. apostrophe or brackets).

sent = sentence = '[{characters}{radicals}{non_stops}]*{sentence_end}'.format(
    characters=characters, radicals=radicals, non_stops=non_stops,
    sentence_end=_sentence_end)

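The definition above expands to the following regex (the four supplementary-plane ranges are written as \U escapes; the punctuation is the fullwidth and CJK forms, not ASCII):

[〇一-鿿㐀-䶿豈-﫿\U00020000-\U0002A6DF\U0002A700-\U0002B73F\U0002B740-\U0002B81F\U0002F800-\U0002FA1F⼀-⿕⺀-⻳＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､　、〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏﹑﹔·]*[！？｡。][」﹂”』’》）］｝〕〗〙〛〉】]*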

Code Example:

preg_match_all('/[〇一-鿿㐀-䶿豈-﫿\x{20000}-\x{2A6DF}\x{2A700}-\x{2B73F}\x{2B740}-\x{2B81F}\x{2F800}-\x{2FA1F}⼀-⿕⺀-⻳＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､　、〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏﹑﹔·]*[！？｡。][」﹂”』’》）］｝〕〗〙〛〉】]*/u', "我的中文不好。我是意大利人。你知道吗？", $matches, PREG_SET_ORDER, 0);
var_dump($matches);
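For comparison, here is a minimal Python sketch of the same sentence pattern built with the standard re module rather than Zhon itself. The constant names and the split_sentences helper are illustrative assumptions, and NON_STOPS is deliberately abbreviated to a handful of marks, so treat it as a starting point rather than a replacement for the library's definition.

```python
import re

# Code point ranges taken from the pattern above; names are illustrative.
CHARACTERS = (
    "\u3007\u4E00-\u9FFF\u3400-\u4DBF\uF900-\uFAFF"
    "\U00020000-\U0002A6DF\U0002A700-\U0002B73F"
    "\U0002B740-\U0002B81F\U0002F800-\U0002FA1F"
)
RADICALS = "\u2F00-\u2FD5\u2E80-\u2EF3"
# Abbreviated subset of the non-stop punctuation (assumption for brevity).
NON_STOPS = "\uFF0C\uFF1A\uFF1B\u3001\u300A\u300B\u201C\u201D\u2018\u2019\u2026"
# Fullwidth ! and ?, halfwidth and ideographic full stops.
STOPS = "\uFF01\uFF1F\uFF61\u3002"
# Container-closing marks that may follow a stop.
CLOSERS = (
    "\u300D\uFE42\u201D\u300F\u2019\u300B\uFF09\uFF3D\uFF5D"
    "\u3015\u3017\u3019\u301B\u3009\u3011"
)

SENTENCE = re.compile(
    f"[{CHARACTERS}{RADICALS}{NON_STOPS}]*[{STOPS}][{CLOSERS}]*"
)


def split_sentences(text):
    """Return every substring that looks like a complete Chinese sentence."""
    return SENTENCE.findall(text)
```

Note that text without a trailing stop mark is not returned at all, which mirrors the behaviour of the original pattern.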

If you prefer character code ranges for the relevant CJK ideograph Unicode blocks, refer to the Python source (hanzi.py) or take them from the JavaScript sample below:

const regex = /[\u3007\u4E00-\u9FFF\u3400-\u4DBF\uF900-\uFAFF\u{20000}-\u{2A6DF}\u{2A700}-\u{2B73F}\u{2B740}-\u{2B81F}\u{2F800}-\u{2FA1F}\u2F00-\u2FD5\u2E80-\u2EF3\uFF02\uFF03\uFF04\uFF05\uFF06\uFF07\uFF08\uFF09\uFF0A\uFF0B\uFF0C\uFF0D\uFF0F\uFF1A\uFF1B\uFF1C\uFF1D\uFF1E\uFF20\uFF3B\uFF3C\uFF3D\uFF3E\uFF3F\uFF40\uFF5B\uFF5C\uFF5D\uFF5E\uFF5F\uFF60\uFF62\uFF63\uFF64\u3000\u3001\u3003\u3008\u3009\u300A\u300B\u300C\u300D\u300E\u300F\u3010\u3011\u3014\u3015\u3016\u3017\u3018\u3019\u301A\u301B\u301C\u301D\u301E\u301F\u3030\u303E\u303F\u2013\u2014\u2018\u2019\u201B\u201C\u201D\u201E\u201F\u2026\u2027\uFE4F\uFE51\uFE54\u00B7]*[\uFF01\uFF1F\uFF61\u3002][\u300D\uFE42\u201D\u300F\u2019\u300B\uFF09\uFF3D\uFF5D\u3015\u3017\u3019\u301B\u3009\u3011]*/gmu;
const str = `我的中文不好。我是意大利人。你知道吗？`;
let m;

// The "u" flag is required for the \u{...} code point escapes above.
while ((m = regex.exec(str)) !== null) {
  // This is necessary to avoid infinite loops with zero-width matches
  if (m.index === regex.lastIndex) {
    regex.lastIndex++;
  }

  // The result can be accessed through the `m`-variable.
  m.forEach((match, groupIndex) => {
    console.log(`Found match, group ${groupIndex}: ${match}`);
  });
}

Regex to replace chinese punctuation with English comma in Python

One way to do it with the re module:

import re
# Avoid naming the variable "str", which would shadow the built-in type
text = '上海，北京、武汉；重庆。欢迎你！你好'
s = re.sub(r'[^\w\s]', ',', text)
print(s)

Output:

上海,北京,武汉,重庆,欢迎你,你好

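Explanation:

[^\w\s] matches any single character that is not in one of the classes below:

1. \w matches any word character. For str patterns in Python 3 this is Unicode-aware, so it covers Chinese characters as well as [a-zA-Z0-9_]
2. \s matches any whitespace character ([\r\n\t\f\v ] plus other Unicode whitespace)

Everything else, which includes both ASCII and Chinese punctuation, is replaced with a comma.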
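The approach depends on \w being Unicode-aware. The short sketch below (the to_commas helper and its ascii_only parameter are made up for illustration) shows what changes when the re.ASCII flag restricts \w to [a-zA-Z0-9_]: the Chinese characters themselves stop counting as word characters and get replaced too.

```python
import re


def to_commas(text, ascii_only=False):
    """Replace every character that is neither \\w nor \\s with an ASCII comma."""
    # re.ASCII narrows \w and \s to their ASCII-only definitions.
    flags = re.ASCII if ascii_only else 0
    return re.sub(r"[^\w\s]", ",", text, flags=flags)


print(to_commas("上海，北京、武汉"))
print(to_commas("上海，北京", ascii_only=True))
```

So if the pattern ever appears to wipe out the Chinese text, an ASCII flag (or a bytes pattern) is one thing worth checking.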

Regular expression to match and split on chinese comma in JavaScript

An ASCII comma won't match the comma you have in Chinese text. Either replace the ASCII comma (\x2C) with the Chinese one (\uFF0C), or use a character class [,，] to match both:

var str = "继续，取消   继续 ,取消";
console.log(str.split(/\s*[,，]\s*/));
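The same idea carries over to other regex engines. As a rough Python sketch (the split_on_commas name is illustrative, not from the original answer), a character class holding both commas splits on either one and swallows the surrounding whitespace:

```python
import re

# Character class containing the ASCII comma (U+002C) and the fullwidth comma (U+FF0C).
COMMA = re.compile(r"\s*[,\uFF0C]\s*")


def split_on_commas(text):
    """Split on either kind of comma, dropping whitespace around the separator."""
    return COMMA.split(text)


print(split_on_commas("继续，取消   继续 ,取消"))
```

Whitespace that is not adjacent to a comma, such as the run between 取消 and 继续, is left untouched.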

How to Check If The Rune is Chinese Punctuation Character in Go

Punctuation marks are scattered across different Unicode blocks.



The Unicode® Standard

Version 14.0 – Core Specification

Chapter 6

Writing Systems and Punctuation

https://www.unicode.org/versions/latest/ch06.pdf

Punctuation. The rest of this chapter deals with a special case: punctuation marks, which tend to be scattered about in different blocks and which may be used in common by many scripts. Punctuation characters occur in several widely separated places in the blocks, including Basic Latin, Latin-1 Supplement, General Punctuation, Supplemental Punctuation, and CJK Symbols and Punctuation. There are also occasional punctuation characters in blocks for specific scripts.


Here are two of your examples,

〜 Wave Dash U+301C

。Ideographic Full Stop U+3002



package main

import (
	"fmt"
	"unicode"
)

func main() {
	// CJK Symbols and Punctuation Unicode block
	for r := rune('\u3000'); r <= '\u303F'; r++ {
		if unicode.IsPunct(r) {
			fmt.Printf("%[1]U\t%[1]c\n", r)
		}
	}
}

https://go.dev/play/p/WoJjM6JKTYR

U+3001  、
U+3002 。
U+3003 〃
U+3008 〈
U+3009 〉
U+300A 《
U+300B 》
U+300C 「
U+300D 」
U+300E 『
U+300F 』
U+3010 【
U+3011 】
U+3014 〔
U+3015 〕
U+3016 〖
U+3017 〗
U+3018 〘
U+3019 〙
U+301A 〚
U+301B 〛
U+301C 〜
U+301D 〝
U+301E 〞
U+301F 〟
U+3030 〰
U+303D 〽
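Because the check is based on the Unicode general category rather than on a block, the same test can be written in other languages. Below is a rough Python counterpart using the standard unicodedata module; the helper names is_punct and punct_in_block are assumptions for illustration. It treats any category starting with "P" as punctuation, which is the category unicode.IsPunct tests for.

```python
import unicodedata


def is_punct(ch):
    """True if the character's general category is one of the P* classes."""
    return unicodedata.category(ch).startswith("P")


def punct_in_block(start=0x3000, end=0x303F):
    """List punctuation in a code point range (default: CJK Symbols and Punctuation)."""
    return [chr(cp) for cp in range(start, end + 1) if is_punct(chr(cp))]


for ch in punct_in_block():
    print(f"U+{ord(ch):04X}\t{ch}")
```

Since the test is per character, it also covers fullwidth marks such as ， (U+FF0C) that live outside the CJK Symbols and Punctuation block.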

Javascript unicode string, chinese character but no punctuation

You can see the relevant blocks at http://www.unicode.org/reports/tr38/#BlockListing or http://www.unicode.org/charts/ .

If you are excluding compatibility characters (ones which should no longer be used), as well as strokes, radicals, and Enclosed CJK Letters and Months, the following ought to cover it (I've added the individual JavaScript equivalent expressions afterward):

  • CJK Unified Ideographs (4E00-9FCC) [\u4E00-\u9FCC]
  • CJK Unified Ideographs Extension A (3400-4DB5) [\u3400-\u4DB5]
  • CJK Unified Ideographs Extension B (20000-2A6D6) [\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6]
  • CJK Unified Ideographs Extension C (2A700-2B734) \ud869[\udf00-\udfff]|[\ud86a-\ud86c][\udc00-\udfff]|\ud86d[\udc00-\udf34]
  • CJK Unified Ideographs Extension D (2B740-2B81D) \ud86d[\udf40-\udfff]|\ud86e[\udc00-\udc1d]
  • 12 characters within the CJK Compatibility Ideographs (F900-FA6D/FA70-FAD9) but which are actually CJK unified ideographs [\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]

...so, a regex to grab the Chinese characters would be:

/[\u4E00-\u9FCC\u3400-\u4DB5\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]|[\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6\udf00-\udfff]|[\ud86a-\ud86c][\udc00-\udfff]|\ud86d[\udc00-\udf34\udf40-\udfff]|\ud86e[\udc00-\udc1d]/

Because CJK (Chinese-Japanese-Korean) has so many characters, Unicode was expanded beyond the "Basic Multilingual Plane"; characters outside it are called "astral" characters. The CJK Unified Ideographs extensions B-D are astral, so their ranges are more complicated: in UTF-16 systems like JavaScript they have to be encoded using surrogate pairs. A surrogate pair consists of a high surrogate and a low surrogate; neither is valid by itself, but joined together they form a single character (even though the string length is 2).
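To see where the surrogate ranges in the list come from, the arithmetic can be sketched in a few lines of Python. The to_surrogates function name is illustrative; the formula itself is the standard UTF-16 encoding of a code point above U+FFFF.

```python
def to_surrogates(code_point):
    """Return the (high, low) UTF-16 surrogate pair for an astral code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


# Bounds of Extension B and Extension D as listed above
for cp in (0x20000, 0x2A6D6, 0x2B740, 0x2B81D):
    high, low = to_surrogates(cp)
    print(f"U+{cp:X} -> \\u{high:04x} \\u{low:04x}")
```

The start and end of Extension B map to \ud840\udc00 and \ud869\uded6, which is why that block needs two alternatives in the regex: one for the full high-surrogate rows and one for the partial last row.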

While it would probably be easier for replacement purposes to express this as the non-Chinese characters (to replace them with the empty string), I provided the expression for the Chinese characters instead so that it would be easier to track in case you needed to add or remove from the blocks.

Update September 2017

As of ES6, one may express the regular expressions without resorting to surrogates by using the "u" flag along with the code point inside of the new escape sequence with brackets, e.g., /^[\u{20000}-\u{2A6D6}]*$/u for "CJK Unified Ideographs Extension B".

Note that Unicode too has progressed to include "CJK Unified Ideographs Extension E" ([\u{2B820}-\u{2CEAF}]) and "CJK Unified Ideographs Extension F" ([\u{2CEB0}-\u{2EBEF}]).

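For ES2018, it appears that Unicode property escapes will be able to simplify things even further. Per http://2ality.com/2017/07/regexp-unicode-property-escapes.html , it looks like one will be able to write the following. Note, however, that the property escapes as finally standardized in ES2018 cover only General_Category, Script, Script_Extensions, and binary properties, not Block, so the Block-based patterns here apply only to regex engines that support the Block property: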

/^(\p{Block=CJK Unified Ideographs}|\p{Block=CJK Unified Ideographs Extension A}|\p{Block=CJK Unified Ideographs Extension B}|\p{Block=CJK Unified Ideographs Extension C}|\p{Block=CJK Unified Ideographs Extension D}|\p{Block=CJK Unified Ideographs Extension E}|\p{Block=CJK Unified Ideographs Extension F}|[\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29])+$/u

Because the shorter aliases from http://unicode.org/Public/UNIDATA/PropertyAliases.txt and http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt can also be used for these blocks, this could be shortened to the following (apparently underscores may also be written as spaces, and the casing changed, if desired):
/^(\p{Blk=CJK}|\p{Blk=CJK_Ext_A}|\p{Blk=CJK_Ext_B}|\p{Blk=CJK_Ext_C}|\p{Blk=CJK_Ext_D}|\p{Blk=CJK_Ext_E}|\p{Blk=CJK_Ext_F}|[\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29])+$/u

And if we wanted to improve readability, we could document the falsely labeled compatibility characters using named capture groups (see http://2ality.com/2017/05/regexp-named-capture-groups.html ):

/^(\p{Blk=CJK}|\p{Blk=CJK_Ext_A}|\p{Blk=CJK_Ext_B}|\p{Blk=CJK_Ext_C}|\p{Blk=CJK_Ext_D}|\p{Blk=CJK_Ext_E}|\p{Blk=CJK_Ext_F}|(?<CJKFalseCompatibilityUnifieds>[\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]))+$/u

And as it looks per http://unicode.org/reports/tr44/#Unified_Ideograph like the "Unified_Ideograph" property (alias "UIdeo") covers all of our unified ideographs while excluding symbols/punctuation and compatibility characters, if you don't need to pick and choose out of the above, the following may be all you need (in JavaScript, binary properties are written without a "=yes" value):

/^\p{Unified_Ideograph}*$/u

or in shorthand:

/^\p{UIdeo}*$/u
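Not every engine has property escapes. Python's built-in re module, for example, has no \p{...} support, so explicit code point ranges remain the practical option there. The sketch below simply reuses the block bounds quoted in this answer (they are not re-checked against the current Unicode version), and the HAN and is_all_han names are illustrative.

```python
import re

# Ranges copied from the block list and the Extension E/F note above.
HAN = (
    "\u4E00-\u9FCC"            # CJK Unified Ideographs
    "\u3400-\u4DB5"            # Extension A
    "\U00020000-\U0002A6D6"    # Extension B
    "\U0002A700-\U0002B734"    # Extension C
    "\U0002B740-\U0002B81D"    # Extension D
    "\U0002B820-\U0002CEAF"    # Extension E
    "\U0002CEB0-\U0002EBEF"    # Extension F
    # Unified ideographs located in the compatibility block
    "\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29"
)
HAN_ONLY = re.compile(f"[{HAN}]*")


def is_all_han(text):
    """True if the whole string consists only of the ideographs listed above."""
    return HAN_ONLY.fullmatch(text) is not None
```

Python strings are sequences of code points, so no surrogate handling is needed. Like the JavaScript patterns with *, this accepts the empty string.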

Match string that contains punctuations, emojis, special characters, some Chinese characters and alpha numeric

You could use a pattern such as ([^\/]+)\/\1Version.+

Pattern explanation:

([^\/]+) - [^\/]+ matches one or more characters other than / (a negated character class); the parentheses form a capturing group, so the matched text is stored in the first capturing group

\/ - match / literally

\1 - back reference to match the same text as was matched by first capturing group

Version - match Version literally

.+ - match one or more of any characters (to match rest of a string - this is optional and can be removed)
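A quick way to try the backreference is sketched below in Python. The original question's input is not shown here, so the sample strings in the usage lines are invented purely for illustration, as is the is_versioned helper name.

```python
import re

# Same pattern as above; "/" needs no escaping in Python.
PATTERN = re.compile(r"([^/]+)/\1Version.+")


def is_versioned(path):
    """True if the text before "/" is repeated right after it, followed by "Version"."""
    return PATTERN.fullmatch(path) is not None


print(is_versioned("app/appVersion1.0"))
print(is_versioned("app/libVersion1.0"))
```

Because [^/]+ excludes only the slash, the repeated part may contain punctuation, emoji, or Chinese characters without any change to the pattern.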

Update

To match the updated requirements, use ([^\/]+)\/\d[a-zA-Z\d.-]+

What's new is:

[a-zA-Z\d.-]+ - matches one or more characters from the set: a-z (lowercase letters), A-Z (uppercase letters), \d (digits), and . or - (dot or hyphen)

How to use regular expressions to deal with Chinese punctuation symbols in C++

Assuming you did mean to use a regular expression, rather than a character-by-character replacement function... Here's what I meant by using std::regex_replace. There's probably a more elegant regex that generalizes with fewer surprises, but at least this works for your example.

#include <regex>
#include <string>

int main()
{
    std::wstring s(L"有人可能会问:“那情绪、欲望、冲动、强迫症有什么区别呢?”");

    // Replace each run of punctuation with a space; use ECMAScript grammar
    s = std::regex_replace(s, std::wregex(L"[[:punct:]]+"), L" ");

    // Remove extra space at ends of line
    s = std::regex_replace(s, std::wregex(L"^ | $"), L"");

    return (s != L"有人可能会问 那情绪 欲望 冲动 强迫症有什么区别呢"); // returns 0
}
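Whether [[:punct:]] covers Chinese punctuation in C++ depends on the locale and the standard library. As a locale-independent comparison, here is a small Python sketch of the same transformation based on Unicode general categories; the punct_to_spaces name is an assumption, and unlike the C++ version it also collapses pre-existing runs of whitespace.

```python
import unicodedata


def punct_to_spaces(text):
    """Replace each run of punctuation with one space and trim both ends."""
    spaced = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    # split() with no argument drops leading/trailing and repeated whitespace.
    return " ".join(spaced.split())


print(punct_to_spaces("有人可能会问：“那情绪、欲望、冲动、强迫症有什么区别呢？”"))
```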

RegEx for all letters (including Chinese, Greek, etc.)

Have you given XRegExp and the Unicode plugin a try/look?

<script src="xregexp.js"></script>
<script src="xregexp-unicode.js"></script>
<script>
var unicodeWord = XRegExp("^\\p{L}+$");
alert(unicodeWord.test("Ниндзя")); // -> true
</script>
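Outside the browser, the \p{L} test can be approximated without a regex library. In this Python sketch (the is_all_letters name is illustrative), a string passes if it is non-empty and every character has a general category starting with "L", which is what \p{L} denotes.

```python
import unicodedata


def is_all_letters(text):
    """Rough equivalent of ^\\p{L}+$ using Unicode general categories."""
    return bool(text) and all(
        unicodedata.category(ch).startswith("L") for ch in text
    )


print(is_all_letters("Ниндзя"))
print(is_all_letters("中文"))
print(is_all_letters("abc1"))
```

As with \p{L}, combining marks (category M) are not letters, so decomposed accented text would need normalizing first or an extended check.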

