Unicode equivalents for \w and \b in Java regular expressions?
Source code
The source code for the rewriting functions I discuss below is available here.
Update in Java 7
Sun’s updated Pattern class for JDK7 has a marvelous new flag, UNICODE_CHARACTER_CLASS, which makes everything work right again. It’s available as an embeddable (?U) inside the pattern, so you can use it with the String class’s wrappers, too. It also sports corrected definitions for various other properties. It now tracks The Unicode Standard, in both RL1.2 and RL1.2a from UTS#18: Unicode Regular Expressions. This is an exciting and dramatic improvement, and the development team is to be commended for this important effort.
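As a minimal sketch of the flag in use (assuming JDK 7 or later; the class and method names are made up for illustration), the same \w+ pattern can be tried with and without it:

```java
import java.util.regex.Pattern;

class UnicodeFlagDemo {
    // Default behaviour: \w is ASCII-only, i.e. [a-zA-Z_0-9].
    static boolean asciiWord(String s) {
        return s.matches("\\w+");
    }

    // The embedded (?U) flag works through String.matches as well.
    static boolean unicodeWord(String s) {
        return s.matches("(?U)\\w+");
    }

    // The same thing via the compile-time flag.
    static boolean unicodeWordCompiled(String s) {
        return Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s)
                      .matches();
    }

    public static void main(String[] args) {
        String word = "\u00E9lan"; // e-acute written as an escape to stay encoding-neutral
        System.out.println(asciiWord(word));
        System.out.println(unicodeWord(word));
        System.out.println(unicodeWordCompiled(word));
    }
}
```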
Java’s Regex Unicode Problems
The problem with Java regexes is that the Perl 1.0 charclass escapes (meaning \w, \b, \s, \d and their complements) are not in Java extended to work with Unicode. Alone amongst these, \b enjoys certain extended semantics, but these map neither to \w, nor to Unicode identifiers, nor to Unicode line-break properties.
Additionally, the POSIX properties in Java are accessed this way:
POSIX syntax Java syntax
[[:Lower:]] \p{Lower}
[[:Upper:]] \p{Upper}
[[:ASCII:]] \p{ASCII}
[[:Alpha:]] \p{Alpha}
[[:Digit:]] \p{Digit}
[[:Alnum:]] \p{Alnum}
[[:Punct:]] \p{Punct}
[[:Graph:]] \p{Graph}
[[:Print:]] \p{Print}
[[:Blank:]] \p{Blank}
[[:Cntrl:]] \p{Cntrl}
[[:XDigit:]] \p{XDigit}
[[:Space:]] \p{Space}
This is a real mess, because it means that things like Alpha, Lower, and Space do not in Java map to the Unicode Alphabetic, Lowercase, or Whitespace properties. This is exceedingly annoying. Java’s Unicode property support is strictly antemillennial, by which I mean it supports no Unicode property that has come out in the last decade.
Not being able to talk about whitespace properly is super-annoying. Consider the following table. For each of those code points, there is both a J-results column
for Java and a P-results column for Perl or any other PCRE-based regex engine:
Regex                 001A    0085    00A0    2029
                      J  P    J  P    J  P    J  P
\s                    1  1    0  1    0  1    0  1
\pZ                   0  0    0  0    1  1    1  1
\p{Zs}                0  0    0  0    1  1    0  0
\p{Space}             1  1    0  1    0  1    0  1
\p{Blank}             0  0    0  0    0  1    0  0
\p{Whitespace}        -  1    -  1    -  1    -  1
\p{javaWhitespace}    1  -    0  -    0  -    1  -
\p{javaSpaceChar}     0  -    0  -    1  -    1  -
See that?
Virtually every one of those Java white space results is ̲w̲r̲o̲n̲g̲ according to Unicode. It’s a really big problem. Java is just messed up, giving answers that are “wrong” according to existing practice and also according to Unicode. Plus Java doesn’t even give you access to the real Unicode properties! In fact, Java does not support any property that corresponds to Unicode whitespace.
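A few of the J-column entries for \s can be reproduced with a small probe. This is only a sketch: the class and helper names are invented, and the (?U) comparison assumes JDK 7 or later.

```java
class WhitespaceDemo {
    // Build a one-code-point string without relying on source-file encoding.
    static String cp(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Java's default \s, which covers only [ \t\n\x0B\f\r].
    static boolean defaultSpace(int codePoint) {
        return cp(codePoint).matches("\\s");
    }

    // \s under UNICODE_CHARACTER_CLASS, which follows the White_Space property.
    static boolean unicodeSpace(int codePoint) {
        return cp(codePoint).matches("(?U)\\s");
    }

    public static void main(String[] args) {
        int[] samples = {0x0085, 0x00A0, 0x2029};
        for (int c : samples) {
            System.out.printf("U+%04X default=%b unicode=%b%n",
                              c, defaultSpace(c), unicodeSpace(c));
        }
    }
}
```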
The Solution to All Those Problems, and More
To deal with this and many other related problems, yesterday I wrote a Java function that rewrites a pattern string, replacing these 14 charclass escapes:
\w \W \s \S \v \V \h \H \d \D \b \B \X \R
by replacing them with things that actually work to match Unicode in a predictable and consistent fashion. It’s only an alpha prototype from a single hack session, but it is completely functional.
The short story is that my code rewrites those 14 as follows:
\s => [\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]
\S => [^\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]
\v => [\u000A-\u000D\u0085\u2028\u2029]
\V => [^\u000A-\u000D\u0085\u2028\u2029]
\h => [\u0009\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]
\H => [^\u0009\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]
\w => [\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]
\W => [^\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]
\b => (?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))
\B => (?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))
\d => \p{Nd}
\D => \P{Nd}
\R => (?:(?>\u000D\u000A)|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])
\X => (?>\PM\pM*)
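To show how such replacements look once they are written as Java string literals, here is a small sketch that compiles the \s and \w expansions from the list above and probes a few characters. The class and method names are hypothetical, not part of the original source.

```java
import java.util.regex.Pattern;

class RewrittenClassesDemo {
    // Expansion listed above for \s.
    static final Pattern UNICODE_SPACE = Pattern.compile(
        "[\\u0009-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E\\u2000-\\u200A"
        + "\\u2028\\u2029\\u202F\\u205F\\u3000]");

    // Expansion listed above for \w.
    static final Pattern UNICODE_WORD = Pattern.compile(
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]");

    static boolean isSpace(String s) {
        return UNICODE_SPACE.matcher(s).matches();
    }

    static boolean isWord(String s) {
        return UNICODE_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSpace("\u00A0")); // NO-BREAK SPACE
        System.out.println(isWord("\u00E9"));  // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(isWord("\u24B6"));  // CIRCLED LATIN CAPITAL LETTER A
        System.out.println(isWord("-"));
    }
}
```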
Some things to consider...
- That uses for its \X definition what Unicode now refers to as a legacy grapheme cluster, not an extended grapheme cluster, as the latter is rather more complicated. Perl itself now uses the fancier version, but the old version is still perfectly workable for the most common situations. EDIT: See addendum at bottom.
- What to do about \d depends on your intent, but the default is the Unicode definition. I can see people not always wanting \p{Nd}, but sometimes either [0-9] or \pN.
- The two boundary definitions, \b and \B, are specifically written to use the \w definition.
- That \w definition is overly broad, because it grabs the parenned letters not just the circled ones. The Unicode Other_Alphabetic property isn’t available until JDK7, so that’s the best you can do.
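The legacy grapheme cluster pattern is easy to try out. The sketch below (hypothetical class and method names) counts clusters with the (?>\PM\pM*) expansion. One caveat of this simple form is that a combining mark with no base character in front of it is not counted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LegacyGraphemeDemo {
    // One non-mark code point followed by any number of marks.
    static final Pattern LEGACY_CLUSTER = Pattern.compile("(?>\\PM\\pM*)");

    static int countClusters(String s) {
        Matcher m = LEGACY_CLUSTER.matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301a"; // e + COMBINING ACUTE ACCENT, then a
        System.out.println(decomposed.length());       // UTF-16 code units
        System.out.println(countClusters(decomposed)); // user-perceived characters
    }
}
```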
Exploring Boundaries
Boundaries have been a problem ever since Larry Wall first coined the \b and \B syntax for talking about them for Perl 1.0 back in 1987. The key to understanding how \b and \B both work is to dispel two pervasive myths about them:
- They are only ever looking for \w word characters, never for non-word characters.
- They do not specifically look for the edge of the string.
A \b
boundary means:
IF does follow word
THEN doesn't precede word
ELSIF doesn't follow word
THEN does precede word
And those are all defined perfectly straightforwardly as:
- follows word is (?<=\w).
- precedes word is (?=\w).
- doesn’t follow word is (?<!\w).
- doesn’t precede word is (?!\w).
Therefore, since IF-THEN is encoded as an anded-together AB in regexes, an or is X|Y, and because the and is higher in precedence than or, that is simply AB|CD. So every \b that means a boundary can be safely replaced with:
(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))
with the \w
defined in the appropriate way.
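As a sketch of how that replacement can be assembled in Java, the following builds the boundary from a word class and lists where it matches. The WORD class here is a simplified, hypothetical stand-in for whichever \w definition you choose, and the class and method names are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Simplified stand-in for the chosen \w definition (an assumption for this sketch).
    static final String WORD = "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}]";

    // (follows word AND doesn't precede word) OR (doesn't follow word AND precedes word)
    static final String BOUNDARY =
        "(?:(?<=" + WORD + ")(?!" + WORD + ")|(?<!" + WORD + ")(?=" + WORD + "))";

    // Collect the offset of every zero-width boundary match.
    static List<Integer> boundaryPositions(String s) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile(BOUNDARY).matcher(s);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(boundaryPositions("\u00E9lan vital"));
    }
}
```

Note that the non-ASCII first letter gets a boundary in front of it because the word class is Unicode-aware.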
(You might think it strange that the A
and C
components are opposites. In a perfect world, you should be able to write that AB|D
, but for a while I was chasing down mutual exclusion contradictions in Unicode properties — which I think I’ve taken care of, but I left the double condition in the boundary just in case. Plus this makes it more extensible if you get extra ideas later.)
For the \B
non-boundaries, the logic is:
IF does follow word
THEN does precede word
ELSIF doesn't follow word
THEN doesn't precede word
Allowing all instances of \B
to be replaced with:
(?:(?<=\w)(?=\w)|(?<!\w)(?!\w))
This really is how \b
and \B
behave. Equivalent patterns for them are
- \b using the ((IF)THEN|ELSE) construct is (?(?<=\w)(?!\w)|(?=\w))
- \B using the ((IF)THEN|ELSE) construct is (?(?=\w)(?<=\w)|(?<!\w))
But the versions with just AB|CD
are fine, especially if you lack conditional patterns in your regex language — like Java. ☹
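One property that follows from the two AB|CD forms is that every position in a string is matched by exactly one of them. The sketch below checks that for strings of BMP characters. It is only an illustration with invented names and a simplified word class, and is far smaller than the test suite described in this answer.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryPartitionDemo {
    // Simplified stand-in for the chosen \w definition (an assumption for this sketch).
    static final String W = "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}]";

    static final Pattern BOUNDARY = Pattern.compile(
        "(?:(?<=" + W + ")(?!" + W + ")|(?<!" + W + ")(?=" + W + "))");
    static final Pattern NOT_BOUNDARY = Pattern.compile(
        "(?:(?<=" + W + ")(?=" + W + ")|(?<!" + W + ")(?!" + W + "))");

    // Offsets of all zero-width matches of p in s.
    static Set<Integer> positions(Pattern p, String s) {
        Set<Integer> found = new TreeSet<>();
        Matcher m = p.matcher(s);
        while (m.find()) {
            found.add(m.start());
        }
        return found;
    }

    // True when the two patterns split offsets 0..length between them with no overlap.
    // Assumes s holds only BMP characters, so every char offset is a code point offset.
    static boolean partitions(String s) {
        Set<Integer> b = positions(BOUNDARY, s);
        Set<Integer> nb = positions(NOT_BOUNDARY, s);
        return Collections.disjoint(b, nb) && b.size() + nb.size() == s.length() + 1;
    }

    public static void main(String[] args) {
        String sample = "a1_ \u00E9!";
        System.out.println(positions(BOUNDARY, sample));
        System.out.println(positions(NOT_BOUNDARY, sample));
        System.out.println(partitions(sample));
    }
}
```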
I’ve already verified the behaviour of the boundaries using all three equivalent definitions with a test suite that checks 110,385,408 matches per run, and which I've run on a dozen different data configurations according to:
0 .. 7F the ASCII range
80 .. FF the non-ASCII Latin1 range
100 .. FFFF the non-Latin1 BMP (Basic Multilingual Plane) range
10000 .. 10FFFF the non-BMP portion of Unicode (the "astral" planes)
However, people often want a different sort of boundary. They want something that is whitespace and edge-of-string aware:
- left edge as (?:(?<=^)|(?<=\s))
- right edge as (?=$|\s)
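A case where these edges help is a token that ends in a non-word character, where \b gives no boundary before a following space. The sketch below (hypothetical class and method names) wraps a quoted literal in the two edges. The \s used here is Java's default ASCII one, so substitute a Unicode-aware class if you need it.

```java
import java.util.regex.Pattern;

class TokenEdgeDemo {
    static final String LEFT_EDGE = "(?:(?<=^)|(?<=\\s))";
    static final String RIGHT_EDGE = "(?=$|\\s)";

    // True when token occurs delimited by whitespace or the string edges.
    static boolean containsToken(String text, String token) {
        Pattern p = Pattern.compile(LEFT_EDGE + Pattern.quote(token) + RIGHT_EDGE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsToken("I like C++ a lot", "C++"));
        System.out.println(containsToken("C++11 rocks", "C++"));
    }
}
```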
Fixing Java with Java
The code I posted in my other answer provides this and quite a few other conveniences. This includes definitions for natural-language words, dashes, hyphens, and apostrophes, plus a bit more.
It also allows you to specify Unicode characters in logical code points, not in idiotic UTF-16 surrogates. It’s hard to overstress how important that is! And that’s just for the string expansion.
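The author's own helper for this is not shown here. As a separate illustration using only what the JDK itself provides (JDK 7 or later), a pattern can name a supplementary code point with the \x{...} escape, and a string can be built from a code point with Character.toChars. The class and method names below are invented.

```java
class CodePointDemo {
    // Build a string from a logical code point rather than from surrogates.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Match s against a pattern that names the code point directly.
    static boolean matchesCodePoint(String s, int codePoint) {
        String regex = String.format("\\x{%X}", codePoint);
        return s.matches(regex);
    }

    public static void main(String[] args) {
        String clef = fromCodePoint(0x1D11E); // MUSICAL SYMBOL G CLEF
        System.out.println(clef.length());                       // UTF-16 code units
        System.out.println(clef.codePointCount(0, clef.length())); // code points
        System.out.println(matchesCodePoint(clef, 0x1D11E));
    }
}
```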
For regex charclass substitution that makes the charclass in your Java regexes finally work on Unicode, and work correctly, grab the full source from here. You may do with it as you please, of course. If you make fixes to it, I’d love to hear of it, but you don’t have to. It’s pretty short. The guts of the main regex rewriting function is simple:
switch (code_point) {
    case 'b':  newstr.append(boundary);
               break; /* switch */
    case 'B':  newstr.append(not_boundary);
               break; /* switch */
    case 'd':  newstr.append(digits_charclass);
               break; /* switch */
    case 'D':  newstr.append(not_digits_charclass);
               break; /* switch */
    case 'h':  newstr.append(horizontal_whitespace_charclass);
               break; /* switch */
    case 'H':  newstr.append(not_horizontal_whitespace_charclass);
               break; /* switch */
    case 'v':  newstr.append(vertical_whitespace_charclass);
               break; /* switch */
    case 'V':  newstr.append(not_vertical_whitespace_charclass);
               break; /* switch */
    case 'R':  newstr.append(linebreak);
               break; /* switch */
    case 's':  newstr.append(whitespace_charclass);
               break; /* switch */
    case 'S':  newstr.append(not_whitespace_charclass);
               break; /* switch */
    case 'w':  newstr.append(identifier_charclass);
               break; /* switch */
    case 'W':  newstr.append(not_identifier_charclass);
               break; /* switch */
    case 'X':  newstr.append(legacy_grapheme_cluster);
               break; /* switch */
    default:   newstr.append('\\');
               newstr.append(Character.toChars(code_point));
               break; /* switch */
}
saw_backslash = false;
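To make the surrounding loop concrete, here is a much-reduced, self-contained sketch of the same idea. It is not the author's function: the names are invented, only \d, \D and \s are handled, and it makes no attempt to track \Q...\E quoting.

```java
class EscapeRewriterSketch {
    // Replacement strings; the whitespace class is the \s expansion listed earlier.
    static final String DIGITS = "\\p{Nd}";
    static final String NOT_DIGITS = "\\P{Nd}";
    static final String WHITESPACE =
        "[\\u0009-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E\\u2000-\\u200A"
        + "\\u2028\\u2029\\u202F\\u205F\\u3000]";

    static String rewrite(String pattern) {
        StringBuilder out = new StringBuilder();
        boolean sawBackslash = false;
        for (int i = 0; i < pattern.length(); ) {
            int codePoint = pattern.codePointAt(i);
            i += Character.charCount(codePoint);
            if (!sawBackslash) {
                if (codePoint == '\\') {
                    sawBackslash = true; // decide what to emit on the next code point
                } else {
                    out.appendCodePoint(codePoint);
                }
                continue;
            }
            switch (codePoint) {
                case 'd': out.append(DIGITS); break;
                case 'D': out.append(NOT_DIGITS); break;
                case 's': out.append(WHITESPACE); break;
                default:  out.append('\\').appendCodePoint(codePoint); break;
            }
            sawBackslash = false;
        }
        if (sawBackslash) {
            out.append('\\'); // keep a trailing lone backslash untouched
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("\\d+\\s*"));
        System.out.println("\u0663".matches(rewrite("\\d"))); // ARABIC-INDIC DIGIT THREE
    }
}
```

Because an escaped backslash falls through to the default branch as a pair, a literal backslash followed by d is left alone.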
Anyway, that code is just an alpha release, stuff I hacked up over the weekend. It won’t stay that way.
For the beta I intend to:
- fold together the code duplication
- provide a clearer interface regarding unescaping string escapes versus augmenting regex escapes
- provide some flexibility in the \d expansion, and maybe the \b
- provide convenience methods that handle turning around and calling Pattern.compile or String.matches or whatnot for you
For production release, it should have javadoc and a JUnit test suite. I may include my gigatester, but it’s not written as JUnit tests.
Addendum
I have good news and bad news.
The good news is that I’ve now got a very close approximation to an extended grapheme cluster to use for an improved \X
.
The bad news ☺ is that that pattern is:
(?:(?:\u000D\u000A)|(?:[\u0E40\u0E41\u0E42\u0E43\u0E44\u0EC0\u0EC1\u0EC2\u0EC3\u0EC4\uAAB5\uAAB6\uAAB9\uAABB\uAABC]*(?:[\u1100-\u115F\uA960-\uA97C]+|([\u1100-\u115F\uA960-\uA97C]*((?:[[\u1160-\u11A2\uD7B0-\uD7C6][\uAC00\uAC1C\uAC38]][\u1160-\u11A2\uD7B0-\uD7C6]*|[\uAC01\uAC02\uAC03\uAC04])[\u11A8-\u11F9\uD7CB-\uD7FB]*))|[\u11A8-\u11F9\uD7CB-\uD7FB]+|[^[\p{Zl}\p{Zp}\p{Cc}\p{Cf}&&[^\u000D\u000A\u200C\u200D]]\u000D\u000A])[[\p{Mn}\p{Me}\u200C\u200D\u0488\u0489\u20DD\u20DE\u20DF\u20E0\u20E2\u20E3\u20E4\uA670\uA671\uA672\uFF9E\uFF9F][\p{Mc}\u0E30\u0E32\u0E33\u0E45\u0EB0\u0EB2\u0EB3]]*)|(?s:.))
which in Java you’d write as:
String extended_grapheme_cluster = "(?:(?:\\u000D\\u000A)|(?:[\\u0E40\\u0E41\\u0E42\\u0E43\\u0E44\\u0EC0\\u0EC1\\u0EC2\\u0EC3\\u0EC4\\uAAB5\\uAAB6\\uAAB9\\uAABB\\uAABC]*(?:[\\u1100-\\u115F\\uA960-\\uA97C]+|([\\u1100-\\u115F\\uA960-\\uA97C]*((?:[[\\u1160-\\u11A2\\uD7B0-\\uD7C6][\\uAC00\\uAC1C\\uAC38]][\\u1160-\\u11A2\\uD7B0-\\uD7C6]*|[\\uAC01\\uAC02\\uAC03\\uAC04])[\\u11A8-\\u11F9\\uD7CB-\\uD7FB]*))|[\\u11A8-\\u11F9\\uD7CB-\\uD7FB]+|[^[\\p{Zl}\\p{Zp}\\p{Cc}\\p{Cf}&&[^\\u000D\\u000A\\u200C\\u200D]]\\u000D\\u000A])[[\\p{Mn}\\p{Me}\\u200C\\u200D\\u0488\\u0489\\u20DD\\u20DE\\u20DF\\u20E0\\u20E2\\u20E3\\u20E4\\uA670\\uA671\\uA672\\uFF9E\\uFF9F][\\p{Mc}\\u0E30\\u0E32\\u0E33\\u0E45\\u0EB0\\u0EB2\\u0EB3]]*)|(?s:.))";
¡Tschüß!
Java regex for support Unicode?
What you are looking for are Unicode properties.
e.g. \p{L}
is any kind of letter from any language
So a regex to match such a Chinese word could be something like
\p{L}+
There are many such properties, for more details see regular-expressions.info
Another option is to use the modifier Pattern.UNICODE_CHARACTER_CLASS.
In Java 7 there is a new flag, Pattern.UNICODE_CHARACTER_CLASS, that enables the Unicode version of the predefined character classes; see my answer here for some more details and links.
You could do something like this:
Pattern p = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
and \w would match all letters and all digits from any language (and of course some connector punctuation characters like _).
Regular expression for unicode in java Dash version
If you need to match any non-word but space, you may use
reference = reference.replaceAll("[^\\w ]", "-");
Or, with character class subtraction:
reference = reference.replaceAll("[\\W&&[^ ]]", "-");
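A quick way to convince yourself that the two forms agree is to run both on the same input. The sketch below uses invented class and method names. Note that with the default flags \w is ASCII-only, so a non-ASCII letter is also replaced.

```java
class NonWordReplaceDemo {
    // Negated class: anything that is neither a word character nor a space.
    static String viaNegatedClass(String s) {
        return s.replaceAll("[^\\w ]", "-");
    }

    // Intersection: non-word characters, minus the space.
    static String viaIntersection(String s) {
        return s.replaceAll("[\\W&&[^ ]]", "-");
    }

    public static void main(String[] args) {
        String input = "foo bar/baz!";
        System.out.println(viaNegatedClass(input));
        System.out.println(viaIntersection(input));
        System.out.println(viaNegatedClass("caf\u00E9")); // e-acute is not ASCII \w
    }
}
```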
You can use the following pattern to match your hyphen or dash like patterns:
[\p{Pd}\u00AD\u2212]
Here,
- \p{Pd} matches any Punctuation, Dash symbol
- \u00AD matches a soft hyphen
- \u2212 matches a minus symbol.
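For instance, a small helper along these lines (the class and method names are made up) could normalize all of those characters to an ASCII hyphen:

```java
class DashNormalizeDemo {
    // Replace every dash-like character with a plain ASCII hyphen-minus.
    static String normalizeDashes(String s) {
        return s.replaceAll("[\\p{Pd}\\u00AD\\u2212]", "-");
    }

    public static void main(String[] args) {
        // EN DASH, MINUS SIGN and SOFT HYPHEN, written as escapes
        System.out.println(normalizeDashes("a\u2013b\u2212c\u00ADd"));
    }
}
```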
Matching (e.g.) a Unicode letter with Java regexps
Here you have a very nice explanation:
http://www.regular-expressions.info/unicode.html
Some hints:
"Java and .NET unfortunately do not support \X
(yet). Use \P{M}\p{M}*
as a substitute. To match any number of graphemes, use (?:\P{M}\p{M}*)+
instead of \X+
."
"In Java, the regex token \uFFFF
only matches the specified code point, even when you turned on canonical equivalence. However, the same syntax \uFFFF
is also used to insert Unicode characters into literal strings in the Java source code. Pattern.compile("\u00E0")
will match both the single-code-point and double-code-point encodings of à
, while Pattern.compile("\\u00E0")
matches only the single-code-point version. Remember that when writing a regex as a Java string literal, backslashes must be escaped. The former Java code compiles the regex à
, while the latter compiles \u00E0
. Depending on what you're doing, the difference may be significant."
Use of \b Boundary Matcher In Java
\b
is what you can call an "anchor": it will match a position in the input text.
More specifically, \b
will match every position in the input text where:
- there is no preceding character and the following character is a word character (any letter or digit, or an underscore);
- there is no following character and the preceding character is a word character;
- the preceding character is a word character and the following character is not; or
- the following character is a word character and the preceding character is not.
For instance, the regex dog\b
in the text "my dog eats"
will match the position immediately after the g
of dog
(which is a word character) and before the following space (which is not).
Note that like all anchors, the fact that it matches a position means that it does not consume any input text.
Other anchors are ^
, $
, lookarounds.
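The dog\b example can be checked with a short sketch (hypothetical names). The reported end offset equals the position of the boundary, and the match itself covers only the three letters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryAnchorDemo {
    static final Pattern DOG = Pattern.compile("dog\\b");

    // Offset just past the match, or -1 when there is no match.
    static int endOfMatch(String text) {
        Matcher m = DOG.matcher(text);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        System.out.println(endOfMatch("my dog eats")); // boundary before the space
        System.out.println(endOfMatch("my dogs eat")); // no boundary after the g
    }
}
```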
Java regex doesnt match outside of ascii range, behaves different than python regex
As suggested by Wiktor in the comments, you could use (?U)
to turn on the flag UNICODE_CHARACTER_CLASS
. While this does allow matching äöa
, this still doesn't match m²
. That's because UNICODE_CHARACTER_CLASS
with \w
doesn't recognize ²
as a valid alphanumeric character. As a replacement for \w, you can use [\pN\pL_]. This matches Unicode numbers \pN and Unicode letters \pL (plus _). The \pN Unicode character class includes the \p{No} (Other Number) category, which contains the superscripts ²³¹ from the Latin-1 Supplement block. Alternatively, you could just add \p{No} to a character class with \w. Note that Java requires braces around multi-letter property names: \pNo would be parsed as \pN followed by a literal o. This means the following regular expressions correctly match your strings:
[\pN\pL_]{2,}       # Matches any Unicode number or letter, and underscore
(?U)[\w\p{No}]{2,}  # Uses UNICODE_CHARACTER_CLASS so that \w matches Unicode.
                    # Adds \p{No} to additionally match ²³¹
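The behaviour can be confirmed with a few one-liners. This is a sketch with invented names, assuming JDK 7 or later. The last helper illustrates a Java parsing detail worth knowing: a property name written without braces is a single letter, so a trailing o after \pN is taken as a literal character.

```java
class SuperscriptDigitDemo {
    // The letter m followed by SUPERSCRIPT TWO, written as an escape.
    static final String SQUARE_METRE = "m\u00B2";

    static boolean plainUnicodeWord(String s) {
        return s.matches("(?U)\\w{2,}");
    }

    static boolean wordPlusOtherNumber(String s) {
        return s.matches("(?U)[\\w\\p{No}]{2,}");
    }

    static boolean numbersAndLetters(String s) {
        return s.matches("[\\pN\\pL_]{2,}");
    }

    // Without braces the class is \pN plus a literal letter o.
    static boolean unbracedClassMatchesLetterO() {
        return "o".matches("[\\pNo]");
    }

    public static void main(String[] args) {
        System.out.println(plainUnicodeWord(SQUARE_METRE));
        System.out.println(wordPlusOtherNumber(SQUARE_METRE));
        System.out.println(numbersAndLetters(SQUARE_METRE));
        System.out.println(Character.getType(0x00B2) == Character.OTHER_NUMBER);
    }
}
```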
So why doesn't \w
match ²
in Java but it does in Python?
Java's interpretation
Looking at OpenJDK 8-b132's Pattern
implementation, we get the following information (I removed information irrelevant to answering the question):
Unicode support
The following Predefined Character classes and POSIX character
classes are in conformance with the recommendation of Annex C:
Compatibility Properties of Unicode Regular Expression, when
UNICODE_CHARACTER_CLASS
flag is specified.
\w  A word character: [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}]
Great! Now we have a definition for \w when the (?U) flag is used. Plugging these Unicode character classes into this amazing tool will tell you exactly what each of these Unicode character classes match. Without making this post super long, I'll just go ahead and tell you that none of the following classes matches ²:
\p{Alpha}
\p{gc=Mn}
\p{gc=Me}
\p{gc=Mc}
\p{Digit}
\p{gc=Pc}
\p{IsJoin_Control}
Python's interpretation
So why does Python match ²³¹
when the u
flag is used in conjunction with \w
? This one was very difficult to track down, but I went digging into Python's source code (I used Python 3.6.5rc1 - 2018-03-13). After removing a lot of the fluff for how this gets called, basically the following happens:
- \w is defined as CATEGORY_UNI_WORD, which is then prefixed with SRE_.
- SRE_CATEGORY_UNI_WORD calls SRE_UNI_IS_WORD(ch).
- SRE_UNI_IS_WORD is defined as (SRE_UNI_IS_ALNUM(ch) || (ch) == '_').
- SRE_UNI_IS_ALNUM calls Py_UNICODE_ISALNUM, which is, in turn, defined as (Py_UNICODE_ISALPHA(ch) || Py_UNICODE_ISDECIMAL(ch) || Py_UNICODE_ISDIGIT(ch) || Py_UNICODE_ISNUMERIC(ch)).
- The first numeric check in that chain is Py_UNICODE_ISDECIMAL(ch), defined as _PyUnicode_IsDecimalDigit(ch).
Now, let's take a look at the method _PyUnicode_IsDecimalDigit(ch)
:
int _PyUnicode_IsDecimalDigit(Py_UCS4 ch)
{
if (_PyUnicode_ToDecimalDigit(ch) < 0)
return 0;
return 1;
}
As we can see, this method returns 0 if _PyUnicode_ToDecimalDigit(ch) < 0 and 1 otherwise. So what does _PyUnicode_ToDecimalDigit look like?
int _PyUnicode_ToDecimalDigit(Py_UCS4 ch)
{
const _PyUnicode_TypeRecord *ctype = gettyperecord(ch);
return (ctype->flags & DECIMAL_MASK) ? ctype->decimal : -1;
}
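Great, so basically, if the type record for the character has the DECIMAL_MASK flag set, the character's decimal value (greater than or equal to 0) is returned; otherwise -1 is returned.
Note that the mask (DECIMAL_MASK is 0x02) is tested against the flags field of the type record that gettyperecord(ch) looks up, not against the code point 0x000000b2 itself. For ², the Unicode Character Database assigns Numeric_Type=Digit rather than Decimal, so the decimal check alone does not accept it. The next check in the Py_UNICODE_ISALNUM chain, Py_UNICODE_ISDIGIT(ch), does accept it (in Python, '²'.isdecimal() is False while '²'.isdigit() is True). That is enough for ² to be deemed a valid Unicode alphanumeric character in Python, and thus \w with the u flag matches ².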
Is there a Unicode equivalent for `{\pGraph}` in Java / POSIX regular expressions?
[^\p{Z}\p{C}]
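A short comparison against Java's default ASCII-only \p{Graph} shows the difference. This is a sketch with invented class and method names.

```java
class UnicodeGraphDemo {
    // Java's POSIX-style class, ASCII-only unless UNICODE_CHARACTER_CLASS is set.
    static boolean asciiGraph(String s) {
        return s.matches("\\p{Graph}");
    }

    // Anything that is neither a separator (Z) nor an "other" character (C).
    static boolean unicodeGraph(String s) {
        return s.matches("[^\\p{Z}\\p{C}]");
    }

    public static void main(String[] args) {
        System.out.println(asciiGraph("\u00E9"));   // e-acute
        System.out.println(unicodeGraph("\u00E9"));
        System.out.println(unicodeGraph("\u00A0")); // NO-BREAK SPACE is a separator
    }
}
```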
Pattern matching for Japanese string have issues in java
The thing is that 、
(U+3001 IDEOGRAPHIC COMMA
) belongs to "Punctuation, other" Unicode category and \\p{Punct}
only matches ASCII punctuation by default. If you use a Pattern.UNICODE_CHARACTER_CLASS
option or (?U)
embedded flag option, it will match (i.e. the pattern might look like "(?U)^[\\p{L}\\d\\s\\p{Punct}]{1,200}$"
). However, this may impact \d
and \s
, and I am not sure you want to match all Unicode digits and whitespace.
An alternative is to use \p{P}\p{S}
(to match Unicode punctuation and symbols) instead of \p{Punct}
(the POSIX character class matches both punctuation and symbols).
See a Java demo printing true:
import java.util.regex.Pattern;

public class Main { // class name is arbitrary
    private static final Pattern ADDRESS_STRING_PATTERN =
        Pattern.compile("^[\\p{L}\\d\\s\\p{P}\\p{S}]{1,200}$");

    private static boolean isValidInput(final String input, Pattern pattern) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) throws java.lang.Exception {
        System.out.println(isValidInput("こんにちは、元気ですか", ADDRESS_STRING_PATTERN));
    }
}
// => true
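The trade-off between the two approaches can be seen in a small sketch (invented names, JDK 7 or later assumed): the (?U) flag fixes \p{Punct} but also widens \d, while \p{P} alone leaves \d untouched.

```java
class PunctFlagDemo {
    static final String IDEOGRAPHIC_COMMA = "\u3001";
    static final String ARABIC_INDIC_THREE = "\u0663";

    static boolean defaultPunct(String s) {
        return s.matches("\\p{Punct}");
    }

    static boolean unicodePunct(String s) {
        return s.matches("(?U)\\p{Punct}");
    }

    static boolean generalCategoryPunct(String s) {
        return s.matches("\\p{P}");
    }

    static boolean defaultDigit(String s) {
        return s.matches("\\d");
    }

    static boolean unicodeDigit(String s) {
        return s.matches("(?U)\\d");
    }

    public static void main(String[] args) {
        System.out.println(defaultPunct(IDEOGRAPHIC_COMMA));
        System.out.println(unicodePunct(IDEOGRAPHIC_COMMA));
        System.out.println(generalCategoryPunct(IDEOGRAPHIC_COMMA));
        System.out.println(unicodeDigit(ARABIC_INDIC_THREE)); // side effect of (?U)
    }
}
```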