How to Use Unicode Range in C++ Regex

How to use Unicode range in C++ regex

This should work, but you need to use std::wregex and std::wsmatch. Convert both the source string and the regular expression to wide-character Unicode (UTF-32 on Linux, UTF-16(ish) on Windows) to make it work.

This works for me where the source text is UTF-8:

#include <iostream>
#include <regex>
#include <string>

inline std::wstring from_utf8(const std::string& utf8)
{
    // code to convert from utf8 to utf32/utf16
}

inline std::string to_utf8(const std::wstring& ws)
{
    // code to convert from utf32/utf16 to utf8
}

int main()
{
    std::string test = "john.doe@神谕.com"; // utf8
    std::string expr = "[\\u0080-\\uDB7F]+"; // utf8

    // std::wregex works on wide characters, so convert both inputs first
    std::wstring wtest = from_utf8(test);
    std::wstring wexpr = from_utf8(expr);

    std::wregex we(wexpr);
    std::wsmatch wm;
    if(std::regex_search(wtest, wm, we))
    {
        std::cout << to_utf8(wm.str(0)) << '\n';
    }
}

Output:

神谕

Note: If you need a UTF conversion library I used THIS ONE in the example above.

Edit: Or, you could use the functions given in this answer:

Any good solutions for C++ string code point and code unit?
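
For readers without a conversion library at hand, here is a minimal sketch of what the from_utf8 stub above could look like. The names decode_one and from_utf8_sketch are hypothetical, not part of the original answer or of any library. It assumes reasonably well-formed UTF-8, does only basic validation (it does not reject overlong forms or encoded surrogates), and emits surrogate pairs when wchar_t is 16 bits wide.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode one UTF-8 sequence starting at s[i] and advance i.
inline char32_t decode_one(const std::string& s, std::size_t& i)
{
    auto byte_at = [&](std::size_t k) -> std::uint32_t {
        if (k >= s.size()) throw std::runtime_error("truncated UTF-8");
        return static_cast<unsigned char>(s[k]);
    };
    const std::uint32_t b0 = byte_at(i);
    std::uint32_t cp = 0;
    std::size_t extra = 0;  // number of continuation bytes that follow
    if (b0 < 0x80)                { cp = b0;        extra = 0; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
    else throw std::runtime_error("invalid UTF-8 lead byte");
    for (std::size_t k = 1; k <= extra; ++k) {
        const std::uint32_t b = byte_at(i + k);
        if ((b & 0xC0) != 0x80) throw std::runtime_error("invalid continuation byte");
        cp = (cp << 6) | (b & 0x3F);
    }
    i += extra + 1;
    return static_cast<char32_t>(cp);
}

// Hypothetical stand-in for from_utf8: UTF-8 to UTF-32 or UTF-16 in a std::wstring,
// depending on the width of wchar_t on the platform.
inline std::wstring from_utf8_sketch(const std::string& utf8)
{
    std::wstring out;
    for (std::size_t i = 0; i < utf8.size();) {
        std::uint32_t cp = decode_one(utf8, i);
        if (sizeof(wchar_t) >= 4 || cp < 0x10000) {
            out.push_back(static_cast<wchar_t>(cp));
        } else {
            // 16-bit wchar_t: split into a high and a low surrogate
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The reverse direction (wide string back to UTF-8) follows the same bit layout in the other direction and is left out here to keep the sketch short.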

Do C++11 regular expressions work with UTF-8 strings?

You would need to test your compiler and the system you are using, but in theory, it will be supported if your system has a UTF-8 locale. The following test returned true for me on Clang/OS X.

#include <locale>
#include <regex>
#include <string>

bool test_unicode()
{
    std::locale old;
    // throws std::runtime_error if the locale is not installed
    std::locale::global(std::locale("en_US.UTF-8"));

    std::regex pattern("[[:alpha:]]+", std::regex_constants::extended);
    bool result = std::regex_match(std::string("abcdéfg"), pattern);

    std::locale::global(old);

    return result;
}

NOTE: This was compiled in a file that was UTF-8 encoded.


Just to be safe, I also used a string with explicit hex escapes. It also worked.

bool test_unicode2()
{
    std::locale old;
    std::locale::global(std::locale("en_US.UTF-8"));

    std::regex pattern("[[:alpha:]]+", std::regex_constants::extended);
    bool result = std::regex_match(std::string("abcd\xC3\xA9""fg"), pattern);

    std::locale::global(old);

    return result;
}

Update: test_unicode() still works for me.

$ file regex-test.cpp 
regex-test.cpp: UTF-8 Unicode c program text

$ g++ --version
Configured with: --prefix=/Applications/Xcode-8.2.1.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin15.6.0
Thread model: posix
InstalledDir: /Applications/Xcode-8.2.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Range of UTF-8 Characters in C++11 Regex

Encoded in UTF-8, the string "[一-龠々〆ヵヶ]" is equal to this one: "[\xe4\xb8\x80-\xe9\xbe\xa0\xe3\x80\x85\xe3\x80\x86\xe3\x83\xb5\xe3\x83\xb6]". And this is not the droid character class you are looking for.

The character class you are looking for is the one that includes:

  • any character in the range U+4E00..U+9FA0; or
  • any of the characters 々, 〆, ヵ, ヶ.

The character class you specified is the one that includes:

  • any of the "characters" \xe4 or \xb8; or
  • any "character" in the range \x80..\xe9; or
  • any of the "characters" \xbe, \xa0, \xe3, \x80, \x85, \xe3 (again), \x80 (again), \x86, \xe3 (again), \x83, \xb5, \xe3 (again), \x83 (again), \xb6.

Messy isn't it? Do you see the problem?

This will not match "latin" characters (which I assume you mean things like a-z) because in UTF-8 those all use a single byte below 0x80, and none of those is in that messy character class.

It will not match "中" either because "中" has three "characters", and your regex matches only one "character" out of that weird long list. Try assert(std::regex_match("中", std::regex("..."))) and you will see.

If you add a + it works because "中" has three of those "characters" in your weird long list, and now your regex matches one or more.

If you instead add {1} it does not match because we are back to matching three "characters" against one.

Incidentally "中" matches "中" because we are matching the three "characters" against the same three "characters" in the same order.

Note that the regex with + will actually match some undesired things because it does not care about order. Any character that can be made from that list of bytes in UTF-8 will match. It will match "\xe3\x81\x81" (ぁ U+3041) and it will even match invalid UTF-8 input like "\xe3\xe3\xe3\xe3".
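
To make the byte-level behaviour concrete, here is a small sketch that emulates what a byte-oriented engine does with that class followed by +. It does not call std::regex at all; the function names are hypothetical. It relies on the observation that every individually listed byte already falls inside the range \x80..\xe9, so the whole class collapses to that one range.

```cpp
#include <string>

// The byte-level class from the question collapses to the range 0x80..0xE9,
// because every individually listed byte already lies inside that range.
bool in_byte_class(unsigned char b)
{
    return b >= 0x80 && b <= 0xE9;
}

// Emulates "[...]+" matched against the whole input, one byte at a time.
bool matches_bytewise_plus(const std::string& s)
{
    if (s.empty()) return false;
    for (char c : s) {
        if (!in_byte_class(static_cast<unsigned char>(c))) return false;
    }
    return true;
}
```

Under this emulation the three bytes of "中" pass, but so do "\xe3\x81\x81" and the invalid sequence "\xe3\xe3\xe3\xe3", while plain ASCII such as "abc" does not.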

The bigger problem is that you are using a regex library that does not even have level 1 support for Unicode, the bare minimum required. It munges bytes and there isn't much your precious tiny regex can do about it.

And the even bigger problem is that you are using a hardcoded set of characters to specify "any Japanese Kanji or Chinese character". Why not use the Unicode Script property for that?

R"(\p{Script=Han})"

Oh right, this won't work with C++11 regexes. For a moment there I almost forgot those are annoyingly worse than useless with Unicode.

So what should you do?

You could decode your input into a std::u32string and use char32_t all over for the matching. That would not give you this mess, but you would still be hardcoding ranges and exceptions when you mean "a set of characters that share a certain property".
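
A rough sketch of that decode-first idea, without any regex engine: decode the UTF-8 input into a std::u32string and test each code point against the intended range and the four extra characters. The function names are hypothetical, and the decoder assumes well-formed UTF-8 and performs no validation.

```cpp
#include <cstddef>
#include <string>

// Minimal UTF-8 decoder: assumes well-formed input, performs no validation.
std::u32string decode_utf8(const std::string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        // Sequence length from the lead byte: 0xxxxxxx, 110xxxxx, 1110xxxx, else 4
        const std::size_t len = b0 < 0x80 ? 1 : (b0 >> 5) == 0x6 ? 2 : (b0 >> 4) == 0xE ? 3 : 4;
        // Keep only the payload bits of the lead byte
        char32_t cp = len == 1 ? b0 : b0 & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// The class the question meant: U+4E00..U+9FA0 plus 々, 〆, ヵ, ヶ.
bool is_wanted(char32_t cp)
{
    if (cp >= 0x4E00 && cp <= 0x9FA0) return true;
    return cp == 0x3005 || cp == 0x3006 || cp == 0x30F5 || cp == 0x30F6;
}

// True if the input is non-empty and every code point is in the class.
bool all_wanted(const std::string& utf8)
{
    const std::u32string cps = decode_utf8(utf8);
    if (cps.empty()) return false;
    for (char32_t cp : cps) {
        if (!is_wanted(cp)) return false;
    }
    return true;
}
```

Working on code points means "中" is accepted while ぁ (U+3041) is rejected, which the byte-level class could not do. The ranges are still hardcoded, as the paragraph above points out.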

I recommend you forget about C++11 regexes and use some regular expression library that has the bare minimum level 1 Unicode support, like the one in ICU.

Unicode character range not being consumed by Regex

While the other contributors to this question provided some clues, I needed an answer. My test is a rules engine that is driven by a regex that is built up from file input, so hard coding the logic into C# is not an option.

However, I did learn here that

  1. the .NET Regex class does not support surrogate pairs and
  2. you can fake support for surrogate pair ranges by using regex alternation

But of course, in my data-driven case I can't manually change the regexes to a format that .NET will accept - I need to automate it. So, I created the below Utf32Regex class that accepts UTF32 characters directly in the constructor and internally converts them to regexes that .NET understands.

For example, it will convert the regex

"[abc\\U00011DEF-\\U00013E07]"

To

"(?:[abc]|\\uD807[\\uDDEF-\\uDFFF]|[\\uD808-\\uD80E][\\uDC00-\\uDFFF]|\\uD80F[\\uDC00-\\uDE07])"

Or

"([\\u0000-\\u0009\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F\\u00AD" +
"\\u061C\\u180E\\u200B\\u200E\\u200F\\u2028-\\u202E\\u2060-\\u206F\\uD800-" +
"\\uDFFF\\uFEFF\\uFFF0-\\uFFFB\\U0001BCA0-\\U0001BCA3\\U0001D173-" +
"\\U0001D17A\\U000E0000-\\U000E001F\\U000E0080-\\U000E00FF\\U000E01F0-\\U000E0FFF] " +
"| [\\u000D] | [\\u000A]) ()"

To

"((?:[\\u0000-\\u0009\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F\\u00AD\\u061C\\u180E" + 
"\\u200B\\u200E\\u200F\\u2028-\\u202E\\u2060-\\u206F\\uD800-\\uDFFF\\uFEFF\\uFFF0-\\uFFFB]|" +
"\\uD82F[\\uDCA0-\\uDCA3]|\\uD834[\\uDD73-\\uDD7A]|\\uDB40[\\uDC00-\\uDC1F]|" +
"\\uDB40[\\uDC80-\\uDCFF]|\\uDB40[\\uDDF0-\\uDFFF]|[\\uDB41-\\uDB42][\\uDC00-\\uDFFF]|" +
"\\uDB43[\\uDC00-\\uDFFF]) | [\\u000D] | [\\u000A]) ()"

Utf32Regex.cs

using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

/// <summary>
/// Patches the <see cref="Regex"/> class so it will automatically convert and interpret
/// UTF32 characters expressed like <c>\U00010000</c> or UTF32 ranges expressed
/// like <c>\U00010000-\U00010001</c>.
/// </summary>
public class Utf32Regex : Regex
{
    private const char MinLowSurrogate = '\uDC00';
    private const char MaxLowSurrogate = '\uDFFF';

    private const char MinHighSurrogate = '\uD800';
    private const char MaxHighSurrogate = '\uDBFF';

    // Match any character class such as [A-z]
    private static readonly Regex characterClass = new Regex(
        "(?<!\\\\)(\\[.*?(?<!\\\\)\\])",
        RegexOptions.Compiled);

    // Match a UTF32 range such as \U000E01F0-\U000E0FFF
    // or an individual character such as \U000E0FFF
    private static readonly Regex utf32Range = new Regex(
        "(?<begin>\\\\U(?:00)?[0-9A-Fa-f]{6})-(?<end>\\\\U(?:00)?[0-9A-Fa-f]{6})|(?<begin>\\\\U(?:00)?[0-9A-Fa-f]{6})",
        RegexOptions.Compiled);

    public Utf32Regex()
        : base()
    {
    }

    public Utf32Regex(string pattern)
        : base(ConvertUTF32Characters(pattern))
    {
    }

    public Utf32Regex(string pattern, RegexOptions options)
        : base(ConvertUTF32Characters(pattern), options)
    {
    }

    public Utf32Regex(string pattern, RegexOptions options, TimeSpan matchTimeout)
        : base(ConvertUTF32Characters(pattern), options, matchTimeout)
    {
    }

    private static string ConvertUTF32Characters(string regexString)
    {
        StringBuilder result = new StringBuilder();
        // Convert any UTF32 character ranges \U00000000-\U00FFFFFF to their
        // equivalent UTF16 characters
        ConvertUTF32CharacterClassesToUTF16Characters(regexString, result);
        // Now find all of the individual characters that were not in ranges and
        // fix those as well.
        ConvertUTF32CharactersToUTF16(result);

        return result.ToString();
    }

    private static void ConvertUTF32CharacterClassesToUTF16Characters(string regexString, StringBuilder result)
    {
        Match match = characterClass.Match(regexString); // Reset
        int lastEnd = 0;
        if (match.Success)
        {
            do
            {
                string characterClass = match.Groups[1].Value;
                string convertedCharacterClass = ConvertUTF32CharacterRangesToUTF16Characters(characterClass);

                result.Append(regexString.Substring(lastEnd, match.Index - lastEnd)); // Remove the match
                result.Append(convertedCharacterClass); // Append replacement

                lastEnd = match.Index + match.Length;
            } while ((match = match.NextMatch()).Success);
        }
        result.Append(regexString.Substring(lastEnd)); // Append tail
    }

    private static string ConvertUTF32CharacterRangesToUTF16Characters(string characterClass)
    {
        StringBuilder result = new StringBuilder();
        StringBuilder chars = new StringBuilder();

        Match match = utf32Range.Match(characterClass); // Reset
        int lastEnd = 0;
        if (match.Success)
        {
            do
            {
                string utf16Chars;
                string rangeBegin = match.Groups["begin"].Value.Substring(2);

                if (!string.IsNullOrEmpty(match.Groups["end"].Value))
                {
                    string rangeEnd = match.Groups["end"].Value.Substring(2);
                    utf16Chars = UTF32RangeToUTF16Chars(rangeBegin, rangeEnd);
                }
                else
                {
                    utf16Chars = UTF32ToUTF16Chars(rangeBegin);
                }

                result.Append(characterClass.Substring(lastEnd, match.Index - lastEnd)); // Remove the match
                chars.Append(utf16Chars); // Append replacement

                lastEnd = match.Index + match.Length;
            } while ((match = match.NextMatch()).Success);
        }
        result.Append(characterClass.Substring(lastEnd)); // Append tail of character class

        // Special case - if we have removed all of the contents of the
        // character class, we need to remove the square brackets and the
        // alternation character |
        int emptyCharClass = result.IndexOf("[]");
        if (emptyCharClass >= 0)
        {
            result.Remove(emptyCharClass, 2);
            // Append replacement ranges (exclude beginning |)
            result.Append(chars.ToString(1, chars.Length - 1));
        }
        else
        {
            // Append replacement ranges
            result.Append(chars.ToString());
        }

        if (chars.Length > 0)
        {
            // Wrap both the character class and any UTF16 character alternation into
            // a non-capturing group.
            return "(?:" + result.ToString() + ")";
        }
        return result.ToString();
    }

    private static void ConvertUTF32CharactersToUTF16(StringBuilder result)
    {
        while (true)
        {
            int where = result.IndexOf("\\U00");
            if (where < 0)
            {
                break;
            }
            string cp = UTF32ToUTF16Chars(result.ToString(where + 2, 8));
            result.Replace(where, where + 10, cp);
        }
    }

    private static string UTF32RangeToUTF16Chars(string hexBegin, string hexEnd)
    {
        var result = new StringBuilder();
        int beginCodePoint = int.Parse(hexBegin, NumberStyles.HexNumber);
        int endCodePoint = int.Parse(hexEnd, NumberStyles.HexNumber);

        var beginChars = char.ConvertFromUtf32(beginCodePoint);
        var endChars = char.ConvertFromUtf32(endCodePoint);
        int beginDiff = endChars[0] - beginChars[0];

        if (beginDiff == 0)
        {
            // If the begin character is the same, we can just use the syntax \uD807[\uDDEF-\uDFFF]
            result.Append("|");
            AppendUTF16Character(result, beginChars[0]);
            result.Append('[');
            AppendUTF16Character(result, beginChars[1]);
            result.Append('-');
            AppendUTF16Character(result, endChars[1]);
            result.Append(']');
        }
        else
        {
            // If the begin character is not the same, create 3 ranges
            // 1. The remainder of the first
            // 2. A range of all of the middle characters
            // 3. The beginning of the last

            result.Append("|");
            AppendUTF16Character(result, beginChars[0]);
            result.Append('[');
            AppendUTF16Character(result, beginChars[1]);
            result.Append('-');
            AppendUTF16Character(result, MaxLowSurrogate);
            result.Append(']');

            // We only need a middle range if the ranges are not adjacent
            if (beginDiff > 1)
            {
                result.Append("|");
                // We only need a character class if there are more than 1
                // characters in the middle range
                if (beginDiff > 2)
                {
                    result.Append('[');
                }
                AppendUTF16Character(result, (char)(Math.Min(beginChars[0] + 1, MaxHighSurrogate)));
                if (beginDiff > 2)
                {
                    result.Append('-');
                    AppendUTF16Character(result, (char)(Math.Max(endChars[0] - 1, MinHighSurrogate)));
                    result.Append(']');
                }
                result.Append('[');
                AppendUTF16Character(result, MinLowSurrogate);
                result.Append('-');
                AppendUTF16Character(result, MaxLowSurrogate);
                result.Append(']');
            }

            result.Append("|");
            AppendUTF16Character(result, endChars[0]);
            result.Append('[');
            AppendUTF16Character(result, MinLowSurrogate);
            result.Append('-');
            AppendUTF16Character(result, endChars[1]);
            result.Append(']');
        }
        return result.ToString();
    }

    private static string UTF32ToUTF16Chars(string hex)
    {
        int codePoint = int.Parse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return UTF32ToUTF16Chars(codePoint);
    }

    private static string UTF32ToUTF16Chars(int codePoint)
    {
        StringBuilder result = new StringBuilder();
        UTF32ToUTF16Chars(codePoint, result);
        return result.ToString();
    }

    private static void UTF32ToUTF16Chars(int codePoint, StringBuilder result)
    {
        // Use regex alternation over the entire range of UTF32 code points
        // to ensure each one is treated as a group.
        result.Append("|");
        AppendUTF16CodePoint(result, codePoint);
    }

    private static void AppendUTF16CodePoint(StringBuilder text, int cp)
    {
        var chars = char.ConvertFromUtf32(cp);
        AppendUTF16Character(text, chars[0]);
        if (chars.Length == 2)
        {
            AppendUTF16Character(text, chars[1]);
        }
    }

    private static void AppendUTF16Character(StringBuilder text, char c)
    {
        text.Append(@"\u");
        // \u escapes need exactly four hex digits, so pad values below 0x1000
        text.Append(((int)c).ToString("X4"));
    }
}

StringBuilderExtensions.cs

using System;
using System.Text;

public static class StringBuilderExtensions
{
    /// <summary>
    /// Searches for the first index of the specified string. The search for
    /// the string starts at the beginning and moves towards the end.
    /// </summary>
    /// <param name="text">This <see cref="StringBuilder"/>.</param>
    /// <param name="value">The string to find.</param>
    /// <returns>The index of the specified string, or -1 if the string isn't found.</returns>
    public static int IndexOf(this StringBuilder text, string value)
    {
        return IndexOf(text, value, 0);
    }

    /// <summary>
    /// Searches for the index of the specified string. The search for the
    /// string starts at the specified offset and moves towards the end.
    /// </summary>
    /// <param name="text">This <see cref="StringBuilder"/>.</param>
    /// <param name="value">The string to find.</param>
    /// <param name="startIndex">The starting offset.</param>
    /// <returns>The index of the specified string, or -1 if the string isn't found.</returns>
    public static int IndexOf(this StringBuilder text, string value, int startIndex)
    {
        if (text == null)
            throw new ArgumentNullException("text");
        if (value == null)
            throw new ArgumentNullException("value");

        int index;
        int length = value.Length;
        int maxSearchLength = (text.Length - length) + 1;

        for (int i = startIndex; i < maxSearchLength; ++i)
        {
            if (text[i] == value[0])
            {
                index = 1;
                while ((index < length) && (text[i + index] == value[index]))
                    ++index;

                if (index == length)
                    return i;
            }
        }

        return -1;
    }

    /// <summary>
    /// Replaces the specified subsequence in this builder with the specified
    /// string.
    /// </summary>
    /// <param name="text">this builder.</param>
    /// <param name="start">the inclusive begin index.</param>
    /// <param name="end">the exclusive end index.</param>
    /// <param name="str">the replacement string.</param>
    /// <returns>this builder.</returns>
    /// <exception cref="IndexOutOfRangeException">
    /// if <paramref name="start"/> is negative, greater than the current
    /// <see cref="StringBuilder.Length"/> or greater than <paramref name="end"/>.
    /// </exception>
    /// <exception cref="ArgumentNullException">if <paramref name="str"/> is <c>null</c>.</exception>
    public static StringBuilder Replace(this StringBuilder text, int start, int end, string str)
    {
        if (str == null)
        {
            throw new ArgumentNullException(nameof(str));
        }
        if (start >= 0)
        {
            if (end > text.Length)
            {
                end = text.Length;
            }
            if (end > start)
            {
                int stringLength = str.Length;
                int diff = end - start - stringLength;
                if (diff > 0)
                {
                    // replacing with fewer characters
                    text.Remove(start, diff);
                }
                else if (diff < 0)
                {
                    // replacing with more characters...need some room
                    text.Insert(start, new char[-diff]);
                }
                // copy the chars based on the new length
                for (int i = 0; i < stringLength; i++)
                {
                    text[i + start] = str[i];
                }
                return text;
            }
            if (start == end)
            {
                text.Insert(start, str);
                return text;
            }
        }
        throw new IndexOutOfRangeException();
    }
}

Do note this is not very well tested and probably not very robust, but for testing purposes it should be fine.
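
The core of the range splitting that UTF32RangeToUTF16Chars performs can also be sketched independently of .NET. The C++ below is a hypothetical re-statement of that idea, not a port of the class: the names are made up for illustration, and it assumes both end points are supplementary code points (U+10000..U+10FFFF) with begin <= end. It produces the alternatives without the leading | that the C# helper emits.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

struct SurrogatePair { std::uint16_t high, low; };

// Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 surrogates.
SurrogatePair to_surrogates(char32_t cp)
{
    const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000;
    return { static_cast<std::uint16_t>(0xD800 + (v >> 10)),
             static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)) };
}

// Format one UTF-16 code unit as a \uXXXX escape.
std::string esc(std::uint16_t unit)
{
    char buf[8];
    std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(unit));
    return buf;
}

std::string unit_class(std::uint16_t from, std::uint16_t to)
{
    return "[" + esc(from) + "-" + esc(to) + "]";
}

// Build a surrogate-pair alternation covering begin..end (both supplementary).
std::string surrogate_range_pattern(char32_t begin, char32_t end)
{
    const SurrogatePair b = to_surrogates(begin);
    const SurrogatePair e = to_surrogates(end);
    if (b.high == e.high) {
        // Same high surrogate: one alternative is enough
        return esc(b.high) + unit_class(b.low, e.low);
    }
    // 1. Remainder of the first high surrogate's block
    std::string out = esc(b.high) + unit_class(b.low, 0xDFFF);
    // 2. Whole blocks in between, if the high surrogates are not adjacent
    if (e.high - b.high > 1) {
        const std::uint16_t first = static_cast<std::uint16_t>(b.high + 1);
        const std::uint16_t last = static_cast<std::uint16_t>(e.high - 1);
        out += "|";
        out += (first == last) ? esc(first) : unit_class(first, last);
        out += unit_class(0xDC00, 0xDFFF);
    }
    // 3. Beginning of the last high surrogate's block
    out += "|" + esc(e.high) + unit_class(0xDC00, e.low);
    return out;
}
```

For the range U+11DEF..U+13E07 from the example earlier in this answer, this yields the same three alternatives shown there.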

Regular expressions for a range of unicode points PHP

You can use:

$foo = preg_replace('/[^\w$\x{0080}-\x{FFFF}]+/u', '', $foo);

  • \w is equivalent to [a-zA-Z0-9_]
  • \x{0080}-\x{FFFF} matches characters between code points U+0080 and U+FFFF
  • /u enables Unicode support in the regex

How do I specify a range of unicode characters in a regular-expression in python?

Try:

[\u00D8-\u00F6]

GNU find: regex to find a range of Unicode codepoints

You can pass a unicode-encoded regular expression. If you use bash,

$ find . -regex $'.*[\u0300-\u036f].*'
./foo/foòbar
./foo/asd͊fgh

The $'string' syntax converts the string like a C compiler would. If you do not use bash, your shell probably won't support this kind of string literal. You could then resort to something like

$ find . -regex "$(echo -e '.*[\u0300-\u036f].*')"

The normal findutils-default regex type supports this, and in my tests with findutils 4.7.0, so did all the others.

Does `std::wregex` support utf-16/unicode or only UCS-2?

The C++ standard doesn't enforce any encoding on std::string and std::wstring. They're simply sequences of CharT. Only std::u8string, std::u16string and std::u32string have a defined encoding.

  • What encoding does std::string.c_str() use?
  • Does std::string in c++ has encoding format

Similarly, std::regex and std::wregex also wrap around std::basic_string and CharT. Their constructors accept std::basic_string, and whatever encoding is used for the std::basic_string will also be used for the std::basic_regex. So the statement "std::regex isn't utf-8 or does not support it" is wrong. If the current locale is UTF-8, then std::regex and std::string will be UTF-8 (yes, modern Windows does support a UTF-8 locale).

On Windows, std::wstring uses UTF-16, so std::wregex also uses UTF-16. UCS-2 is deprecated and no one uses it anymore. You don't even need to differentiate between them, since UCS-2 is just a subset of UTF-16, unless you use some very old tool that cuts in the middle of a surrogate pair. String searches in UTF-16 work exactly the same as in UCS-2 because UTF-16 is self-synchronizing, so a proper needle string can never match starting from the middle of a character in the haystack. The same applies to UTF-8. If a tool doesn't understand UTF-16, then it's highly likely that it doesn't know that UTF-8 is variable-length either, and it will truncate UTF-8 in the middle of a character.

Self-synchronization: The leading bytes and the continuation bytes do not share values (continuation bytes start with 10 while single bytes start with 0 and longer lead bytes start with 11). This means a search will not accidentally find the sequence for one character starting in the middle of another character. It also means the start of a character can be found from a random position by backing up at most 3 bytes to find the leading byte. An incorrect character will not be decoded if a stream starts mid-sequence, and a shorter sequence will never appear inside a longer one.

https://en.wikipedia.org/wiki/UTF-8#Description
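
The "backing up to find the leading byte" part of that description can be sketched in a few lines. The function names are hypothetical, and the sketch assumes well-formed UTF-8 and a position inside the string.

```cpp
#include <cstddef>
#include <string>

// Continuation bytes have the bit pattern 10xxxxxx.
bool is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

// Returns the index of the lead byte of the character that contains s[pos].
// Assumes s is well-formed UTF-8 and pos < s.size().
std::size_t char_start(const std::string& s, std::size_t pos)
{
    while (pos > 0 && is_continuation(static_cast<unsigned char>(s[pos]))) {
        --pos;
    }
    return pos;
}
```

For a string made of "a", the three bytes of "中", and "z", any position inside the three-byte sequence resolves to index 1, the position of its lead byte.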

The only things you need to care about are: avoid truncating in the middle of a character, and normalize the string before matching if necessary. The former issue can be avoided in UCS-2-only regex engines if you never use characters outside the BMP in a character class, as noted in the comments. Replace them with a group instead.

In other languages normalization takes place.

This is wrong. Some languages may do normalization before matching a regex, but that definitely doesn't apply to all "other languages".

If you want a little more assurance, you could use std::basic_regex<char8_t> and std::basic_regex<char16_t> for UTF-8 and UTF-16 respectively. Note that the standard library only provides std::regex_traits specializations for char and wchar_t, so these need a custom traits class. You'll still need a UTF-16-aware library though, otherwise that'll still only work for regex strings that only contain words.

The better solution may be changing to another library like ICU regex. You can check Comparison of regular expression engines for some suggestions. It even has a column indicating native UTF-16 support for each library

Related:

  • Do C++11 regular expressions work with UTF-8 strings?
  • How well is Unicode supported in C++11?
  • How do I properly use std::string on UTF-8 in C++?
  • How to use Unicode range in C++ regex

See also

  • Unicode Regular Expressions
  • Unicode Support in the Standard Library

Unicode characters in Regex

Just for reference you don't need to escape the above ',. in your character class [], and you can avoid having to escape the dash - by placing it at the beginning or end of your character class.

You can use \p{L} which matches any kind of letter from any language. See the example below:

string[] names = { "Brendán", "Jóhn", "Jason" };
Regex rgx = new Regex(@"^\p{L}+$");
foreach (string name in names)
    Console.WriteLine("{0} {1} a valid name.", name, rgx.IsMatch(name) ? "is" : "is not");

// Brendán is a valid name.
// Jóhn is a valid name.
// Jason is a valid name.

Or simply add the characters you want to include to your character class [].

@"^[a-zA-Z0-9áéíóú@#%&',.\s-]+$"

