How Well Is Unicode Supported in C++11

How well is Unicode supported in C++11?

How well does the C++ standard library support unicode?

Terribly.

A quick scan through the library facilities that might provide Unicode support gives me this list:

  • Strings library
  • Localization library
  • Input/output library
  • Regular expressions library

I think all but the first one provide terrible support. I'll get back to it in more detail after a quick detour through your other questions.

Does std::string do what it should?

Yes. According to the C++ standard, this is what std::string and its siblings should do:

The class template basic_string describes objects that can store a sequence consisting of a varying number of arbitrary char-like objects with the first element of the sequence at position zero.

Well, std::string does that just fine. Does that provide any Unicode-specific functionality? No.

Should it? Probably not. std::string is fine as a sequence of char objects. That's useful; the only annoyance is that it is a very low-level view of text and standard C++ doesn't provide a higher-level one.

How do I use it?

Use it as a sequence of char objects; pretending it is something else is bound to end in pain.

Where are potential problems?

All over the place? Let's see...

Strings library

The strings library provides us basic_string, which is merely a sequence of what the standard calls "char-like objects". I call them code units. If you want a high-level view of text, this is not what you are looking for. This is a view of text suitable for serialization/deserialization/storage.

It also provides some tools from the C library that can be used to bridge the gap between the narrow world and the Unicode world: c16rtomb/mbrtoc16 and c32rtomb/mbrtoc32.
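
For illustration, here is a minimal sketch of that bridge using mbrtoc32 from <cuchar>. It only yields Unicode code points if the current locale's multibyte encoding is UTF-8; the locale name and the sample string are my own assumptions.

#include <cuchar>
#include <clocale>
#include <cstdio>

int main() {
    // Assumes a UTF-8 locale is available under this name; mbrtoc32 converts
    // from the locale's multibyte encoding, whatever that happens to be.
    std::setlocale(LC_ALL, "en_US.UTF-8");

    const char mb[] = "z\xC3\x9F\xE6\xB0\xB4";   // "zß水" as UTF-8 bytes
    const char* p = mb;
    const char* end = mb + sizeof mb;            // include the terminating '\0'
    std::mbstate_t state{};

    char32_t c32;
    while (std::size_t rc = std::mbrtoc32(&c32, p, end - p, &state)) {
        if (rc == (std::size_t)-1 || rc == (std::size_t)-2) break;  // invalid or incomplete
        if (rc == (std::size_t)-3) continue;     // code unit emitted without consuming bytes
        std::printf("U+%04X\n", (unsigned)c32);  // U+007A, U+00DF, U+6C34
        p += rc;
    }
}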

Localization library

The localization library still believes that one of those "char-like objects" equals one "character". This is of course silly, and makes it impossible to get lots of things working properly beyond some small subset of Unicode like ASCII.

Consider, for example, what the standard calls "convenience interfaces" in the <locale> header:

template <class charT> bool isspace (charT c, const locale& loc);
template <class charT> bool isprint (charT c, const locale& loc);
template <class charT> bool iscntrl (charT c, const locale& loc);
// ...
template <class charT> charT toupper(charT c, const locale& loc);
template <class charT> charT tolower(charT c, const locale& loc);
// ...

How do you expect any of these functions to properly categorize, say, U+1F34C ʙᴀɴᴀɴᴀ, as in u8"🍌" or u8"\U0001F34C"? There's no way it will ever work, because those functions take only one code unit as input.

This could work with an appropriate locale if you used char32_t only: U'\U0001F34C' is a single code unit in UTF-32.

However, that still means you only get the simple casing transformations with toupper and tolower, which, for example, are not good enough for some German locales: "ß" uppercases to "SS"☦ but toupper can return only one code unit.

Next up, wstring_convert/wbuffer_convert and the standard code conversion facets.

wstring_convert is used to convert strings in one given encoding into strings in another given encoding. There are two string types involved in this transformation, which the standard calls a byte string and a wide string. Since these terms are really misleading, I prefer to use "serialized" and "deserialized", respectively, instead†.

The encodings to convert between are decided by a codecvt (a code conversion facet) passed as a template type argument to wstring_convert.
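
As a minimal sketch of the usage, assuming UTF-8 as the serialized encoding and UTF-32 as the deserialized one:

#include <codecvt>
#include <locale>
#include <string>

int main() {
    // Deserialized form: UTF-32 in a u32string; serialized form: UTF-8 in a std::string.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;

    std::string    utf8  = conv.to_bytes(U"\U0001F34C");  // UTF-32 -> UTF-8, 4 bytes
    std::u32string utf32 = conv.from_bytes(utf8);         // UTF-8 -> UTF-32, 1 code unit
}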

wbuffer_convert performs a similar function but as a wide deserialized stream buffer that wraps a byte serialized stream buffer. Any I/O is performed through the underlying byte serialized stream buffer with conversions to and from the encodings given by the codecvt argument. Writing serializes into that buffer, and then writes from it, and reading reads into the buffer and then deserializes from it.
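
A sketch of that wrapping, assuming UTF-8 as the serialized encoding on disk (the file name is made up):

#include <codecvt>
#include <fstream>
#include <locale>
#include <ostream>

int main() {
    std::ofstream bytes("out.txt", std::ios::binary);          // serialized (byte) stream buffer
    std::wbuffer_convert<std::codecvt_utf8<wchar_t>> conv(bytes.rdbuf());
    std::wostream wide(&conv);                                  // deserialized (wide) view of it
    wide << L"z\u00df\u6c34\n" << std::flush;                   // written to the file as UTF-8
}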

The standard provides some codecvt class templates for use with these facilities: codecvt_utf8, codecvt_utf16, codecvt_utf8_utf16, and some codecvt specializations. Together these standard facets provide all the following conversions. (Note: in the following list, the encoding on the left is always the serialized string/streambuf, and the encoding on the right is always the deserialized string/streambuf; the standard allows conversions in both directions).

  • UTF-8 ↔ UCS-2 with codecvt_utf8<char16_t>, and codecvt_utf8<wchar_t> where sizeof(wchar_t) == 2;
  • UTF-8 ↔ UTF-32 with codecvt_utf8<char32_t>, codecvt<char32_t, char, mbstate_t>, and codecvt_utf8<wchar_t> where sizeof(wchar_t) == 4;
  • UTF-16 ↔ UCS-2 with codecvt_utf16<char16_t>, and codecvt_utf16<wchar_t> where sizeof(wchar_t) == 2;
  • UTF-16 ↔ UTF-32 with codecvt_utf16<char32_t>, and codecvt_utf16<wchar_t> where sizeof(wchar_t) == 4;
  • UTF-8 ↔ UTF-16 with codecvt_utf8_utf16<char16_t>, codecvt<char16_t, char, mbstate_t>, and codecvt_utf8_utf16<wchar_t> where sizeof(wchar_t) == 2;
  • narrow ↔ wide with codecvt<wchar_t, char, mbstate_t>;
  • no-op with codecvt<char, char, mbstate_t>.

Several of these are useful, but there is a lot of awkward stuff here.

First off—holy high surrogate! that naming scheme is messy.

Then, there's a lot of UCS-2 support. UCS-2 is an encoding from Unicode 1.0 that was superseded in 1996 because it only supports the basic multilingual plane. Why the committee thought it desirable to focus on an encoding that was superseded over 20 years ago, I don't know‡. It's not like support for more encodings is bad or anything, but UCS-2 shows up too often here.

I would say that char16_t is obviously meant for storing UTF-16 code units. However, this is one part of the standard that thinks otherwise. codecvt_utf8<char16_t> has nothing to do with UTF-16. For example, wstring_convert<codecvt_utf8<char16_t>>().to_bytes(u"\U0001F34C") will compile fine, but will fail unconditionally: the input will be treated as the UCS-2 string u"\xD83C\xDF4C", which cannot be converted to UTF-8 because UTF-8 cannot encode any value in the range 0xD800-0xDFFF.
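
A small demonstration of that difference: the facet that actually speaks UTF-16 succeeds where the UCS-2 one throws (to_bytes reports failure with std::range_error when no error string was given).

#include <codecvt>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

int main() {
    std::u16string banana = u"\U0001F34C";   // two UTF-16 code units: a surrogate pair

    try {
        // Treats the input as UCS-2, so the surrogate pair is rejected.
        std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> ucs2;
        ucs2.to_bytes(banana);
    } catch (const std::range_error&) {
        std::cout << "codecvt_utf8<char16_t> failed, as described above\n";
    }

    // Treats the input as UTF-16, so the conversion succeeds.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> utf16;
    std::cout << utf16.to_bytes(banana).size() << " UTF-8 bytes\n";  // 4
}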

Still on the UCS-2 front, there is no way to read from a UTF-16 byte stream into a UTF-16 string with these facets. If you have a sequence of UTF-16 bytes you can't deserialize it into a string of char16_t. This is surprising, because it is more or less an identity conversion. Even more surprising, though, is the fact that there is support for deserializing from a UTF-16 stream into a UCS-2 string with codecvt_utf16<char16_t>, which is actually a lossy conversion.

The UTF-16-as-bytes support is quite good, though: it supports detecting endianness from a BOM, or selecting it explicitly in code. It also supports producing output with and without a BOM.
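
For example, a sketch of decoding UTF-16 bytes into UTF-32 while honoring a BOM (the helper name and the idea that the bytes came from a file are mine):

#include <codecvt>
#include <locale>
#include <string>

// utf16_bytes stands for raw bytes read from a file or socket.
std::u32string decode_utf16(const std::string& utf16_bytes) {
    // consume_header: detect and skip a BOM, using it to pick the endianness;
    // without a BOM, big-endian is assumed unless std::little_endian is also set.
    std::wstring_convert<
        std::codecvt_utf16<char32_t, 0x10FFFF, std::consume_header>,
        char32_t> conv;
    return conv.from_bytes(utf16_bytes);
}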

There are some more interesting conversion possibilities absent. There is no way to deserialize from a UTF-16 byte stream or string into a UTF-8 string, since UTF-8 is never supported as the deserialized form.

And here the narrow/wide world is completely separate from the UTF/UCS world. There are no conversions between the old-style narrow/wide encodings and any Unicode encodings.

Input/output library

The I/O library can be used to read and write text in Unicode encodings using the wstring_convert and wbuffer_convert facilities described above. I don't think there's much else that would need to be supported by this part of the standard library.

Regular expressions library

I have expounded upon problems with C++ regexes and Unicode on Stack Overflow before. I will not repeat all those points here, but merely state that C++ regexes don't have level 1 Unicode support, which is the bare minimum to make them usable without resorting to using UTF-32 everywhere.

That's it?

Yes, that's it. That's the existing functionality. There's lots of Unicode functionality that is nowhere to be seen, like normalization or text segmentation algorithms.

U+1F4A9. Is there any way to get some better Unicode support in C++?

The usual suspects: ICU and Boost.Locale.


† A byte string is, unsurprisingly, a string of bytes, i.e., char objects. However, unlike a wide string literal, which is always an array of wchar_t objects, a "wide string" in this context is not necessarily a string of wchar_t objects. In fact, the standard never explicitly defines what a "wide string" means, so we're left to guess the meaning from usage. Since the standard terminology is sloppy and confusing, I use my own, in the name of clarity.

Encodings like UTF-16 can be stored as sequences of char16_t, which then have no endianness; or they can be stored as sequences of bytes, which have endianness (each consecutive pair of bytes can represent a different char16_t value depending on endianness). The standard supports both of these forms. A sequence of char16_t is more useful for internal manipulation in the program. A sequence of bytes is the way to exchange such strings with the external world. The terms I'll use instead of "byte" and "wide" are thus "serialized" and "deserialized".

‡ If you are about to say "but Windows!" hold your horses. All versions of Windows since Windows 2000 use UTF-16.

☦ Yes, I know about the großes Eszett (ẞ), but even if you were to change all German locales overnight to have ß uppercase to ẞ, there's still plenty of other cases where this would fail. Try uppercasing U+FB00 ʟᴀᴛɪɴ sᴍᴀʟʟ ʟɪɢᴀᴛᴜʀᴇ ғғ. There is no ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟɪɢᴀᴛᴜʀᴇ ғғ; it just uppercases to two Fs. Or U+01F0 ʟᴀᴛɪɴ sᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴊ ᴡɪᴛʜ ᴄᴀʀᴏɴ; there's no precomposed capital; it just uppercases to a capital J and a combining caron.

C11 Unicode Support

That's a good question with no apparent answer.

The uchar.h types and functions added in C11 are largely useless. They only support conversions between the new type (char16_t or char32_t) and the locale-specific, implementation-defined multibyte encoding, mappings which are not going to be complete unless the locale is UTF-8 based. The useful conversions (to/from wchar_t, and to/from UTF-8) are not supported. Of course you can roll your own for conversions to/from UTF-8 since these conversions are 100% specified by the relevant RFCs/UCS/Unicode standards, but be careful: most people implement them wrong and have dangerous bugs.

Note that the new compiler-level features for UTF-8, UTF-16, and UTF-32 literals (u8, u, and U, respectively) are potentially useful; you can process the resulting strings with your own functions in meaningful ways that don't depend at all on locale. But the library-level support for Unicode in C11 is, in my opinion, basically useless.
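
For instance, the prefixes pin down the encoding of the resulting arrays at compile time, regardless of locale (array sizes include the terminator; in C++20 the element type of u8 literals changes to char8_t):

const char     u8s[]  = u8"\U0001F34C";  // 4 UTF-8 code units plus '\0'
const char16_t u16s[] = u"\U0001F34C";   // a surrogate pair plus u'\0'
const char32_t u32s[] = U"\U0001F34C";   // one code unit plus U'\0'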

Unicode in C++11

Is the above analysis correct?

Let's see.

you can't validate an array of bytes as containing valid UTF-8

Incorrect. std::codecvt_utf8<char32_t>::length(state, start, end, max_length) returns the number of valid bytes in the array.

you can't find out the length

Partially correct. One can convert to char32_t and find out the length of the result. There is no easy way to find out the length without doing the actual conversion (but see below). I must say that the need to count characters (in any sense) arises rather infrequently.

you can't iterate over a std::string in any way other than byte-by-byte

Incorrect. std::codecvt_utf8<char32_t>::length(state, start, end, 1) gives you a possibility to iterate over UTF-8 "characters" (Unicode code points), and of course determine their number (that's not an "easy" way to count the number of characters, but it's a way).

doesn't really support UTF-16

Incorrect. One can convert to and from UTF-16 with e.g. std::codecvt_utf8_utf16<char16_t>. A result of conversion to UTF-16 is, well, UTF-16. It is not restricted to BMP.

Demo that illustrates these points.
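
Here is a minimal sketch along those lines (my own code, using the <codecvt> facilities that are deprecated since C++17):

#include <codecvt>
#include <cstdint>
#include <cwchar>
#include <iostream>
#include <locale>
#include <string>

int main() {
    const char u8str[] = "z\xC3\x9F\xE6\xB0\xB4\xF0\x9F\x8D\x8C";  // "zß水🍌" in UTF-8

    // Validation: length() stops at the first invalid byte.
    std::codecvt_utf8<char32_t> cvt;
    std::mbstate_t state{};
    std::cout << cvt.length(state, u8str, u8str + sizeof u8str - 1, SIZE_MAX)
              << " valid bytes\n";                                  // 10

    // Iteration and counting: advance one code point at a time.
    const char* p = u8str;
    const char* end = u8str + sizeof u8str - 1;
    std::size_t count = 0;
    while (p != end) {
        state = std::mbstate_t{};
        int n = cvt.length(state, p, end, 1);  // bytes in the next code point
        if (n <= 0) break;                     // invalid or incomplete sequence
        p += n;
        ++count;
    }
    std::cout << count << " code points\n";                         // 4

    // Real UTF-16, not UCS-2: the banana becomes a surrogate pair.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> u16cvt;
    std::cout << u16cvt.from_bytes(u8str).size() << " UTF-16 code units\n";  // 5
}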

If I have missed some other "you can't", please point it out and I will address it.

Important addendum. These facilities are deprecated in C++17. This probably means they will go away in some future version of C++. Use them at your own risk. All the things enumerated in the original question once again cannot (safely) be done using only the standard library.

Standard way in C11 and C++11 to convert UTF-8?

Sounds like you're looking for the std::codecvt type. See the example on that page for usage.
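
The specialization std::codecvt<char16_t, char, std::mbstate_t> handles UTF-8 ↔ UTF-16, but its destructor is protected (facets are meant to live inside a locale), so the usual trick is to derive a facet with a public destructor and hand that to wstring_convert. A minimal sketch (the struct name is mine):

#include <cwchar>
#include <locale>
#include <string>

struct utf8_utf16_cvt : std::codecvt<char16_t, char, std::mbstate_t> {
    ~utf8_utf16_cvt() = default;  // public destructor so wstring_convert can delete it
};

int main() {
    std::wstring_convert<utf8_utf16_cvt, char16_t> conv;
    std::u16string utf16 = conv.from_bytes("\xC3\x9F");  // UTF-8 "ß" -> UTF-16
    std::string    utf8  = conv.to_bytes(utf16);         // and back again
}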

Unicode Processing in C++

  • Use ICU for dealing with your data (or a similar library)
  • In your own data store, make sure everything is stored in the same encoding
  • Make sure you are always using your unicode library for mundane tasks like string length, capitalization status, etc. Never use standard library builtins like is_alpha unless that is the definition you want.
  • I can't say it enough: never iterate over the indices of a string if you care about correctness; always use your unicode library for this (see the sketch after this list).
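
As a minimal ICU sketch (assuming ICU is installed and the program is linked against icuuc/icui18n; the strings are my own examples):

#include <unicode/locid.h>
#include <unicode/unistr.h>
#include <iostream>
#include <string>

int main() {
    icu::UnicodeString s = icu::UnicodeString::fromUTF8("stra\xC3\x9F" "e");  // "straße" in UTF-8
    std::cout << s.countChar32() << " code points\n";   // 6, counted at the code-point level

    s.toUpper(icu::Locale::getGerman());                // full case mapping: "ß" becomes "SS"
    std::string upper;
    s.toUTF8String(upper);
    std::cout << upper << '\n';                         // STRASSE
}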

If a C++ compiler supports a Unicode character set, does it follow that the basic character set of the implementation is also Unicode?

Does it mean the basic character set of this implementation is also Unicode?

No, there is no such requirement, and there are very few implementations where char is large enough to hold arbitrary Unicode characters.

char is large enough to hold members of the basic character set, but what happens with characters that aren't in the basic character set depends on the implementation.

On some systems, everything might be converted to one character set such as ISO 8859-1, which has fewer than 256 characters and so fits entirely in char.

On other systems, everything might be encoded as UTF-8, meaning a single logical character potentially takes up several char values.
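
For instance, on a system where narrow strings are UTF-8 (an assumption here), a single accented letter occupies two char values:

#include <cstdio>
#include <cstring>

int main() {
    const char s[] = "\xC3\xA9";                        // "é" encoded as UTF-8
    std::printf("%zu char values\n", std::strlen(s));   // prints 2, not 1
}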

C++ Correctly read files whose Unicode characters might be larger than a byte

I'd recommend watching Unicode in C++ by James McNellis.

That will help explain what facilities C++ has and does not have when dealing with Unicode.

You will see that C++ lacks good support for easily working with UTF-8.

Since it sounds like you want to iterate over each glyph (not just code points),

I'd recommend using a third-party library to handle the intricacies.

utfcpp has worked well for me.
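
A minimal utfcpp sketch (header-only; the include path and sample string are my own assumptions), keeping in mind that it iterates code points, not glyphs/grapheme clusters:

#include "utf8.h"   // utfcpp: https://github.com/nemtrif/utfcpp
#include <cstdio>
#include <string>

int main() {
    std::string s = "z\xC3\x9F\xE6\xB0\xB4\xF0\x9F\x8D\x8C";               // "zß水🍌" in UTF-8

    std::printf("%td code points\n", utf8::distance(s.begin(), s.end()));  // 4

    for (auto it = s.begin(); it != s.end(); ) {
        auto cp = utf8::next(it, s.end());   // decodes one code point, throws on invalid UTF-8
        std::printf("U+%04X\n", (unsigned)cp);
    }
}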

Basic issue regarding full unicode in C++

You are in the gray zone of C++ Unicode. Unicode initially started as an extension of 7-bit ASCII characters, or of multi-byte characters, to plain 16-bit characters, which later became the BMP. Those 16-bit characters were adopted natively by languages like Java and systems like Windows. C and C++, being more conservative from a standards point of view, decided that wchar_t would be an implementation-dependent wide character type that could be 16 or 32 bits wide (or even more...) depending on requirements. The good side was that it was extensible; the dark side was that it was never made clear how non-BMP Unicode characters should be represented when wchar_t is only 16 bits.

UTF-16 was then created to allow a standard representation of those non-BMP characters, with the downside that they need two 16-bit code units, and that std::char_traits<wchar_t>::length would again be wrong if any of them are present in a wstring.

That's the reason why most C++ implementations chose to have wchar_t-based basic I/O correctly process only BMP Unicode characters, so that length returns a true number of characters.

The C++-ish way is to use char32_t-based strings when full Unicode support is required. In fact wstring and wchar_t (prefix L for literals) are implementation-dependent types, and since C++11 you also have char16_t and u16string (prefix u), which explicitly use UTF-16, and char32_t and u32string (prefix U) for full Unicode support through UTF-32. The problem with storing characters outside the BMP in a u16string is that you lose the property size of string == number of characters, which was a key reason for using wide characters instead of multi-byte characters.

One problem for u32string is that the I/O library still has no direct specialization for 32-bit characters, but as the converters exist, you can probably use them easily when you process files with a std::basic_fstream<char32_t> (untested, but according to the standard it should work). But you will have no standard streams for cin, cout and cerr, and will probably have to process the native form in a string or u16string, and then convert everything to u32string with the help of the standard converters from <codecvt> (available since C++11), or the hard way otherwise.
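
A sketch of that last conversion step, using the standard facets (the helper name is mine): since no facet pairs deserialized UTF-16 with deserialized UTF-32 directly, route the u16string through UTF-8 bytes.

#include <codecvt>
#include <locale>
#include <string>

std::u32string to_utf32(const std::u16string& in) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> u16_to_bytes;
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t>       bytes_to_u32;
    return bytes_to_u32.from_bytes(u16_to_bytes.to_bytes(in));
}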

The really dark side is that, as that native part currently depends on the OS, you will not be able to set up a fully portable way to process full Unicode, or at least I know of none.

Does `std::wregex` support utf-16/unicode or only UCS-2?

The C++ standard doesn't enforce any encoding on std::string and std::wstring. They're simply a series of CharT. Only std::u8string, std::u16string and std::u32string have a defined encoding:

  • What encoding does std::string.c_str() use?
  • Does std::string in c++ has encoding format

Similarly, std::regex and std::wregex also wrap around std::basic_string and CharT. Their constructors accept std::basic_string, and the encoding being used for std::basic_string will also be used for std::basic_regex. So what you said, "std::regex isn't utf-8 or does not support it", is wrong. If the current locale is UTF-8 then std::regex and std::string will be UTF-8 (yes, modern Windows does support the UTF-8 locale).

On Windows std::wstring uses UTF-16, so std::wregex also uses UTF-16. UCS-2 is deprecated and no one uses it anymore. You don't even need to differentiate between them, since UCS-2 is just a subset of UTF-16, unless you use some very old tool that cuts in the middle of a surrogate pair. String searches in UTF-16 work exactly the same as in UCS-2, because UTF-16 is self-synchronizing and a proper needle string can never match from the middle of a haystack. The same goes for UTF-8. If a tool doesn't understand UTF-16 then it's highly likely that it doesn't know that UTF-8 is variable-length either, and will truncate the UTF-8 in the middle.

Self-synchronization: The leading bytes and the continuation bytes do not share values (continuation bytes start with 10 while single bytes start with 0 and longer lead bytes start with 11). This means a search will not accidentally find the sequence for one character starting in the middle of another character. It also means the start of a character can be found from a random position by backing up at most 3 bytes to find the leading byte. An incorrect character will not be decoded if a stream starts mid-sequence, and a shorter sequence will never appear inside a longer one.

https://en.wikipedia.org/wiki/UTF-8#Description

The only things you need to care about are: avoid truncating in the middle of a character, and normalize the string before matching if necessary. The former issue can be avoided in UCS-2-only regex engines if you never use characters outside the BMP in a character class, as noted in the comments. Replace them with a group instead, as in the sketch below.
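
A sketch of that workaround with std::wregex (the strings are my own; the emoji are non-BMP, so on 16-bit wchar_t platforms each one is a surrogate pair):

#include <regex>
#include <string>

int main() {
    // A character class like [🍌🍎] would be read code unit by code unit;
    // an alternation group keeps each surrogate pair together.
    std::wregex fruit(L"(?:\U0001F34C|\U0001F34E)");   // banana or red apple
    std::wstring haystack = L"I ate a \U0001F34C";
    return std::regex_search(haystack, fruit) ? 0 : 1; // matches with 16- or 32-bit wchar_t
}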

In other languages normalization takes place.

This is wrong. Some languages may do normalization before matching a regex, but that definitely doesn't apply to all "other languages".

If you want a little bit more assurance then use std::basic_regex<char8_t> and std::basic_regex<char16_t> for UTF-8 and UTF-16 respectively. You'll still need a UTF-16-aware library though, otherwise that'll still only work for regex strings that only contain words

The better solution may be changing to another library, like ICU regex. You can check Comparison of regular expression engines for some suggestions. It even has a column indicating native UTF-16 support for each library.

Related:

  • Do C++11 regular expressions work with UTF-8 strings?
  • How well is Unicode supported in C++11?
  • How do I properly use std::string on UTF-8 in C++?
  • How to use Unicode range in C++ regex

See also

  • Unicode Regular Expressions
  • Unicode Support in the Standard Library

