How well is Unicode supported in C++11?
How well does the C++ standard library support Unicode?
Terribly.
A quick scan through the library facilities that might provide Unicode support gives me this list:
- Strings library
- Localization library
- Input/output library
- Regular expressions library
I think all but the first one provide terrible support. I'll get back to it in more detail after a quick detour through your other questions.
Does `std::string` do what it should?

Yes. According to the C++ standard, this is what `std::string` and its siblings should do:
> The class template `basic_string` describes objects that can store a sequence consisting of a varying number of arbitrary char-like objects with the first element of the sequence at position zero.
Well, `std::string` does that just fine. Does that provide any Unicode-specific functionality? No.

Should it? Probably not. `std::string` is fine as a sequence of `char` objects. That's useful; the only annoyance is that it is a very low-level view of text and standard C++ doesn't provide a higher-level one.
How do I use it?
Use it as a sequence of `char` objects; pretending it is something else is bound to end in pain.
Where are potential problems?
All over the place? Let's see...
Strings library
The strings library provides us `basic_string`, which is merely a sequence of what the standard calls "char-like objects". I call them code units. If you want a high-level view of text, this is not what you are looking for. This is a view of text suitable for serialization/deserialization/storage.

It also provides some tools from the C library that can be used to bridge the gap between the narrow world and the Unicode world: `c16rtomb`/`mbrtoc16` and `c32rtomb`/`mbrtoc32`.
Localization library
The localization library still believes that one of those "char-like objects" equals one "character". This is of course silly, and makes it impossible to get lots of things working properly beyond some small subset of Unicode like ASCII.
Consider, for example, what the standard calls "convenience interfaces" in the `<locale>` header:

```cpp
template <class charT> bool isspace (charT c, const locale& loc);
template <class charT> bool isprint (charT c, const locale& loc);
template <class charT> bool iscntrl (charT c, const locale& loc);
// ...
template <class charT> charT toupper(charT c, const locale& loc);
template <class charT> charT tolower(charT c, const locale& loc);
// ...
```
How do you expect any of these functions to properly categorize, say, U+1F34C ʙᴀɴᴀɴᴀ, as in `u8"\U0001F34C"`? There's no way it will ever work, because those functions take only one code unit as input.
This could work with an appropriate locale if you used `char32_t` only: `U'\U0001F34C'` is a single code unit in UTF-32.

However, that still means you only get the simple casing transformations with `toupper` and `tolower`, which, for example, are not good enough for some German locales: "ß" uppercases to "SS"☦ but `toupper` can only return one code unit.
Next up, `wstring_convert`/`wbuffer_convert` and the standard code conversion facets.

`wstring_convert` is used to convert between strings in one given encoding and strings in another given encoding. There are two string types involved in this transformation, which the standard calls a byte string and a wide string. Since these terms are really misleading, I prefer to use "serialized" and "deserialized", respectively, instead†.

The encodings to convert between are decided by a codecvt (a code conversion facet) passed as a template type argument to `wstring_convert`.
`wbuffer_convert` performs a similar function, but as a wide deserialized stream buffer that wraps a byte serialized stream buffer. Any I/O is performed through the underlying byte serialized stream buffer, with conversions to and from the encodings given by the codecvt argument: writing serializes into that buffer and then writes from it, and reading reads into the buffer and then deserializes from it.
The standard provides some codecvt class templates for use with these facilities: `codecvt_utf8`, `codecvt_utf16`, `codecvt_utf8_utf16`, and some `codecvt` specializations. Together these standard facets provide all the following conversions. (Note: in the following list, the encoding on the left is always the serialized string/streambuf, and the encoding on the right is always the deserialized string/streambuf; the standard allows conversions in both directions.)

- UTF-8 ↔ UCS-2 with `codecvt_utf8<char16_t>`, and `codecvt_utf8<wchar_t>` where `sizeof(wchar_t) == 2`;
- UTF-8 ↔ UTF-32 with `codecvt_utf8<char32_t>`, `codecvt<char32_t, char, mbstate_t>`, and `codecvt_utf8<wchar_t>` where `sizeof(wchar_t) == 4`;
- UTF-16 ↔ UCS-2 with `codecvt_utf16<char16_t>`, and `codecvt_utf16<wchar_t>` where `sizeof(wchar_t) == 2`;
- UTF-16 ↔ UTF-32 with `codecvt_utf16<char32_t>`, and `codecvt_utf16<wchar_t>` where `sizeof(wchar_t) == 4`;
- UTF-8 ↔ UTF-16 with `codecvt_utf8_utf16<char16_t>`, `codecvt<char16_t, char, mbstate_t>`, and `codecvt_utf8_utf16<wchar_t>` where `sizeof(wchar_t) == 2`;
- narrow ↔ wide with `codecvt<wchar_t, char, mbstate_t>`;
- no-op with `codecvt<char, char, mbstate_t>`.
Several of these are useful, but there is a lot of awkward stuff here.
First off: holy high surrogate, that naming scheme is messy.
Then, there's a lot of UCS-2 support. UCS-2 is an encoding from Unicode 1.0 that was superseded in 1996 because it only supports the basic multilingual plane. Why the committee thought it desirable to focus on an encoding that was superseded over 20 years ago, I don't know‡. It's not like support for more encodings is bad or anything, but UCS-2 shows up too often here.
I would say that `char16_t` is obviously meant for storing UTF-16 code units. However, this is one part of the standard that thinks otherwise. `codecvt_utf8<char16_t>` has nothing to do with UTF-16. For example, `wstring_convert<codecvt_utf8<char16_t>, char16_t>().to_bytes(u"\U0001F34C")` will compile fine, but will fail unconditionally: the input will be treated as the UCS-2 string `u"\xD83C\xDF4C"`, which cannot be converted to UTF-8 because UTF-8 cannot encode any value in the range 0xD800-0xDFFF.
Still on the UCS-2 front, there is no way to read from a UTF-16 byte stream into a UTF-16 string with these facets. If you have a sequence of UTF-16 bytes you can't deserialize it into a string of `char16_t`. This is surprising, because it is more or less an identity conversion. Even more surprising, though, is the fact that there is support for deserializing from a UTF-16 stream into a UCS-2 string with `codecvt_utf16<char16_t>`, which is actually a lossy conversion.
The UTF-16-as-bytes support is quite good, though: it supports detecting endianness from a BOM, or selecting it explicitly in code. It also supports producing output with and without a BOM.
There are some more interesting conversion possibilities absent. There is no way to deserialize from a UTF-16 byte stream or string into a UTF-8 string, since UTF-8 is never supported as the deserialized form.
And here the narrow/wide world is completely separate from the UTF/UCS world. There are no conversions between the old-style narrow/wide encodings and any Unicode encodings.
Input/output library
The I/O library can be used to read and write text in Unicode encodings using the `wstring_convert` and `wbuffer_convert` facilities described above. I don't think there's much else that would need to be supported by this part of the standard library.
Regular expressions library
I have expounded upon problems with C++ regexes and Unicode on Stack Overflow before. I will not repeat all those points here, but merely state that C++ regexes don't have level 1 Unicode support, which is the bare minimum to make them usable without resorting to using UTF-32 everywhere.
That's it?
Yes, that's it. That's the existing functionality. There's lots of Unicode functionality that is nowhere to be seen, like normalization or text segmentation algorithms.
U+1F4A9. Is there any way to get some better Unicode support in C++?
The usual suspects: ICU and Boost.Locale.
† A byte string is, unsurprisingly, a string of bytes, i.e., `char` objects. However, unlike a wide string literal, which is always an array of `wchar_t` objects, a "wide string" in this context is not necessarily a string of `wchar_t` objects. In fact, the standard never explicitly defines what a "wide string" means, so we're left to guess the meaning from usage. Since the standard terminology is sloppy and confusing, I use my own, in the name of clarity.
Encodings like UTF-16 can be stored as sequences of `char16_t`, which then have no endianness; or they can be stored as sequences of bytes, which have endianness (each consecutive pair of bytes can represent a different `char16_t` value depending on endianness). The standard supports both of these forms. A sequence of `char16_t` is more useful for internal manipulation in the program. A sequence of bytes is the way to exchange such strings with the external world. The terms I'll use instead of "byte" and "wide" are thus "serialized" and "deserialized".
‡ If you are about to say "but Windows!", hold your horses. All versions of Windows since Windows 2000 use UTF-16.
☦ Yes, I know about the großes Eszett (ẞ), but even if you were to change all German locales overnight to have ß uppercase to ẞ, there's still plenty of other cases where this would fail. Try uppercasing U+FB00 ʟᴀᴛɪɴ sᴍᴀʟʟ ʟɪɢᴀᴛᴜʀᴇ ғғ. There is no ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟɪɢᴀᴛᴜʀᴇ ғғ; it just uppercases to two Fs. Or U+01F0 ʟᴀᴛɪɴ sᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴊ ᴡɪᴛʜ ᴄᴀʀᴏɴ; there's no precomposed capital; it just uppercases to a capital J and a combining caron.
C11 Unicode Support
That's a good question with no apparent answer.
The `uchar.h` types and functions added in C11 are largely useless. They only support conversions between the new types (`char16_t` or `char32_t`) and the locale-specific, implementation-defined multibyte encoding, mappings which are not going to be complete unless the locale is UTF-8 based. The useful conversions (to/from `wchar_t`, and to/from UTF-8) are not supported. Of course you can roll your own conversions to/from UTF-8, since these conversions are 100% specified by the relevant RFCs/UCS/Unicode standards, but be careful: most people implement them wrong and end up with dangerous bugs.
Note that the new compiler-level features for UTF-8, UTF-16, and UTF-32 literals (`u8`, `u`, and `U`, respectively) are potentially useful; you can process the resulting strings with your own functions in meaningful ways that don't depend at all on locale. But the library-level support for Unicode in C11 is, in my opinion, basically useless.
Unicode in C++11
Is the above analysis correct?
Let's see.
> you can't validate an array of bytes as containing valid UTF-8
Incorrect. `std::codecvt_utf8<char32_t>::length(state, start, end, max_length)` returns the number of valid bytes at the start of the array.
> you can't find out the length
Partially correct. One can convert to `char32_t` and find out the length of the result. There is no easy way to find out the length without doing the actual conversion (but see below). I must say that the need to count characters (in any sense) arises rather infrequently.
> you can't iterate over a std::string in any way other than byte-by-byte
Incorrect. `std::codecvt_utf8<char32_t>::length(state, start, end, 1)` gives you a way to iterate over UTF-8 "characters" (Unicode code points), and of course to determine their number (that's not an "easy" way to count characters, but it is a way).
> doesn't really support UTF-16
Incorrect. One can convert to and from UTF-16 with, e.g., `std::codecvt_utf8_utf16<char16_t>`. The result of a conversion to UTF-16 is, well, UTF-16. It is not restricted to the BMP.
Demo that illustrates these points.
If I have missed some other "you can't", please point it out and I will address it.
Important addendum. These facilities are deprecated in C++17 (and have since been removed in C++26). Use them at your own risk. Everything enumerated in the original question now once again cannot (safely) be done using only the standard library.
Standard way in C11 and C++11 to convert UTF-8?
Sounds like you're looking for the std::codecvt type. See the example on that page for usage.
Unicode Processing in C++
- Use ICU for dealing with your data (or a similar library).
- In your own data store, make sure everything is stored in the same encoding.
- Make sure you are always using your Unicode library for mundane tasks like string length, capitalization status, etc. Never use standard library builtins like `is_alpha` unless that is the definition you want.
- I can't say it enough: never iterate over the indices of a `string` if you care about correctness; always use your Unicode library for this.
If a C++ compiler supports Unicode character set, does it necessary that the basic character set of the implementation is also Unicode?
Does it mean the basic character set of this implementation is also Unicode?
No, there is no such requirement, and there are very few implementations where `char` is large enough to hold arbitrary Unicode characters.

`char` is large enough to hold members of the basic character set, but what happens with characters that aren't in the basic character set depends on the implementation.

On some systems, everything might be converted to one character set such as ISO 8859-1, which has fewer than 256 characters and so fits entirely in `char`.

On other systems, everything might be encoded as UTF-8, meaning a single logical character potentially takes up several `char` values.
C++ Correctly read files whose Unicode characters might be larger than a byte
I'd recommend watching Unicode in C++ by James McNellis. That will help explain what facilities C++ has and does not have when dealing with Unicode. You will see that C++ lacks good support for easily working with UTF-8. Since it sounds like you want to iterate over each glyph (not just code points), I'd recommend using a third-party library to handle the intricacies. utfcpp has worked well for me.
Basic issue regarding full unicode in C++
You are in the gray zone of C++ Unicode. Unicode initially started as an extension of the 7-bit ASCII characters, or of multi-byte characters, to plain 16-bit characters, in what later became the BMP. Those 16-bit characters were adopted natively by languages like Java and systems like Windows. C and C++, being more conservative from a standards point of view, decided that `wchar_t` would be an implementation-dependent wide character type that could be 16 or 32 bits wide (or even more...) depending on requirements. The good side was that it was extensible; the dark side was that it was never made clear how non-BMP Unicode characters should be represented when `wchar_t` is only 16 bits.

UTF-16 was then created to allow a standard representation of those non-BMP characters, with the downside that they need two 16-bit code units, and that `std::char_traits<wchar_t>::length` would again be wrong if any of them are present in a `wstring`.

That's the reason why most C++ implementations chose that `wchar_t`-based I/O would only process BMP Unicode characters correctly, so that `length` returns a true number of characters.
The C++-ish way is to use `char32_t`-based strings when full Unicode support is required. In fact `wstring` and `wchar_t` (prefix L for literals) are implementation-dependent types, and since C++11 you also have `char16_t` and `u16string` (prefix u), which explicitly use UTF-16, and `char32_t` and `u32string` (prefix U) for full Unicode support through UTF-32. The problem with storing characters outside the BMP in a `u16string` is that you lose the property that the size of the string equals the number of characters, which was a key reason for using wide characters instead of multi-byte characters.
One problem for `u32string` is that the I/O library still has no direct specialization for 32-bit characters, but as the converters do, you can probably use them easily when you process files with a `std::basic_fstream<char32_t>` (untested, but according to the standard it should work). However, you will have no standard stream for `cin`, `cout` and `cerr`, and will probably have to process the native form in a `string` or `u16string`, and then convert everything to `u32string` with the help of the standard converters introduced in C++11, or the hard way if you cannot use them.
The really dark side is that this native part currently depends on the OS, so you will not be able to set up a fully portable way to process full Unicode - or at least I know of none.
Does `std::wregex` support utf-16/unicode or only UCS-2?
The C++ standard doesn't enforce any encoding on `std::string` and `std::wstring`. They're simply a series of `CharT`. Only `std::u8string`, `std::u16string` and `std::u32string` have a defined encoding:
- What encoding does std::string.c_str() use?
- Does std::string in c++ has encoding format
Similarly, `std::regex` and `std::wregex` also wrap around `std::basic_string` and `CharT`. Their constructors accept `std::basic_string`, and the encoding being used for `std::basic_string` will also be used for `std::basic_regex`. So what you said, "std::regex isn't utf-8 or does not support it", is wrong. If the current locale is UTF-8 then `std::regex` and `std::string` will be UTF-8 (and yes, modern Windows does support a UTF-8 locale).
On Windows `std::wstring` uses UTF-16, so `std::wregex` also uses UTF-16. UCS-2 is deprecated and no one uses it anymore. You don't even need to differentiate between them, since UCS-2 is just a subset of UTF-16, unless you use some very old tool that cuts in the middle of a surrogate pair. String searches in UTF-16 work exactly the same as in UCS-2, because UTF-16 is self-synchronizing and a proper needle string can never match the middle of a haystack. The same goes for UTF-8. If a tool doesn't understand UTF-16 then it's highly likely that it doesn't know that UTF-8 is variable-length either, and will truncate the UTF-8 in the middle.
> Self-synchronization: The leading bytes and the continuation bytes do not share values (continuation bytes start with 10 while single bytes start with 0 and longer lead bytes start with 11). This means a search will not accidentally find the sequence for one character starting in the middle of another character. It also means the start of a character can be found from a random position by backing up at most 3 bytes to find the leading byte. An incorrect character will not be decoded if a stream starts mid-sequence, and a shorter sequence will never appear inside a longer one.
>
> https://en.wikipedia.org/wiki/UTF-8#Description
The only things you need to care about are: avoid truncating in the middle of a character, and normalize the strings before matching if necessary. The former issue can be avoided in UCS-2-only regex engines if you never use characters outside the BMP in a character class, as commented above; replace them with a group instead.
> In other languages normalization takes place.

This is wrong. Some languages may do normalization before matching a regex, but that definitely doesn't apply to all "other languages".
If you want a little bit more assurance then use `std::basic_regex<char8_t>` and `std::basic_regex<char16_t>` for UTF-8 and UTF-16 respectively. You'll still need a UTF-16-aware library, though; otherwise that'll still only work for regex strings that only contain words.
The better solution may be changing to another library like ICU regex. You can check Comparison of regular expression engines for some suggestions; it even has a column indicating native UTF-16 support for each library.
Related:
- Do C++11 regular expressions work with UTF-8 strings?
- How well is Unicode supported in C++11?
- How do I properly use std::string on UTF-8 in C++?
- How to use Unicode range in C++ regex
See also
- Unicode Regular Expressions
- Unicode Support in the Standard Library