Using Unicode in C++ Source Code

Using Unicode in C++ source code

Encoding in C++ is quite complicated. Here is my understanding of it.

Every implementation has to support the characters of the basic source character set. These are the common characters listed in §2.2/1 (§2.3/1 in C++11), and they all fit into a single char. In addition, implementations have to support a way to name other characters, called universal-character-names; these look like \uffff or \Uffffffff and can be used to refer to Unicode characters. A subset of them is usable in identifiers (listed in Annex E).

This is all nice, but the mapping from characters in the file to source characters (used at compile time) is implementation-defined; this constitutes the encoding used. Here is what the standard says literally (C++98 version):

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences (2.3) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)

For gcc, you can change the source encoding using the option -finput-charset=charset. Additionally, you can change the execution character set used to represent values at runtime. The relevant options are -fexec-charset=charset for char (it defaults to UTF-8) and -fwide-exec-charset=charset for wchar_t (which defaults to either UTF-16 or UTF-32 depending on the size of wchar_t).
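As a minimal sketch of how these options interact (the flag values are purely illustrative), the execution character set determines the bytes stored for a literal; naming the character with a UCN keeps the program independent of the source encoding:

// Minimal sketch: the byte sequence stored for this literal depends on the
// execution character set chosen at compile time (gcc: -fexec-charset).
// Built with -fexec-charset=UTF-8 the literal occupies 2 bytes; built with
// -fexec-charset=ISO-8859-1 it occupies 1 byte.
#include <cstdio>
#include <cstring>

int main() {
    char const *o_umlaut = "\u00F6"; // LATIN SMALL LETTER O WITH DIAERESIS, named via a UCN
    std::printf("%zu byte(s)\n", std::strlen(o_umlaut));
}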

Unicode Identifiers and Source Code in C++11?

Is the new standard more open w.r.t. Unicode?

With respect to allowing universal character names in identifiers, the answer is no; UCNs were already allowed in identifiers back in C99 and C++98. However, compilers did not implement that particular requirement until recently. Clang 3.3, I think, introduces support for this, and GCC has had an experimental feature for it for some time. Herb Sutter also mentioned during his Build 2013 talk "The Future of C++" that this feature would be coming to VC++ at some point. (Although IIRC Herb refers to it as a C++11 feature; it is in fact a C++98 feature.)

It's not expected that identifiers will be written using UCNs. Instead, the expected behavior is to write the desired character using the source encoding. E.g., source will look like:

long pörk;

not:

long p\u00F6rk;

However, UCNs are also useful for another purpose: compilers are not all required to accept the same source encodings, but modern compilers all support some encoding scheme where at least the basic source characters have the same encoding (that is, modern compilers all support some ASCII-compatible encoding).

UCNs allow you to write source code with only the basic characters and yet still name extended characters. This is useful in, for example, writing a string literal "°" in source code that will be compiled both as CP1252 and as UTF-8:

char const *degree_sign = "\u00b0";

This string literal is encoded into the appropriate execution encoding on multiple compilers, even when the source encodings differ, as long as the compilers at least share the same encoding for basic characters.

Can (portable) source code be in any Unicode encoding, like UTF-8, UTF-16, or any (however-defined) codepage?

It's not required by the standard, but most compilers will accept UTF-8 source. Clang supports only UTF-8 source (although it has some compatibility for non-UTF-8 data in character and string literals), gcc allows the source encoding to be specified and includes support for UTF-8, and VC++ will guess at the encoding and can be made to guess UTF-8.

(Update: VS2015 now provides an option to force the source and execution character sets to be UTF-8.)

Can I write an identifier with \u1234 in it, e.g. myfu\u1234ntion (for whatever purpose)?

Yes, the specification mandates this, although as I said not all compilers implement this requirement yet.

Or can I use the "character names" that Unicode defines, like in ICU, i.e.

const auto x = "German Braunb\U{LOWERCASE LETTER A WITH DIARESIS}r."u32;

No, you cannot use Unicode long names.

or even in an identifier in the source itself? That would be a treat... cough...

If the compiler supports a source encoding that contains the extended character you want, then that character written literally in the source must be treated exactly the same as the equivalent UCN. So yes, if you use a compiler that supports this requirement of the C++ spec, then you may write any character in its source character set directly in the source without bothering with UCNs.

Using Unicode in a C++ source file

Personally, I don't use any non-ASCII characters in source code. The reason is that if you use arbitrary Unicode characters in your source files, you have to worry about the encoding that the compiler considers the source file to be in, what execution character set it will use, and how it's going to do the source to execution character set conversion.

I think it's a much better idea to have Unicode data in some sort of resource file, which could be compiled to static data at compile time or loaded at runtime for maximum flexibility. That way you can control how the encoding occurs, and not worry about how the compiler behaves, which may be influenced by the locale settings at compile time.

It does require a bit more infrastructure, but if you're having to internationalize, it's well worth spending the time to choose or develop a flexible and robust strategy.

While it's possible to use universal character escapes (L'\uXXXX') or explicitly encoded byte sequences ("\xXX\xYY\xZZ") in source code, this makes Unicode strings virtually unreadable for humans. If you're having translations made, it's easier for most people involved in the process to deal with text in an agreed universal character encoding scheme.
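For illustration, here is the same (arbitrary) Greek word written three ways; the escaped forms are what translators would have to read if the text lived in the source rather than in a resource file:

// The same word, three ways (illustrative only):
wchar_t const *w     = L"\u03ba\u03cc\u03c3\u03bc\u03bf\u03c2";            // universal character escapes
char const    *bytes = "\xce\xba\xcf\x8c\xcf\x83\xce\xbc\xce\xbf\xcf\x82"; // explicitly encoded UTF-8 bytes
char const    *plain = "κόσμος";                                           // readable, but depends on the source encoding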

C programming: How to program for Unicode?

Note that this is not about "strict unicode programming" per se, but some practical experience.

What we did at my company was to create a wrapper library around IBM's ICU library. The wrapper library has a UTF-8 interface and converts to UTF-16 when it is necessary to call ICU. In our case, we did not worry too much about performance hits. When performance was an issue, we also supplied UTF-16 interfaces (using our own datatype).

Applications could remain largely as-is (using char), although in some cases they need to be aware of certain issues. For instance, instead of strncpy() we use a wrapper which avoids cutting off UTF-8 sequences. In our case, this is sufficient, but one could also consider checks for combining characters. We also have wrappers for counting the number of codepoints, the number of graphemes, etc.
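A sketch of the kind of strncpy() replacement described above (the name utf8_ncpy is made up for this example; it is not the actual wrapper from our library):

// Hypothetical sketch: copy at most size-1 bytes of UTF-8 into dst,
// always null-terminating and never cutting a multi-byte sequence in half.
#include <cstddef>
#include <cstring>

void utf8_ncpy(char *dst, char const *src, std::size_t size) {
    if (size == 0) return;
    std::size_t n = std::strlen(src);
    if (n >= size) {
        n = size - 1;
        // If the byte just past the cut is a continuation byte (10xxxxxx),
        // the cut falls inside a sequence; back up to the previous boundary.
        while (n > 0 && (static_cast<unsigned char>(src[n]) & 0xC0) == 0x80)
            --n;
    }
    std::memcpy(dst, src, n);
    dst[n] = '\0';
}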

When interfacing with other systems, we sometimes need to do custom character composition, so you may need some flexibility there (depending on your application).

We do not use wchar_t. Using ICU avoids unexpected issues in portability (but not other unexpected issues, of course :-).

Writing unicode C++ source code

First, regardless of what the interface says, the question isn't Unicode or not, but UTF-16 or UTF-8. Practically speaking, for external data, you should only use UTF-8. Internally, it depends on what you are doing. Conversion of UTF-8 to UTF-16 is an added complication, but for more complex operations, it may be easier to work in UTF-16. (Although the differences between UTF-8 and UTF-16 aren't enormous. To reap any real benefits, you'd have to use UTF-32, and even then...)
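As a sketch of that conversion step, here is one way to go from UTF-8 to UTF-16 with only standard facilities (std::wstring_convert exists since C++11 but is deprecated as of C++17; ICU or a hand-written converter are common alternatives):

// Minimal sketch of a UTF-8 -> UTF-16 conversion using standard facilities.
// Note: from_bytes throws std::range_error on invalid UTF-8 input.
#include <codecvt>
#include <locale>
#include <string>

std::u16string utf8_to_utf16(std::string const &utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}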

In practice, I would avoid the W functions completely, and always use char const* at the system interface level. But again, it depends on what you are doing. This is just a general guideline. For the rest, I'd stick with std::string unless there was some strong reason not to.

How to print Unicode character in C++?

To represent the character you can use Universal Character Names (UCNs). The character 'ф' has the Unicode value U+0444, so in C++ you could write it as '\u0444' or '\U00000444'. Also, if the source code encoding supports this character, you can just write it literally in your source code.

// both of these assume that the character can be represented with
// a single char in the execution encoding
char b = '\u0444';
char a = 'ф'; // this line additionally assumes that the source character encoding supports this character

Printing such characters out depends on what you're printing to. If you're printing to a Unix terminal emulator, and the terminal emulator uses an encoding that supports this character, and that encoding matches the compiler's execution encoding, then you can do the following:

#include <iostream>

int main() {
    std::cout << "Hello, ф or \u0444!\n";
}

This program does not require that 'ф' can be represented in a single char. On OS X and most any modern Linux install this will work just fine, because the source, execution, and console encodings will all be UTF-8 (which supports all Unicode characters).

Things are harder with Windows and there are different possibilities with different tradeoffs.

Probably the best option, if you don't need portable code (you'll be using wchar_t, which should really be avoided on every other platform), is to set the mode of the output file handle to accept only UTF-16 data.

#include <iostream>
#include <io.h>
#include <fcntl.h>

int main() {
    _setmode(_fileno(stdout), _O_U16TEXT);
    std::wcout << L"Hello, \u0444!\n";
}

Portable code is more difficult.

using unicode in a C++ program

The suitable object to handle Unicode strings in C++ is icu::UnicodeString (see "API References, ICU4C" in the ICU documentation), at least if you want to really handle Unicode strings (as opposed to just passing them from one point of your application to another).

wchar_t was an early attempt at handling international character sets, and it turned out to be a failure: Microsoft's definition of wchar_t as two bytes became insufficient once Unicode was extended beyond code point 0x10000. Linux defines wchar_t as four bytes, but the inconsistency makes it (and its derived std::wstring) rather useless for portable programming.

TCHAR is a Microsoft define that resolves to char by default and to WCHAR if UNICODE is defined, with WCHAR in turn being wchar_t behind a level of indirection... yeah.

C++11 brought us char16_t and char32_t as well as the corresponding string classes, but those are still instantiations of basic_string<>, and as such have their shortcomings, e.g. when trying to uppercase / lowercase characters that map to more than one replacement character (e.g. the German ß would have to be expanded to SS in uppercase; the standard library cannot do that).

ICU, on the other hand, goes the full way. For example, it provides normalization and decomposition, which the standard strings do not.
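For example, a minimal sketch using ICU4C (this assumes the ICU headers are installed and the program is linked against the icuuc library): full Unicode case mapping turns the German ß into "SS", which the standard string classes cannot do.

#include <unicode/unistr.h>
#include <unicode/locid.h>
#include <iostream>
#include <string>

int main() {
    // Build "straße" from UTF-8; toUpper() performs full case mapping.
    icu::UnicodeString s = icu::UnicodeString::fromUTF8(u8"stra\u00dfe");
    s.toUpper(icu::Locale::getGerman());
    std::string out;
    s.toUTF8String(out);        // "STRASSE"
    std::cout << out << '\n';
}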


\uxxxx and \UXXXXXXXX are Unicode character escapes. The xxxx is a 16-bit hexadecimal number designating a Unicode code point in the Basic Multilingual Plane (equivalently, a UCS-2 value).

The XXXXXXXX is a 32-bit hex number designating a Unicode code point in any plane.

How those character escapes are handled depends on the context in which they appear (narrow / wide string, for example), making them somewhat less than perfect.

C++11 introduced "proper" Unicode literals:

u8"..." is always a const char[] in UTF-8 encoding.

u"..." is always a const uchar16_t[] in UTF-16 encoding.

U"..." is always a const uchar32_t[] in UTF-32 encoding.

If you use \uxxxx or \UXXXXXXXX within one of those three, the character will always be expanded to the proper code unit sequence.
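For illustration, here is the same code point expanded into different code unit sequences depending on the prefix (the array types shown are the C++11/14/17 ones; in C++20 the u8 literal's element type becomes char8_t):

char     const s8[]  = u8"\u00e9"; // UTF-8:  0xC3 0xA9 0x00              (two code units)
char16_t const s16[] = u"\u00e9";  // UTF-16: 0x00E9 0x0000               (one code unit)
char32_t const s32[] = U"\u00e9";  // UTF-32: 0x000000E9 0x00000000       (one code unit)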


Note that storing UTF-8 in a std::string is possible, but hazardous. You need to be aware of many things: .length() is not the number of characters in your string. .substr() can lead to partial and invalid sequences. .find_first_of() will not work as expected. And so on.
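A short illustration of the first two points (assuming the string really holds UTF-8):

#include <cassert>
#include <string>

int main() {
    std::string s = u8"fa\u00e7ade";  // "façade": 6 characters, 7 bytes in UTF-8
    assert(s.length() == 7);          // .length() counts bytes, not characters
    std::string cut = s.substr(0, 3); // "fa" plus half of the ç sequence: no longer valid UTF-8
    (void)cut;
}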

That being said, in my opinion UTF-8 is the only sane encoding choice for any stored text. There are cases to be made for handling texts as UTF-16 in-memory (the way ICU does), but on file, don't accept anything but UTF-8. It's space-efficient, endianness-independent, and allows for semi-sane handling even by software that is blissfully unaware of Unicode matters (see caveats above).

Unicode characters in C

It's implementation defined, and thus not regulated by the standard.

I know of at least one compiler, namely Clang, that requires the source to be UTF-8. Other compilers might have other requirements, or might not allow non-ASCII source at all.

Since C99, identifiers have been allowed to contain multi-byte characters; before C99, allowing non-basic characters there would have been an extension. C11 expanded the set of allowed characters.

There are some additional restrictions on what characters are allowed in identifiers, and © is not in the list. The allowed ranges are listed in Annex D and reproduced below (a small example follows the lists). These are Unicode code points, but that doesn't strictly mean the encoding in the file has to be Unicode-based.

Ranges of characters allowed

  • 00A8, 00AA, 00AD, 00AF, 00B2−00B5, 00B7−00BA, 00BC−00BE, 00C0−00D6, 00D8−00F6, 00F8−00FF
  • 0100−167F, 1681−180D, 180F−1FFF
  • 200B−200D, 202A−202E, 203F−2040, 2054, 2060−206F
  • 2070−218F, 2460−24FF, 2776−2793, 2C00−2DFF, 2E80−2FFF
  • 3004−3007, 3021−302F, 3031−303F
  • 3040−D7FF
  • F900−FD3D, FD40−FDCF, FDF0−FE44, FE47−FFFD
  • 10000−1FFFD, 20000−2FFFD, 30000−3FFFD, 40000−4FFFD, 50000−5FFFD, 60000−6FFFD, 70000−7FFFD, 80000−8FFFD, 90000−9FFFD, A0000−AFFFD, B0000−BFFFD, C0000−CFFFD, D0000−DFFFD, E0000−EFFFD

Ranges of characters disallowed initially

  • 0300−036F, 1DC0−1DFF, 20D0−20FF, FE20−FE2F
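As a small illustration of these ranges (on a compiler that actually implements extended identifiers), å (U+00E5) and ö (U+00F6) fall inside the allowed ranges, while © (U+00A9) does not:

int \u00e5ngstr\u00f6m = 1;   // OK: U+00E5 and U+00F6 are allowed identifier characters
// int \u00a9opyright = 1;    // ill-formed: U+00A9 is not in the allowed ranges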

Is it bad to have accented characters in c++ source code?

The main issue with using non-ASCII characters in C++ source is that the compiler must be aware of the encoding used for the source. If the source is 7-bit ASCII then it doesn't usually matter, since virtually all compilers assume an ASCII-compatible encoding by default.

Also, not all compilers are configurable as to the encoding, so two compilers might unconditionally use incompatible encodings, meaning that using non-ASCII characters can result in source code that can't be used with both.

  • GCC: has command-line options for setting the source, execution, and wide execution encodings. The defaults are set by the locale, which usually uses UTF-8 these days.
  • MSVC: uses a so-called 'BOM' to determine the source encoding (choosing between UTF-16BE/LE, UTF-8, and the system locale encoding), and always uses the system locale as the execution encoding. Edit: as of VS 2015 Update 2, MSVC supports compiler switches to control the source and execution charsets, including support for UTF-8.
  • Clang: always uses UTF-8 as the source and execution encodings

So consider what happens with code that searches for an accented character when the string being searched is UTF-8 (perhaps because the execution character set is UTF-8). Whether or not the character literal 'é' works as you expect, you will not be finding accented characters, because an accented character won't be represented by any single byte. Instead you'd have to search for the corresponding byte sequences.
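A short sketch of what that means in practice (assuming the string really holds UTF-8):

#include <string>

int main() {
    std::string text = u8"caf\u00e9";  // "café" in UTF-8: the é is the two bytes 0xC3 0xA9
    // No single-char search can match the é; search for its byte sequence instead.
    auto pos = text.find(u8"\u00e9");  // found at byte offset 3
    (void)pos;
}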


There are different kinds of escapes which C++ allows in character and string literals. Universal Character Names allow you to designate a Unicode code point, and will be handled exactly as if that character appeared in the source. For example \u00E9 or \U000000E9.

(Some other languages have \u for code points up to U+FFFF but lack C++'s support for code points beyond that, or make you use surrogate code points. You cannot use surrogate code points in C++; instead, C++ has the \U variant to support all code points directly.)

UCNs are also supposed to work outside of character and string literals. Outside such literals UCNs are restricted to characters not in the basic source character set. Until recently compilers didn't implement this (C++98) feature, however. Now Clang appears to have pretty complete support, MSVC seems to have at least partial support, and GCC purports to provide experimental support with the option -fextended-identifiers.

Recall that UCNs are supposed to be treated identically to the actual character appearing in the source; thus compilers with good UCN identifier support also allow you to simply write identifiers using the actual character, so long as the compiler's source encoding supports the character in the first place.

C++ also supports hex escapes. These are \x followed by any number of hexadecimal digits. A hex escape will represent a single integral value, as though it were a single codepoint with that value, and no conversion to the execution charset is done on the value. If you need to represent a specific byte (or char16_t, or char32_t, or wchar_t) value independent of encodings, then this is what you want.
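For instance (the values are illustrative):

char const raw[] = "\xE9";   // one byte with value 0xE9, no conversion to the execution charset
char const ucn[] = "\u00E9"; // U+00E9, converted to the execution charset (e.g. 0xC3 0xA9 under UTF-8)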

There are also octal escapes but they aren't as commonly useful as UCNs or hex escapes.


Here's the diagnostic that Clang shows when you use 'é' in a source file encoded as ISO-8859-1 or CP1252:

warning: illegal character encoding in character literal [-Winvalid-source-encoding]
std::printf("%c\n",'<E9>');
^

Clang issues this only as a warning and will just directly output a char object with the source byte's value. This is done for backwards compatibility with non-UTF-8 source code.

If you use UTF-8 encoded source then you get this:

error: character too large for enclosing character literal type
std::printf("%c\n",'<U+00E9>');
^

Clang detects that the UTF-8 encoding corresponds to the Unicode code point U+00E9, and that this code point is outside the range a single char can hold, and so reports an error. (Clang escapes the non-ASCII character as well, because it determined that the console it was run under couldn't handle printing the non-ASCII character.)


