Unicode Encoding For String Literals in C++11

Unicode encoding for string literals in C++11

Are the \x/\u/\U character references freely combinable with all string types?

No. \x can be used in anything, but \u and \U can only be used in strings that are specifically UTF-encoded. However, for any UTF-encoded string, \u and \U can be used as you see fit.

Are all the string types fixed-width, i.e. the arrays contain precisely as many elements as appear in the literal, or do \x/\u/\U references get expanded into a variable number of bytes?

Not in the way you mean. \x, \u, and \U are converted based on the string encoding. The number of code units (to use the Unicode term; a char16_t is one UTF-16 code unit) depends on the encoding of the containing string. The literal u8"\u1024" would create a string containing 3 chars plus a null terminator, because U+1024 needs three UTF-8 code units. The literal u"\u1024" would create a string containing 1 char16_t plus a null terminator.

The number of code units used is based on the Unicode encoding.
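
To see this concretely, here is a small compile-time check (my own sketch, not part of the original answer) that a conforming C++11 compiler accepts:

static_assert(sizeof(u8"\u1024") / sizeof(char)     == 4, "3 UTF-8 code units + null terminator");
static_assert(sizeof(u"\u1024")  / sizeof(char16_t) == 2, "1 UTF-16 code unit + null terminator");
static_assert(sizeof(U"\u1024")  / sizeof(char32_t) == 2, "1 UTF-32 code unit + null terminator");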

Do u"" and u8"" strings have encoding semantics, e.g. can I say char16_t x[] = u"\U0010FFFF", and the non-BMP codepoint gets encoded into a two-unit UTF16 sequence?

u"" creates a UTF-16 encoded string. u8"" creates a UTF-8 encoded string. They will be encoded per the Unicode specification.

For the UTF-encoded strings above, can I write lone surrogates with \u?

Absolutely not. The specification expressly forbids using code points in the UTF-16 surrogate range (0xD800-0xDFFF) with \u or \U.

Finally, are any of the string functions encoding aware (i.e. they are character-aware and can detect invalid byte sequences)?

Absolutely not. Well, allow me to rephrase that.

std::basic_string doesn't deal with Unicode encodings. Its specializations can certainly store UTF-encoded strings, but they can only think of them as sequences of char, char16_t, or char32_t; they can't think of them as a sequence of Unicode code points that are encoded with a particular mechanism. basic_string::length() will return the number of code units, not code points. And obviously, the C standard library string functions are totally useless here.

It should be noted, however, that the "length" of a Unicode string is not the number of code points either. Some code points are combining "characters" (an unfortunate name), which combine with the previous code point, so multiple code points can map to a single visual character.
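
A short illustration of both points (my own sketch, assuming the literals are stored as shown):

#include <string>

std::u16string one_codepoint = u"\U0010FFFF"; // a single non-BMP code point
// one_codepoint.length() == 2: two UTF-16 code units (a surrogate pair)

std::string combining = u8"e\u0301"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT
// combining.length() == 3: one byte for 'e' plus two bytes for U+0301,
// yet the two code points render as the single visual character "é"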

Iostreams can in fact read/write Unicode-encoded values. To do so, you will have to use a locale to specify the encoding and properly imbue it into the various places. This is easier said than done, and I don't have any code on me to show you how.
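
For what it's worth, here is a minimal sketch (my own illustration, not the quoted answer's code) of the kind of imbuing meant above, using the C++11 <codecvt> facets (later deprecated in C++17): it reads a UTF-8 file through a wide stream.

#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical helper: read a UTF-8 encoded file into a std::wstring.
std::wstring read_utf8_file(const char* path)
{
    std::wifstream in(path);
    // The imbued codecvt facet converts the file's UTF-8 bytes to wchar_t
    // code units as they are read.
    in.imbue(std::locale(in.getloc(), new std::codecvt_utf8<wchar_t>));
    return std::wstring(std::istreambuf_iterator<wchar_t>(in),
                        std::istreambuf_iterator<wchar_t>());
}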

How does file encoding affect C++11 string literals?

In GCC, use -finput-charset=charset:

Set the input character set, used for translation from the character set of the input file to the source character set used by GCC. If the locale does not specify, or GCC cannot get this information from the locale, the default is UTF-8. This can be overridden by either the locale or this command line option. Currently the command line option takes precedence if there's a conflict. charset can be any encoding supported by the system's "iconv" library routine.

Also check out the options -fexec-charset and -fwide-exec-charset.
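
For example, a hypothetical invocation for a source file saved as Latin-1 that should produce UTF-8-encoded narrow string literals could look like this:

g++ -finput-charset=ISO-8859-1 -fexec-charset=UTF-8 -c main.cpp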

Finally, about string literals:

char     a[] = "Hello";
wchar_t  b[] = L"Hello";
char16_t c[] = u"Hello";
char32_t d[] = U"Hello";

The encoding prefix of the string literal (L, u, U) merely determines the type of the literal.
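
As a quick compile-time check of that claim (a sketch of my own, not from the quoted answer):

#include <type_traits>

static_assert(std::is_same<decltype(L"Hello"), const wchar_t(&)[6]>::value,  "L gives wchar_t");
static_assert(std::is_same<decltype(u"Hello"), const char16_t(&)[6]>::value, "u gives char16_t");
static_assert(std::is_same<decltype(U"Hello"), const char32_t(&)[6]>::value, "U gives char32_t");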

C++ Literals and Unicode

First see: http://en.cppreference.com/w/cpp/language/string_literal

std::cout's operator<< is overloaded for const char*, which is why the first two strings are printed as text.

cout << "\n\n Hello World! (plain) \n";
cout << u8"\n Hello World! (u8) \n";

As expected, this prints[1]:

Hello World! (plain)

Hello World! (u8)

Meanwhile, std::cout has no operator<< overload for const char16_t*, const char32_t*, or const wchar_t*, so those arguments fall back to the overload for printing pointers (const void*). That is why:

cout << u"\n Hello World! (u) \n";
cout << U"\n Hello World! (U) \n";
cout << L"\n Hello World! (plain) \n\n";

Prints:

0x47f0580x47f0840x47f0d8

As you can see, there are actually 3 pointer values printed there: 0x47f058, 0x47f084 and 0x47f0d8


However, for the last one, you can get it to print properly using std::wcout

std::wcout << L"\n Hello World! (plain) \n\n";

prints

 Hello World! (plain)

[1]: The u8 literal printed as expected because the first 128 Unicode code points are encoded identically to ASCII in UTF-8, and this string contains only those characters.
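
If you do want std::cout to print the u/U literals as text, one workaround (a sketch using the C++11 <codecvt> facilities, deprecated since C++17; it is not part of the quoted answer) is to convert them to UTF-8 first:

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    // Convert the char16_t/char32_t strings to UTF-8 bytes, which the
    // const char* overload of operator<< prints as text.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv16;
    std::cout << conv16.to_bytes(u"\n Hello World! (u) \n");

    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv32;
    std::cout << conv32.to_bytes(U"\n Hello World! (U) \n");
}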

Is it OK to write Unicode characters inside string literals in C++?

Whether you can use Unicode characters in your source code (not just in string literals) is implementation-defined. The only way to be portable is to stick to characters in the "basic source character set" and use u8"\u00a7some-text".

[lex.phases]/1:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (e.g., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)

The "basic source character set" is:

The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:

a b c d e f g h i j k l m n o p q r s t u v w x y z

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

0 1 2 3 4 5 6 7 8 9

_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '

Why is there no ASCII or UTF-8 character literal in C11 or C++11?

It is perfectly acceptable to write non-portable C code, and this is one of many good reasons to do so. Feel free to assume that your system uses ASCII or some superset thereof, and warn your users that they shouldn't try to run your program on an EBCDIC system.

If you are feeling very generous, you can encode a check. The gperf program is known to generate code that includes such a check.

_Static_assert('0' == 48, "must be ASCII-compatible");

Or, for pre-C11 compilers,

extern int must_be_ascii_compatible['0' == 48 ? 1 : -1];

If you are on C11, you can use the u or U prefix on character constants, but not the u8 prefix...

/* This is useless, doesn't do what you want... */
_Static_assert(0, "this code is broken everywhere");
if (c == '々') ...

/* This works as long as wchar_t is UTF-16 or UTF-32 or UCS-2... */
/* Note: you shouldn't be using wchar_t, though... */
_Static_assert(__STDC_ISO_10646__, "wchar_t must be some form of Unicode");
if (c == L'々') ...

/* This works as long as char16_t is UTF-16 or UCS-2... */
_Static_assert(__STDC_UTF_16__, "char16_t must be UTF-16");
if (c == u'々') ...

/* This works as long as char32_t is UTF-32... */
_Static_assert(__STDC_UTF_32__, "char32_t must be UTF-32");
if (c == U'々') ...

There are some projects that are written in very portable C and have been ported to non-ASCII systems (example). This required a non-trivial amount of porting effort, and there's no real reason to make the effort unless you know you want to run your code on EBCDIC systems.

On standards: The people writing the C standard have to contend with every possible C implementation, including some downright bizarre ones. There are known systems where sizeof(char) == sizeof(long), CHAR_BIT != 8, integral types have trap representations, sizeof(void *) != sizeof(int *), sizeof(void *) != sizeof(void (*)()), va_list are heap-allocated, etc. It's a nightmare.

Don't beat yourself up trying to write code that will run on systems you've never even heard of, and don't search too hard for guarantees in the C standard.

For example, as far as the C standard is concerned, the following is a valid implementation of malloc:

void *malloc(size_t size) { return NULL; }

Note that while u8"..." constants are guaranteed to be UTF-8, u"..." and U"..." have no guarantees except that the code units are 16 bits and 32 bits wide, respectively, and the actual encoding must be documented by the implementation.

Summary: Safe to assume ASCII compatibility in 2012.

Restrictions to Unicode escape sequences in C11

It's precisely to avoid alternative spellings.

The primary motivations for adding Universal Character Names (UCNs) to C and C++ were to:

  • allow identifiers to include letters outside of the basic source character set (like ñ, for example).

  • allow portable mechanisms for writing string and character literals which include characters outside of the basic source character set.

Furthermore, there was a desire that the changes to existing compilers be as limited as possible, and in particular that compilers (and other tools) could continue to use their established (and often highly optimised) lexical analysis functions.

That was a challenge, because there are huge differences in the lexical analysis architectures of different compilers. Without going into all the details, it appeared that two broad implementation strategies were possible:

  1. The compiler could internally use some single universal encoding, such as UTF-8. All input files in other encodings would be transcribed into this internal encoding very early in the input pipeline. Also, UCNs (wherever they appeared) would be converted to the corresponding internal encoding. This latter transformation could be conducted in parallel with continuation line processing, which also requires detecting backslashes, thus avoiding an extra test on every input character for a condition which very rarely turns out to be true.

  2. The compiler could internally use strict (7-bit) ASCII. Input files in encodings allowing other characters would be transcribed into ASCII with non-ASCII characters converted to UCNs prior to any other lexical analysis.

In effect, both of these strategies would be implemented in Phase 1 (or equivalent), which is long before lexical analysis has taken place. But note the difference: strategy 1 converts UCNs to an internal character coding, while strategy 2 converts non-representable characters to UCNs.

What these two strategies have in common is that once the transcription is finished, there is no longer any difference between a character entered directly into the source stream (in whatever encoding the source file uses) and a character described with a UCN. So if the compiler allows UTF-8 source files, you could enter an ñ as either the two bytes 0xc3, 0xb1 or as the six-character sequence \u00F1, and they would both end up as the same byte sequence. That, in turn, means that every identifier has only one spelling, so no change is necessary (for example) to symbol table lookup.

Typically, compilers just pass variable names through the compilation pipeline, leaving them to be eventually handled by assemblers or linkers. If these downstream tools do not accept extended character encodings or UCNs (depending on implementation strategy) then names containing such characters need to be "mangled" (transcribed) in order to make them acceptable. But even if that's necessary, it's a minor change and can be done at a well-defined interface.

Rather than resolve arguments between compiler vendors whose products (or development teams) had clear preferences between the two strategies, the C and C++ standards committees chose mechanisms and restrictions which make both strategies compatible. In particular, both committees forbid the use of UCNs which represent characters which already have an encoding in the basic source character set. That avoids questions like:

  • What happens if I put \u0022 inside a string literal:

      const char* quote = "\u0022";

    If the compiler translates UCNs to the characters they represent, then by the time the lexical analyser sees that line, "\u0022" will have been converted to """, which is a lexical error. On the other hand, a compiler which retains UCNs until the end would happily accept that as a string literal. Banning the use of a UCN which represents a quotation mark avoids this possible non-portability.

  • Similarly, would '\u005cn' be a newline character? Again, if the UCN is converted to a backslash in Phase 1, then in Phase 3 the character literal would definitely be treated as a newline. But if the UCN is converted to a character value only after the character literal token has been identified as such, then the resulting character literal would contain two characters (an implementation-defined value).

  • And what about 2 \u002B 2? Is that going to look like an addition, even though UCNs aren't supposed to be used for punctuation characters? Or will it look like an identifier starting with a non-letter code?

And so on, for a large number of similar issues.

All of these details are avoided by the simple expedient of requiring that UCNs cannot be used to spell characters in the basic source character set. And that's what was embodied in the standards.

Note that the "basic source character set" does not contain every ASCII character. It does not contain the majority of the control characters, and nor does it contain the ASCII characters $, @ and `. These characters (which have no meaning in a C or C++ program outside of string and character literals) can be written as the UCNs \u0024, \u0040 and \u0060 respectively.

Finally, in order to see what sort of knots you need to untie in order to correctly lexically analyse C (or C++), consider the following snippet:

const char* s = "\\
n";

Because continuation lines are spliced in Phase 2, prior to lexical analysis (Phase 3), and that phase only looks for the two-character sequence consisting of a backslash followed by a newline, that line is the same as

const char* s = "\n";

But that might not have been obvious looking at the original code.

Choose default type and encoding for C++ string literals at compile time

Unfortunately, the solution to this problem is to use macros. Although @Nadim Farhat pointed out that you can do a certain amount of choosing with GCC, it is by no means a portable solution.
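
A minimal sketch of the macro approach (the names USE_UTF16_LITERALS, char_type, string_type and LIT are mine, purely illustrative):

#include <string>

#if defined(USE_UTF16_LITERALS)
    typedef char16_t char_type;
    #define LIT(s) u ## s      // pastes the encoding prefix onto the literal: u"..."
#else
    typedef char char_type;
    #define LIT(s) u8 ## s     // UTF-8 narrow literal: u8"..."
#endif

typedef std::basic_string<char_type> string_type;

string_type greeting = LIT("Hello");

The choice is then made at build time, for example by passing -DUSE_UTF16_LITERALS on the compiler command line.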

c++ string literal still confusing

C++ doesn't have normal Unicode support. You just can't write a normally globalized application in C++ without using 3rd-party libraries. Read this insightful SO answer. If you really need to write an application which uses Unicode, I'd look at the ICU library.

Compare Unicode std::string with usual literal or u8 declaration

"\u00dc" is a char[] encoded in whatever the compiler/OS's default 8-bit encoding happens to be, so it can be different on different machines. On Windows, that tends to be the OS's default Ansi encoding, or it could be the encoding that the source file is saved as.

L"\u00dc" is a wchar_t[] encoded with either UTF-16 or UTF-32, depending on the compiler's definition of wchar_t (which is 16-bit on Windows, so UTF-16).

u8"\u00dc" is a char[] encoded in UTF-8.

u"\u00dc" is a char16_t[] encoded in UTF-16.

U"\u00dc" is a char32_t[] encoded in UTF-32.

The ""s suffix simply returns a std::string, std::wstring, std::u16string, or std::u32string, depending on whether a char[], wchar_t[], char16_t[], or char32_t[] is passed to it.

When comparing two strings, make sure they are in the same encoding first. This is especially important for your char[]/std::string data, as it could be in any number of 8-bit encodings, depending on the systems involved. This is not so much a problem if the app is generating the strings itself, but it is important if one or more of the strings is coming from an external source (file, user input, network protocol, etc).

In your example, "\u00dc" and "Ü" are not necessarily guaranteed to produce the same char[] sequence, depending on how the compiler interprets those different literals. But even if they did (which seems to be the case in your example), neither of them will likely produce UTF-8 (you have to go to extra measures to force that), which is why your comparison to utf8_encoded_string_s fails.

So, if you are expecting a string literal to be UTF-8, use u8"" to ensure that. If you are getting string data from an external source and need it to be in UTF-8, convert it to UTF-8 in code as soon as possible, if it is not already (which means you have to know the encoding used by the external source).
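
To make the failure mode concrete, here is a small sketch (my own, not taken from the question) contrasting the guaranteed-UTF-8 literal with the plain one:

#include <cassert>
#include <string>

int main()
{
    std::string utf8  = u8"\u00dc"; // always the UTF-8 bytes 0xC3 0x9C
    std::string plain = "\u00dc";   // bytes depend on the execution character set

    assert(utf8 == u8"\u00dc");     // always holds
    // assert(utf8 == plain);       // holds only if the execution charset is UTF-8
}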


