Using char16_t and char32_t in I/O

In the proposal Minimal Unicode support for the standard library (revision 2) it is indicated that the Library Working Group only supported adding the new character types to strings and codecvt facets. Apparently the majority was opposed to supporting iostream, fstream, facets other than codecvt, and regex.

According to the minutes from the Portland meeting in 2006, "the LWG is committed to full support of Unicode, but does not intend to duplicate the library with Unicode character variants of existing library facilities." I haven't found any details; however, I would guess that the committee feels the current library interface is inappropriate for Unicode. One possible complaint could be that it was designed with fixed-size characters in mind, which Unicode completely obsoletes: while Unicode data can use fixed-size code points, it does not limit characters to single code points.

Personally, I think there's no reason not to standardize the minimal support that's already provided on various platforms (Windows uses UTF-16 for wchar_t, most Unix platforms use UTF-32). More advanced Unicode support will require new library facilities, but supporting char16_t and char32_t in iostreams and facets won't get in the way, and it would enable basic Unicode I/O.

char16_t and char32_t endianness

char16_t and char32_t do not guarantee Unicode encoding in C. (That guarantee is a C++ feature.) The macros __STDC_UTF_16__ and __STDC_UTF_32__, respectively, indicate that Unicode code points actually determine the fixed-size character values. See C11 §6.10.8.2 for these macros.

(By the way, __STDC_ISO_10646__ indicates the same thing for wchar_t, and it also reveals which Unicode edition is implemented via wchar_t. Of course, in practice, the compiler simply copies code points from the source file to strings in the object file, so it doesn't need to know much about particular characters.)
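
As a minimal sketch (plain preprocessor logic; whether these macros are predefined at all depends on the implementation), a translation unit could refuse to build when that promise is not made:

#if !defined(__STDC_UTF_16__) || !defined(__STDC_UTF_32__)
#error "char16_t/char32_t values are not guaranteed to be Unicode code points"
#endif
// __STDC_ISO_10646__, when defined, expands to an integer constant of the form
// yyyymmL naming the Unicode edition supported through wchar_t.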

Given that Unicode encoding is in effect, code point values stored in char16_t or char32_t must have the same object representation as uint_least16_t and uint_least32_t, because they are defined to be typedef aliases to those types, respectively (C11 §7.28). This is again somewhat in contrast to C++, which makes those types distinct but explicitly requires compatible object representation.

The upshot is that yes, there is nothing special about char16_t and char32_t. They are ordinary integers in the platform's endianness.

However, your test program has nothing to do with endianness. It simply uses the values of the wide characters without inspecting how they map to bytes in memory.
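
For illustration, a minimal sketch of what such an inspection could look like (the sample character and the output comments assume a typical platform with 16-bit code units):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    char16_t u = u'\u00E9';               // stored as the integer value 0x00E9
    static_assert(sizeof(char16_t) == sizeof(std::uint_least16_t),
                  "same object representation as uint_least16_t");

    unsigned char bytes[sizeof u];
    std::memcpy(bytes, &u, sizeof u);     // look at the object representation
    for (unsigned char b : bytes)
        std::printf("%02x ", b);          // e.g. "e9 00" little-endian, "00 e9" big-endian
    std::printf("\n");
}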

Are `char16_t` and `char32_t` misnomers?

The naming convention you refer to (uint32_t, int_fast32_t, etc.) is actually only used for typedefs, not for primitive types. The primitive integer types are {signed, unsigned} {char, short, int, long, long long} (as opposed to float or decimal types) ...

However, in addition to those integer types, there are four distinct, fundamental types, char, wchar_t, char16_t and char32_t, which are the types of the respective literals '', L'', u'' and U'' and are used for alphanumeric data, and similarly for arrays of those. Those types are of course also integer types, and thus they will have the same layout as some of the arithmetic integer types, but the language makes a very clear distinction between the former arithmetic types (which you would use for computations) and the latter "character" types, which form the basic unit of some type of I/O data.
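
A minimal sketch of that distinction in action (assuming a C++11 compiler): because char16_t is a distinct type rather than a typedef, it participates in overload resolution on its own.

#include <cstdint>
#include <iostream>

void which(char16_t)            { std::cout << "char16_t\n"; }
void which(std::uint_least16_t) { std::cout << "uint_least16_t\n"; }

int main()
{
    which(u'x');                        // exact match: the char16_t overload
    which(std::uint_least16_t(120));    // exact match: the integer overload
}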

(I've previously rambled about those new types here and here.)

So, I think that char16_t and char32_t are actually very aptly named to reflect the fact that they belong to the "char" family of integer types.

Why did C++11 introduce the char16_t and char32_t types?

One byte has never been enough. There are hundreds of ANSI 8-bit encodings in existence because people kept trying to stuff different languages into the confines of 8-bit limitations, so the same byte values have different meanings in different languages. Then Unicode came along to solve that problem, but it needed 16 bits to do so (UCS-2). Eventually, the needs of the world's languages exceeded 16 bits, so the UTF-8/16/32 encodings were created to extend the available values.

char16_t and char32_t (and their respective literal prefixes) were created to handle UTF-16/32 in a uniform manner on all platforms. Originally there was wchar_t, but it was created when Unicode was new, and its byte size was never standardized, even to this day. On some platforms wchar_t is 16-bit (UTF-16), whereas on other platforms it is 32-bit (UTF-32) instead. This has caused plenty of interoperability issues over the years when exchanging Unicode data across platforms. char16_t and char32_t were finally introduced to have standardized sizes (at least 16 and 32 bits, respectively) and semantics on all platforms.
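
A small sketch of that uniformity (the sample character is only an illustration, any code point above U+FFFF would do): a character outside the Basic Multilingual Plane takes two UTF-16 code units but one UTF-32 code unit, on every platform.

#include <iostream>

int main()
{
    // U+1F600 lies outside the BMP: a surrogate pair in UTF-16, one unit in UTF-32.
    const char16_t u16[] = u"\U0001F600";
    const char32_t u32[] = U"\U0001F600";

    // Array lengths include the terminating null character.
    std::cout << "UTF-16 code units: " << (sizeof u16 / sizeof u16[0]) - 1 << '\n';  // 2
    std::cout << "UTF-32 code units: " << (sizeof u32 / sizeof u32[0]) - 1 << '\n';  // 1
}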

Using char16_t, char32_t etc. without C++11?

Checking whether these types are supported is a platform-dependent thing, I think. For example, GCC defines __CHAR16_TYPE__ and __CHAR32_TYPE__ if these types are provided (this requires either ISO C11 or C++11 support).

However, you cannot check for their presence directly, because they are fundamental types, not macros:

In C++, char16_t and char32_t are fundamental types (and thus this header does not define such macros in C++).

However, you could check for C++11 support. According to Bjarne Stroustrup's page:

__cplusplus

In C++11 the macro __cplusplus will be set to a value that differs from (is greater than) the current 199711L.

So, basically, you could do:

#if __cplusplus > 199711L
// Has C++ 11, so we can assume presence of `char16_t` and `char32_t`
#else
// No C++ 11 support, define our own
#endif

How to define your own?

-> MSVC, ICC on Windows: use platform-specific types, supported in VS .NET 2003 and newer:

typedef __int16 int16_t;
typedef unsigned __int16 uint16_t;
typedef __int32 int32_t;
typedef unsigned __int32 uint32_t;

-> GCC, MinGW, ICC on Linux: these have full C99 support, so use the types from <cstdint> and don't typedef your own (you may want to check the version or a compiler-specific macro).

And then:

typedef int16_t char16_t;
typedef uint16_t uchar16_t;
typedef int32_t char32_t;
typedef uint32_t uchar32_t;

How to check what compiler is in use? Use this great page ('Compilers' section).

I'm trying to print a Chinese character using the types wchar_t, char16_t and char32_t, to no avail.

Since you're running your test on a Linux system, the source code is UTF-8, which is why x and y are the same thing. Those bytes are shunted, unmodified, into the standard output by std::cout << x and std::cout << y, and when you view the web page (or when you look at the Linux terminal), you see the character as you expected.

std::wcout << z will print if you do two things:

std::ios::sync_with_stdio(false);
std::wcout.imbue(std::locale("en_US.utf8"));

Without unsyncing from C, GNU libstdc++ goes through the C I/O streams, which can never print a wide character after printing a narrow character on the same stream. LLVM libc++ appears to work even when synced, but of course it still needs the imbue to tell the stream how to convert the wide characters to the bytes it sends to the standard output.

To print b and a, you will have to convert them to wide or narrow characters; even with wbuffer_convert, setting up a char32_t stream is a lot of work. It would look like this:

std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv32;
std::cout << conv32.to_bytes(a) << '\n';

Putting it all together: http://coliru.stacked-crooked.com/a/a809c38e21cc1743
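
For reference, a self-contained sketch along the lines of the linked example (the sample character U+4F60 and the locale name are assumptions; std::codecvt_utf8 and friends are deprecated since C++17 but still available):

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::ios::sync_with_stdio(false);                      // unsync from C stdio
    std::wcout.imbue(std::locale("en_US.utf8"));           // wide stream converts to UTF-8

    wchar_t  z = L'\u4F60';
    char16_t b = u'\u4F60';
    char32_t a = U'\u4F60';

    std::wcout << z << L'\n';                              // wide path, after the imbue

    // Narrow path: convert the char16_t/char32_t values to UTF-8 bytes first.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv16;
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv32;
    std::cout << conv16.to_bytes(b) << '\n';
    std::cout << conv32.to_bytes(a) << '\n';
}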

Why does `std::basic_ifstream<char16_t>` not work in C++11?

The various stream classes need a set of definitions to be operational. The standard library requires the relevant definitions and objects only for char and wchar_t but not for char16_t or char32_t. Off the top of my head the following is needed to use std::basic_ifstream<cT> or std::basic_ofstream<cT>:

  1. std::char_traits<cT> to specify how the character type behaves. This template is specialized for char16_t and char32_t (see the sketch after this list).
  2. The used std::locale needs to contain an instance of the std::num_put<cT> facet to format numeric types. This facet can just be instantiated and a new std::locale containing it can be created but the standard doesn't mandate that it is present in a std::locale object.
  3. The used std::locale needs to contain an instance of the facet std::num_get<cT> to read numeric types. Again, this facet can be instantiated but isn't required to be present by default.
  4. The facet std::numpunct<cT> needs to be specialized and put into the used std::locale to deal with decimal points, thousands separators, and textual boolean values. Even if it isn't really used, it will be referenced from the numeric formatting and parsing functions. There is no ready specialization for char16_t or char32_t.
  5. The facet std::ctype<cT> needs to be specialized and put into the used std::locale to support widening, narrowing, and classification of the character type. There is no ready specialization for char16_t or char32_t.
  6. The facet std::codecvt<cT, char, std::mbstate_t> needs to be specialized and put into the used std::locale to convert between external byte sequences and internal "character" sequences. There is no ready specialization for char16_t or char32_t.
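
As a quick sketch of what, per item 1, already works out of the box, in contrast to the missing facets (any C++11 library should accept this):

#include <iostream>
#include <string>

int main()
{
    const char16_t s[] = u"abc";
    std::cout << std::char_traits<char16_t>::length(s) << '\n';   // 3: the traits are specialized

    std::u16string t(s);                                          // basic_string<char16_t> works
    std::cout << t.size() << '\n';                                // 3

    // What is missing are the facets listed above: without them, a
    // std::basic_ifstream<char16_t> cannot convert, format, or parse anything.
}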

Most of the facets are reasonably easy to do: they just need to forward a simple conversion or do table look-ups. However, the std::codecvt facet tends to be rather tricky, especially because std::mbstate_t is an opaque type from the point of view of the standard C++ library.

All of that can be done. It has been a while since I last did a proof-of-concept implementation for a character type. It took me about a day's worth of work. Of course, I knew what I needed to do when I embarked on the work, having implemented the locales and IOStreams library before. To add a reasonable amount of tests, rather than merely having a simple demo, would probably take me a week or so (assuming I can actually concentrate on this work).

Fully emulating missing distinct builtin types (specifically: char16_t and char32_t)

I don't think you will get the initialization to work because there isn't much scope to get it to work. The problem is that the initialization you are using in your example isn't supposed to work: the string literal u"..." yields an array of char16_t const objects and you want to initialize a pointer with it:

char16_t const* c16 = u"...";

Also, a compiler without an implementation of char16_t is very unlikely to support char16_t string literals. The best you could achieve is to play macro tricks which are intended to do the Right Thing. For now, you'd use e.g. wide character literals, and when you get a compiler which supports char16_t you just change the macro to use char16_t literals. Even for this to work you might need to use a record type which is bigger than 16 bits, because wchar_t uses 32 bits on some platforms.

#define CONCAT(a,b) a##b

#if defined(HAS_C16)
# define C16S(s) CONCAT(u,s)
#else
# define C16S(s) reinterpret_cast<char16_t const*>(CONCAT(L,s))
struct char16_t
{
    unsigned short value;
};
#endif

int main()
{
    char16_t const* c16 = C16S("...");
}

Obviously, you still need to provide all kinds of operators, e.g. to make integer arithmetic and suitable conversions work.
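
As an illustration only, a sketch of the kind of operators meant here, building on the emulated struct from the snippet above (the helper name to_uint16 is made up):

#if !defined(HAS_C16)
// Comparisons so the emulated char16_t behaves like a value type.
inline bool operator==(char16_t lhs, char16_t rhs) { return lhs.value == rhs.value; }
inline bool operator!=(char16_t lhs, char16_t rhs) { return !(lhs == rhs); }
inline bool operator< (char16_t lhs, char16_t rhs) { return lhs.value < rhs.value; }

// Explicit access to the underlying value for integer arithmetic
// (a real char16_t would convert implicitly).
inline unsigned short to_uint16(char16_t c) { return c.value; }
#endif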


