What Does '<Cuchar>' Provide, and Where Is It Documented

What does <cuchar> provide, and where is it documented?

These functions were described in a WG21 paper from 2005, but the description is not present in the final standard. They are documented in ISO/IEC TR 19769:2004, Extensions for the programming language C to support new character data types (draft), which the C++11 standard refers to.

The text is too long to post here, but these are the signatures:

size_t mbrtoc16(char16_t * pc16, const char * s, size_t n, mbstate_t * ps);
size_t c16rtomb(char * s, char16_t c16, mbstate_t * ps);
size_t mbrtoc32(char32_t * pc32, const char * s, size_t n, mbstate_t * ps);
size_t c32rtomb(char * s, char32_t c32, mbstate_t * ps);

The functions convert between multibyte characters and UTF-16 characters (the c16 pair) or UTF-32 characters (the c32 pair), similar to mbrtowc. There are no non-reentrant versions, and honestly, who needs them?
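As a rough sketch of how mbrtoc32 might be driven in a loop: the helper name decode_multibyte is made up for illustration, and the input is assumed to be in whatever multibyte encoding the current C locale uses (so plain ASCII works in the default "C" locale, while UTF-8 input needs a UTF-8 locale to be set first).

```cpp
#include <cstddef>
#include <cstring>
#include <cuchar>
#include <cwchar>
#include <string>

// Hypothetical helper: decode a null-terminated multibyte string (in the
// current C locale's encoding) into UTF-32. Returns false on an invalid
// or truncated sequence.
bool decode_multibyte(const char* s, std::u32string& out)
{
    std::mbstate_t state{};                 // zero-initialized = initial state
    std::size_t remaining = std::strlen(s);

    while (remaining > 0) {
        char32_t c32;
        std::size_t rc = std::mbrtoc32(&c32, s, remaining, &state);

        if (rc == static_cast<std::size_t>(-1) ||   // invalid sequence
            rc == static_cast<std::size_t>(-2))     // incomplete sequence
            return false;

        if (rc == static_cast<std::size_t>(-3)) {   // output from earlier input,
            out.push_back(c32);                     // no bytes consumed
            continue;
        }
        if (rc == 0)                                // decoded a null character
            break;

        out.push_back(c32);
        s += rc;
        remaining -= rc;
    }
    return true;
}
```

The mbstate_t object is what makes the functions reentrant: the caller owns the conversion state instead of the library keeping it in a hidden static.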

Where are the fields documented for the unicode.org file UnicodeData.txt?

update

Sorry, I misread the question. Still, I think the information is in the link you provided, under the section UnicodeData.txt. For each field, a link inside the document lists its values, if applicable. It seems to be the same list as in the 3.0 version.

clang: converting const char16_t* (UTF-16) to wstring (UCS-4)

Two errors:

1) The from_bytes() overload that takes a single const char* expects a null-terminated byte string, but your second byte is already '\0'.

2) your system is likely little-endian, so you need to convert from UTF-16LE to UCS-4:

#include <iostream>
#include <locale>
#include <memory>
#include <codecvt>
#include <string>

using namespace std;

int main()
{
    u16string s;

    s.push_back('h');
    s.push_back('e');
    s.push_back('l');
    s.push_back('l');
    s.push_back('o');

    wstring_convert<codecvt_utf16<wchar_t, 0x10ffff, little_endian>,
                     wchar_t> conv;
    wstring ws = conv.from_bytes(
                     reinterpret_cast<const char*> (&s[0]),
                     reinterpret_cast<const char*> (&s[0] + s.size()));

    wcout << ws << endl;

    return 0;
}

Tested with Visual Studio 2010 SP1 on Windows and Clang++/libc++-svn on Linux.

UTF-8-compliant IOstreams

Your question doesn't quite work. UTF-8 is a specific encoding, while wchar_t is a data type. Moreover, wchar_t is intended by the standard to represent the system's character set, but this is left entirely to the platform, and the standard makes no requirements.

Therefore, the correct thing to ask for is first of all a conversion between the system's narrow, multibyte encoding and the system's fixed-width encoding held in a wide string. This functionality is provided by std::mbstowcs and std::wcstombs. There may also be a locale facet somewhere that wraps this, but that's a bit of a niche area of the library.
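A minimal sketch of that round trip, assuming the input is valid in the current C locale's encoding and contains no embedded nulls. The helper names widen and narrow are invented for this example.

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: narrow multibyte string -> wide string.
std::wstring widen(const std::string& bytes)
{
    // One wide character per input byte is always enough room.
    std::vector<wchar_t> buf(bytes.size() + 1);
    std::size_t n = std::mbstowcs(buf.data(), bytes.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");
    return std::wstring(buf.data(), n);
}

// Hypothetical helper: wide string -> narrow multibyte string.
std::string narrow(const std::wstring& wide)
{
    // MB_CUR_MAX is the longest multibyte character in the current locale.
    std::vector<char> buf(wide.size() * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs(buf.data(), wide.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("character not representable in this locale");
    return std::string(buf.data(), n);
}
```

Both functions depend on the global C locale, so a program that wants anything other than the default "C" locale has to call std::setlocale before using them.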

If you want to convert between the opaque "system's encoding" prescribed by the standard and a definite encoding prescribed by your serialized data source/sink, you need an extra library. I'd recommend POSIX's iconv(), which is widely available. (The Windows API has a different approach and offers special functions for conversion.)
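For illustration, here is roughly what a small wrapper around iconv() could look like. The function name transcode is hypothetical, the encoding names accepted depend on the iconv implementation, and the const_cast assumes the char** flavor of the iconv prototype (as in glibc). On some platforms you also need to link against a separate libiconv.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: convert the bytes of `in` from encoding `from`
// to encoding `to` and return the converted bytes.
std::string transcode(const std::string& in, const char* from, const char* to)
{
    iconv_t cd = iconv_open(to, from);      // note the order: to, then from
    if (cd == reinterpret_cast<iconv_t>(-1))
        throw std::runtime_error("iconv_open: unsupported conversion");

    std::string out;
    char buf[256];
    char* src = const_cast<char*>(in.data());
    std::size_t src_left = in.size();

    while (src_left > 0) {
        char* dst = buf;
        std::size_t dst_left = sizeof buf;
        std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
        out.append(buf, sizeof buf - dst_left);
        // E2BIG only means the output buffer filled up; loop and continue.
        if (rc == static_cast<std::size_t>(-1) && errno != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv: invalid or incomplete input");
        }
    }

    // Flush any pending shift sequence for stateful target encodings.
    char* dst = buf;
    std::size_t dst_left = sizeof buf;
    iconv(cd, nullptr, nullptr, &dst, &dst_left);
    out.append(buf, sizeof buf - dst_left);

    iconv_close(cd);
    return out;
}
```

The result is returned as raw bytes in a std::string, because the target encoding is only known at run time.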

C++11 alleviates the issue slightly by adding an explicit family of UTF-encoded string types and literals, and presumably also transcoding facilities among those (though I've never seen them implemented by anyone).
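A short sketch of the new string types and literals, using U+1F600 as an arbitrary example of a code point outside the Basic Multilingual Plane. The variable names are made up.

```cpp
#include <string>

// The same code point written as a UTF-16 and as a UTF-32 literal.
const std::u16string grin16 = u"\U0001F600";  // stored as a surrogate pair
const std::u32string grin32 = U"\U0001F600";  // stored as one code unit
```

Here grin16.size() is 2 (the code units 0xD83D and 0xDE00), while grin32.size() is 1, which shows that size() counts code units rather than characters.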

Here's my standard response from past posts on the subject: Q1, Q2, Q3. C++11 will be a joy once it's fully available :-)

Working with unicode strings as std::vector<unsigned short>

libiconv, ICU, UTF8-CPP, and others can do this. AFAIK, C++ does not have a portable way to convert between UTF-8/16/32. Keep in mind that std::wstring is UTF-16 on some systems, and UTF-32 on others.
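If pulling in a library is not an option, the UTF-16 to UTF-32 direction is small enough to write by hand. This is only a sketch, not what any of the libraries above do internally: the function name utf16_to_utf32 is made up, and it assumes each unsigned short holds exactly one UTF-16 code unit in native byte order.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: decode UTF-16 code units into UTF-32 code points.
std::vector<std::uint32_t> utf16_to_utf32(const std::vector<unsigned short>& in)
{
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t u = in[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // High surrogate: it must be followed by a low surrogate.
            if (i + 1 >= in.size())
                throw std::runtime_error("truncated surrogate pair");
            std::uint32_t lo = in[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF)
                throw std::runtime_error("unpaired high surrogate");
            out.push_back(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00));
            ++i;  // the low surrogate has been consumed as well
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            throw std::runtime_error("unpaired low surrogate");
        } else {
            out.push_back(u);  // BMP code point, copied unchanged
        }
    }
    return out;
}
```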
