Convert Between String, U16String & U32String

Convert between string, u16string & u32string

mbstowcs() and wcstombs() don't necessarily convert to UTF-16 or UTF-32, they convert to wchar_t and whatever the locale wchar_t encoding is. All Windows locales uses a two byte wchar_t and UTF-16 as the encoding, but the other major platforms use a 4-byte wchar_t with UTF-32 (or even a non-Unicode encoding for some locales). A platform that only supports single-byte encodings could even have a one byte wchar_t and have the encoding differ by locale. So wchar_t seems to me to be a bad choice for portability and Unicode. *

Some better options have been introduced in C++11; new specializations of std::codecvt, new codecvt classes, and a new template to make using them for conversions very convienent.

First the new template class for using codecvt is std::wstring_convert. Once you've created an instance of a std::wstring_convert class you can easily convert between strings:

std::wstring_convert<...> convert; // ... filled in with a codecvt to do UTF-8 <-> UTF-16
std::string utf8_string = u8"This string has UTF-8 content";
std::u16string utf16_string = convert.from_bytes(utf8_string);
std::string another_utf8_string = convert.to_bytes(utf16_string);

In order to do different conversion you just need different template parameters, one of which is a codecvt facet. Here are some new facets that are easy to use with wstring_convert:

std::codecvt_utf8_utf16<char16_t> // converts between UTF-8 <-> UTF-16
std::codecvt_utf8<char32_t> // converts between UTF-8 <-> UTF-32
std::codecvt_utf8<char16_t> // converts between UTF-8 <-> UCS-2 (warning, not UTF-16! Don't bother using this one)

Examples of using these:

std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t> convert;
std::string a = convert.to_bytes(u"This string has UTF-16 content");
std::u16string b = convert.from_bytes(u8"blah blah blah");

The new std::codecvt specializations are a bit harder to use because they have a protected destructor. To get around that you can define a subclass that has a destructor, or you can use the std::use_facet template function to get an existing codecvt instance. Also, an issue with these specializations is you can't use them in Visual Studio 2010 because template specialization doesn't work with typedef'd types and that compiler defines char16_t and char32_t as typedefs. Here's an example of defining your own subclass of codecvt:

template <class internT, class externT, class stateT>
struct codecvt : std::codecvt<internT,externT,stateT>
{ ~codecvt(){} };

std::wstring_convert<codecvt<char16_t,char,std::mbstate_t>,char16_t> convert16;
std::wstring_convert<codecvt<char32_t,char,std::mbstate_t>,char32_t> convert32;

The char16_t specialization converts between UTF-16 and UTF-8. The char32_t specialization, UTF-32 and UTF-8.

Note that these new conversions provided by C++11 don't include any way to convert directly between UTF-32 and UTF-16. Instead you just have to combine two instances of std::wstring_convert.

***** I thought I'd add a note on wchar_t and its purpose, to emphasize why it should not generally be used for Unicode or portable internationalized code. The following is a short version of my answer https://stackoverflow.com/a/11107667/365496

What is wchar_t?

wchar_t is defined such that any locale's char encoding can be converted to wchar_t where every wchar_t represents exactly one codepoint:

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1). -- [basic.fundamental] 3.9.1/5

This does not require that wchar_t be large enough to represent any character from all locales simultaneously. That is, the encoding used for wchar_t may differ between locales. Which means that you cannot necessarily convert a string to wchar_t using one locale and then convert back to char using another locale.

Since that seems to be the primary use in practice for wchar_t you might wonder what it's good for if not that.

The original intent and purpose of wchar_t was to make text processing simple by defining it such that it requires a one-to-one mapping from a string's code-units to the text's characters, thus allowing the use of same simple algorithms used with ascii strings to work with other languages.

Unfortunately the requirements on wchar_t assume a one-to-one mapping between characters and codepoints to achieve this. Unicode breaks that assumption, so you can't safely use wchar_t for simple text algorithms either.

This means that portable software cannot use wchar_t either as a common representation for text between locales, or to enable the use of simple text algorithms.

What use is wchar_t today?

Not much, for portable code anyway. If __STDC_ISO_10646__ is defined then values of wchar_t directly represent Unicode codepoints with the same values in all locales. That makes it safe to do the inter-locale conversions mentioned earlier. However you can't rely only on it to decide that you can use wchar_t this way because, while most unix platforms define it, Windows does not even though Windows uses the same wchar_t locale in all locales.

The reason Windows doesn't define __STDC_ISO_10646__ I think is because Windows use UTF-16 as its wchar_t encoding, and because UTF-16 uses surrogate pairs to represent codepoints greater than U+FFFF, which means that UTF-16 doesn't satisfy the requirements for __STDC_ISO_10646__.

For platform specific code wchar_t may be more useful. It's essentially required on Windows (e.g., some files simply cannot be opened without using wchar_t filenames), though Windows is the only platform where this is true as far as I know (so maybe we can think of wchar_t as 'Windows_char_t').

In hindsight wchar_t is clearly not useful for simplifying text handling, or as storage for locale independent text. Portable code should not attempt to use it for these purposes.

std::u32string conversion to/from std::string and std::u16string

If you read the documentation at CppReference.com for wstring_convert, codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16, the pages include a table that tells you exactly what you can use for the various UTF conversions.

table

And yes, you would use std::wstring_convert to facilitate the conversion between the various UTFs. Despite its name, it is not limited to just std::wstring, it actually operates with any std::basic_string type (which std::string, std::wstring, and std::uXXstring are all based on).

Class template std::wstring_convert performs conversions between byte string std::string and wide string std::basic_string<Elem>, using an individual code conversion facet Codecvt. std::wstring_convert assumes ownership of the conversion facet, and cannot use a facet managed by a locale. The standard facets suitable for use with std::wstring_convert are std::codecvt_utf8 for UTF-8/UCS2 and UTF-8/UCS4 conversions and std::codecvt_utf8_utf16 for UTF-8/UTF-16 conversions.

For example:

typedef std::string u8string;

u8string To_UTF8(const std::u16string &s)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(s);
}

u8string To_UTF8(const std::u32string &s)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(s);
}

std::u16string To_UTF16(const u8string &s)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(s);
}

std::u16string To_UTF16(const std::u32string &s)
{
    std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t> conv;
    std::string bytes = conv.to_bytes(s);
    return std::u16string(reinterpret_cast<const char16_t*>(bytes.c_str()), bytes.length()/sizeof(char16_t));
}

std::u32string To_UTF32(const u8string &s)
{
    std::wstring_convert<codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(s);
}

std::u32string To_UTF32(const std::u16string &s)
{
    const char16_t *pData = s.c_str();
    std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t> conv;
    return conv.from_bytes(reinterpret_cast<const char*>(pData), reinterpret_cast<const char*>(pData+s.length()));
}

How to convert std::string to std::u32string?

To initialise a new string:

std::u32string s32(s.begin(), s.end());

To assign to an existing string:

s32.assign(s.begin(), s.end());

If the string might contain characters outside the supported range of char, then this might cause sign-extension issues, converting negative values into large positive values. Dealing with that possibility is messier; you'll have to convert to unsigned char before widening the value.

s32.resize(s.size());
std::transform(s.begin(), s.end(), s32.begin(), 
               [](char c) -> unsigned char {return c;});

or a plain loop

s32.clear();  // if not already empty
for (unsigned char c : s) {s32 += c;}

Difference between std::string and std::u16string (or u32string)

When you do:

char16_t x[] = { 'a', 'b', 'c', 0 };

It is similar to doing this (endianness not withstanding):

char x[] = { '\0', 'a', '\0', 'b', '\0', 'c', '\0', '\0' };

Each character occupies two bytes in memory.

So when you ask for the length of a u16string each two bytes is counted as one character. They are, after all, two-byte (16bit) characters.

EDIT:

Your additional question is creating a string without a null terminator.

Try this:

char x[] = { 'a', 'b', 'c', 0 , 0, 0};
u16string arr = (char16_t*)x;

Now the first character is {'a', 'b'} the second character is {'c', 0} and you also have a null terminator character {0, 0}.

std::string, wstring, u16/32string clarification

The difference is that the details of char and wchar_t are implementation defined, while the encoding of char16_t and char32_t are explicitly defined by the C++11 standard.

This means that wstring is likely to store the same data as either u16string or u32string, but we don't know which one. And it is allowed for some odd implementation to make them all different, as the size and encoding of the old char types are just not defined by the standard.

How do you use wstring_convert to convert between utf16 and utf32?

If you do not want to reinterpret_cast, the only way I've found is to first convert to utf-8, then reconvert to utf-32.

For ex,

// Convert to utf-8.
std::u16string s;
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
std::string utf8_str = conv.to_bytes(s);

// Convert to utf-32.
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
std::u32string utf32_str = conv.from_bytes(utf8_str);

Yes this is sad and likely contributes to codecvt deprecation.

std::u16string, std::u32string, std::string, length(), size(), codepoints and characters

u16string and u32string are not "new C++11 classes". They're just typedefs of std::basic_string for char16_t and cha32_t types.

length is always equal to size for any basic_string. It is the number of T's in the string, where T is the template type for the basic_string.

basic_string is not Unicode aware in any way, shape, or form. It has no concept of codepoints, graphemes, Unicode characters, Unicode normalization, or anything of the kind. It is simply a ordered sequence of Ts. The only thing that is Unicode-aware about u16string and u32string is that they use the type returned by u"" and U"" literals. Thus, they can store Unicode-encoded strings, but they do nothing that requires knowledge of said encoding.

Iterators iterate over elements of T, not "bytes, codepoints, or characters". If T is char16_t, then it will iterate over char16_ts. If the string is UTF-16-encoded, then it is iterating over UTF-16 code units, not Unicode codepoints or bytes.

Write a std::u16string to file?

You need to create a basic_ofstream whose char type is equivalent to the u16string value type, which should be char16_t:

typedef std::basic_ofstream<char16_t> u16ofstream;

u16ofstream outfile("words.txt", std::ios_base::app);
outfile << someu16string; 
outfile.close();

Alternatively, you can convert the u16string to a regular string and just write it to a plain ofstream, but you'll have to handle the conversion and deal with the string's encoding on your own.

wofstream may also be compatible with u16string on some platforms (specifically Windows), but it's not portable.

Convert Between String, U16String & U32String