What's "Wrong" With C++ wchar_t and wstrings? What Are Some Alternatives to Wide Characters?


What is wchar_t?

wchar_t is defined such that any locale's char encoding can be converted to a wchar_t representation where every wchar_t represents exactly one codepoint:

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1).

                                                                               — C++ [basic.fundamental] 3.9.1/5

This does not require that wchar_t be large enough to represent any character from all locales simultaneously. That is, the encoding used for wchar_t may differ between locales. Which means that you cannot necessarily convert a string to wchar_t using one locale and then convert back to char using another locale.1

Since using wchar_t as a common representation between all locales seems to be the primary use for wchar_t in practice, you might wonder what it's good for if not that.

The original intent and purpose of wchar_t was to make text processing simple by requiring a one-to-one mapping from a string's code units to the text's characters, allowing the same simple algorithms used with ASCII strings to work with other languages.

Unfortunately the wording of wchar_t's specification assumes a one-to-one mapping between characters and codepoints to achieve this. Unicode breaks that assumption2, so you can't safely use wchar_t for simple text algorithms either.

This means that portable software cannot use wchar_t either as a common representation for text between locales, or to enable the use of simple text algorithms.

What use is wchar_t today?

Not much, for portable code anyway. If __STDC_ISO_10646__ is defined then values of wchar_t directly represent Unicode codepoints with the same values in all locales. That makes it safe to do the inter-locale conversions mentioned earlier. However, you can't rely on it alone to decide that you can use wchar_t this way because, while most Unix platforms define it, Windows does not, even though Windows uses the same wchar_t encoding in all locales.

The reason Windows doesn't define __STDC_ISO_10646__ is that Windows uses UTF-16 as its wchar_t encoding, and UTF-16 uses surrogate pairs to represent codepoints greater than U+FFFF, so it doesn't satisfy the requirements for __STDC_ISO_10646__.

For platform specific code wchar_t may be more useful. It's essentially required on Windows (e.g., some files simply cannot be opened without using wchar_t filenames), though Windows is the only platform where this is true as far as I know (so maybe we can think of wchar_t as 'Windows_char_t').

In hindsight wchar_t is clearly not useful for simplifying text handling, or as storage for locale independent text. Portable code should not attempt to use it for these purposes. Non-portable code may find it useful simply because some API requires it.

Alternatives

The alternative I like is to use UTF-8 encoded C strings, even on platforms not particularly friendly toward UTF-8.

This way one can write portable code using a common text representation across platforms, use standard datatypes for their intended purpose, get the language's support for those types (e.g. string literals, though some tricks are necessary to make it work for some compilers), some standard library support, debugger support (more tricks may be necessary), etc. With wide characters it's generally harder or impossible to get all of this, and you may get different pieces on different platforms.

One thing UTF-8 does not provide is the ability to use simple text algorithms of the kind possible with ASCII. In this respect UTF-8 is no worse than any other Unicode encoding. In fact it may be considered better, because multi-code-unit representations are more common in UTF-8, so bugs in code handling such variable-width representations of characters are more likely to be noticed and fixed than if you stick to UTF-32 with NFC or NFKC.

Many platforms use UTF-8 as their native char encoding and many programs do not require any significant text processing, and so writing an internationalized program on those platforms is little different from writing code without considering internationalization. Writing more widely portable code, or writing on other platforms requires inserting conversions at the boundaries of APIs that use other encodings.

Another alternative used by some software is to choose a cross-platform representation, such as unsigned short arrays holding UTF-16 data, and then to supply all the library support and simply live with the costs in language support, etc.

C++11 adds new kinds of wide characters as alternatives to wchar_t, char16_t and char32_t with attendant language/library features. These aren't actually guaranteed to be UTF-16 and UTF-32, but I don't imagine any major implementation will use anything else. C++11 also improves UTF-8 support, for example with UTF-8 string literals so it won't be necessary to trick VC++ into producing UTF-8 encoded strings (although I may continue to do so rather than use the u8 prefix).

Alternatives to avoid

TCHAR: TCHAR is for migrating ancient Windows programs that assume legacy encodings from char to wchar_t, and is best forgotten unless your program was written in some previous millennium. It's not portable and is inherently unspecific about its encoding and even its data type, making it unusable with any non-TCHAR based API. Since its purpose is migration to wchar_t, which we've seen above isn't a good idea, there is no value whatsoever in using TCHAR.



1. Characters which are representable in wchar_t strings but which are not supported in any locale are not required to be represented with a single wchar_t value. This means that wchar_t could use a variable-width encoding for certain characters, another clear violation of the intent of wchar_t. It's arguable that a character being representable by wchar_t is enough to say that the locale 'supports' that character, in which case variable-width encodings aren't legal and Windows' use of UTF-16 is non-conformant.

2. Unicode allows many characters to be represented with multiple code points, which creates the same problems for simple text algorithms as variable width encodings. Even if one strictly maintains a composed normalization, some characters still require multiple code points. See: http://www.unicode.org/standard/where/

When should we prefer wide-character strings?

If we're talking about Windows, then I'd use std::wstring (because we often want the convenient string features), or wchar_t* if you're just passing strings around.

Note that Microsoft recommends this here: Working with Strings

Windows natively supports Unicode strings for UI elements, file names,
and so forth. Unicode is the preferred character encoding, because it
supports all character sets and languages. Windows represents Unicode
characters using UTF-16 encoding, in which each character is encoded
as a 16-bit value. UTF-16 characters are called wide characters, to
distinguish them from 8-bit ANSI characters. The Visual C++ compiler
supports the built-in data type wchar_t for wide characters.

Also:

When Microsoft introduced Unicode support to Windows, it eased the
transition by providing two parallel sets of APIs, one for ANSI
strings and the other for Unicode strings. [...] Internally, the ANSI
version translates the string to Unicode.

Also:

New applications should always call the Unicode versions. Many world
languages require Unicode. If you use ANSI strings, it will be
impossible to localize your application. The ANSI versions are also
less efficient, because the operating system must convert the ANSI
strings to Unicode at run time. [...] Most newer APIs in Windows have
just a Unicode version, with no corresponding ANSI version.

How to assemble a string of wide characters with some null ones inserted in the middle of it?

You can push_back a null char into a std::wstring as you build it.

Example:

std::wstring str;
str += L"DSN=NiceDB";
str.push_back(L'\0');
str += L"DBQ=C:\\Users\\who\\AppData\\Local\\NiceApp\\niceDB.accdb";
str.push_back(L'\0');

You can also manually append the null char using the += operator:

std::wstring str;
str += L"DSN=NiceDB";
str += L'\0';
str += L"DBQ=C:\\Users\\who\\AppData\\Local\\NiceApp\\niceDB.accdb";
str += L'\0';

You can also just tell the append method to copy wcslen(...) + 1 characters of the string. That will implicitly include in the std::wstring the null char already present in the source:

std::wstring str;
const wchar_t* header = L"DSN=NiceDB";
const wchar_t* footer = L"DBQ=C:\\Users\\who\\AppData\\Local\\NiceApp\\niceDB.accdb";

str.append(header, wcslen(header) + 1);
str.append(footer, wcslen(footer) + 1);

Then to get the pointer to the start of the final string:

LPCWSTR wcAttrs = str.c_str();

The pointer returned by .c_str() is only valid for the lifetime of the backing wstring. Don't let the wstring instance go out of scope while there's still something referencing wcAttrs.

C++: wide characters outputting incorrectly?

You need to set the locale:

#include <cstdio>
#include <string>
#include <locale>
#include <iostream>

using namespace std;

int main()
{
    std::locale::global(std::locale(""));
    wstring japan = L"日本";
    wstring message = L"Welcome! Japan is ";

    message += japan;

    wprintf(L"%ls\n", message.c_str()); // never pass text as the format string itself
    wcout << message << endl;
}

This works as expected (i.e. it converts the wide string to the narrow system encoding, such as UTF-8, and prints it).

When you set the global locale to "", you select the system locale; if that locale uses UTF-8, the wstring will be converted and printed out as UTF-8.

Edit: forget what I said about sync_with_stdio; that was not correct. The streams are synchronized by default, so it isn't needed.

Width of wide character strings

Here's a GitHub repo that claims to offer a platform-independent library to resolve this: https://github.com/joshuarubin/wcwidth9

Archived link: http://archive.is/C5UAF


