Is wchar_t Needed for Unicode Support?

Is wchar_t needed for Unicode support?

No.

Technically, no. Unicode is a standard that defines code points and it does not require a particular encoding.

So you could use Unicode with the UTF-8 encoding, and then everything would fit in one or a short sequence of char objects, and strings would even still be null-terminated.

The problem with UTF-8 and UTF-16 is that s[i] is no longer necessarily a character; it might be just a piece of one, whereas with sufficiently wide characters you can preserve the abstraction that s[i] is a single character, though that alone does not make strings fixed-length under various transformations.
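For example, here is a small sketch of the difference (the bytes of "é" are written as explicit escapes, so the example does not depend on the source file's encoding):

#include <cstring>
#include <iostream>

int main() {
    // "é" (U+00E9) encoded as UTF-8 occupies two char elements: 0xC3 0xA9,
    // so utf8[3] is only a fragment of a character.
    const char utf8[] = "caf\xC3\xA9";                            // "café" in UTF-8
    std::cout << std::strlen(utf8) << '\n';                       // 5 bytes, but 4 characters

    // With a 32-bit code unit, s[i] is a whole code point again.
    const char32_t utf32[] = U"caf\u00E9";
    std::cout << (sizeof(utf32) / sizeof(char32_t)) - 1 << '\n';  // 4 code points
}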

32-bit integers are at least wide enough to solve the code point problem but they still don't handle corner cases, e.g., upcasing something can change the number of characters.

So it turns out that the x[i] problem is not completely solved even by char32_t, and those other encodings make poor file formats.
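To make the upcasing corner case concrete: the one-to-one wide-character functions simply cannot express a mapping such as U+00DF ("ß") to "SS". A minimal sketch (results vary by platform and locale):

#include <clocale>
#include <cwctype>
#include <iostream>

int main() {
    std::setlocale(LC_ALL, "");   // pick up the environment's locale

    // The correct uppercase of U+00DF LATIN SMALL LETTER SHARP S is the
    // two-character string "SS", but towupper can only return a single
    // code point, so it typically returns the input unchanged (or, on
    // newer systems, the single capital letter U+1E9E).
    wchar_t sharp_s = L'\u00DF';
    std::wcout << std::hex
               << static_cast<unsigned long>(std::towupper(sharp_s)) << L'\n';
}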

Your implied point, then, is quite valid: wchar_t is a failure, partly because Windows made it only 16 bits, and partly because it didn't solve every problem and was horribly incompatible with the byte stream abstraction.

Is wchar_t needed to support Unicode when using the Windows API?

I am not sure if I should compile with the multi-byte character set
and use all the "A" (ANSI) methods, or if I should compile with
the Unicode character set and use all the wide-character methods.

  • If you want to use Unicode, you must use the wide "W" version
    of the functions along with wchar_t.
  • If you want to use the multi-byte character set (MBCS), you must use the "A" version of the functions with char.
  • The "A" versions of the functions generally do not support all Unicode characters; essentially they widen their input parameters using MultiByteToWideChar(CP_ACP), call their "W" counterparts, and finally convert the result back to multi-byte using WideCharToMultiByte(CP_ACP).

Thus, I would advise you to use the "W" version of the functions assuming you are coding for Windows NT and above.

In the context of the Win32 API, Microsoft's "Unicode" generally refers to UTF-16.
Also, Microsoft's MBCS is essentially DBCS, and on the Windows 9x operating systems the "W" versions of the functions are missing.
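For example, a minimal sketch (MessageBox is used only because it is easy to demonstrate; any function with "A"/"W" variants behaves the same way):

#include <windows.h>

int main() {
    // "W" entry point: takes wchar_t (UTF-16) strings and can display any
    // Unicode text regardless of the system's ANSI code page.
    MessageBoxW(nullptr, L"\u0417\u0434\u0440\u0430\u0432\u0435\u0439, \u4E16\u754C", L"Unicode", MB_OK);

    // "A" entry point: takes char strings interpreted in the current ANSI
    // code page (CP_ACP); characters outside that code page are lost.
    MessageBoxA(nullptr, "Hello", "ANSI", MB_OK);
    return 0;
}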

Should I use wchar_t when using UTF-8?

No, you should not! The Unicode 4.0 standard (ISO 10646:2003) notes that:

The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text.

Under most circumstances, the "character nature" of UTF-8 text will not be relevant to your program, so treating it as an array of char elements, just like any other string, will be sufficient. If you do need to extract individual characters, though, they should be stored in a type at least 21 bits wide (in practice a 32-bit type such as uint32_t) in order to accommodate all Unicode code points.
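If you do need individual code points, here is a minimal decoding sketch (validation of overlong forms, truncated sequences, and surrogates is deliberately omitted):

#include <cstddef>
#include <cstdint>

// Decode the UTF-8 sequence starting at s into a single code point.
// Returns the number of bytes consumed, or 0 on an invalid leading byte.
std::size_t decode_utf8(const char* s, std::uint32_t& cp) {
    const unsigned char b0 = static_cast<unsigned char>(s[0]);
    std::size_t len;
    if      (b0 < 0x80) { cp = b0; return 1; }          // ASCII fast path
    else if (b0 < 0xC0) { return 0; }                   // continuation byte cannot start a sequence
    else if (b0 < 0xE0) { cp = b0 & 0x1F; len = 2; }
    else if (b0 < 0xF0) { cp = b0 & 0x0F; len = 3; }
    else                { cp = b0 & 0x07; len = 4; }
    for (std::size_t i = 1; i < len; ++i)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
    return len;
}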

Is handling Unicode characters with wchar_t good? Doesn't it cause any problems?

You are confusing two different things:

  1. Storage

    How you store the bytes that make up your text string. Will that be in an array of char (single-byte) values, or in an array of wider wchar_t values?

  2. Encoding

    Your computer (and you!) needs to know what to do with the values in those bytes. What do they mean? Regardless of storage, they could be ASCII, some code page, UTF-8, UTF-16, UTF-32, Klingon, anything.

Usually, for historical reasons, we pick char for single-byte encodings (e.g. ASCII) and UTF-8, and wchar_t for UTF-16 (particularly on Windows, which has 16-bit wchar_ts and generally assumes this combination throughout its API — note that it inaccurately calls this simply "Unicode").

Performance doesn't really come into it, though you'll save time and energy converting between different encodings if you pick one and stick to it (and use a storage mechanism that fits the string libraries you're using). Sometimes your OS will help determine that choice, but we can't tell you what it will be.

Similarly, your statements about what "works" and "doesn't work" are very vague, and likely false.

We can't say what's "ok" without knowing the requirements of your project, and what sort of computer it'll run on, and with what technologies. I will, though, make a tremendous generalisation: in the olden days, you might have used Mazovia encoding, an altered codepage that included Polish characters; nowadays, you probably want to make portability and interchange as easy as possible (because why not?!), so you'd be encouraged to stick with UTF-16 over wchar_t on Windows, and UTF-8 over char otherwise.

(From C++20 we'll also have char8_t, a storage mechanism specifically designed to signify that it stores UTF-8-encoded data; however, it's going to be some time before you see this in widespread use, if at all. You can read more about C++'s character types on cppreference.com's article about "Fundamental types")
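To make the storage/encoding split concrete, here is a small sketch using the literal prefixes C++ already provides (the char array is filled with explicit UTF-8 bytes rather than a u8 literal, whose element type changed to char8_t in C++20; the escapes keep the example independent of the source file's encoding):

#include <cstring>
#include <iostream>

int main() {
    // The same five characters ("grüße") with different storage choices:
    const char     utf8[]  = "gr\xC3\xBC\xC3\x9F" "e";  // char code units holding UTF-8 bytes
    const wchar_t  wide[]  = L"gr\u00FC\u00DFe";         // wchar_t: UTF-16 on Windows, UTF-32 on most Unixes
    const char16_t utf16[] = u"gr\u00FC\u00DFe";         // char16_t code units, UTF-16
    const char32_t utf32[] = U"gr\u00FC\u00DFe";         // char32_t code units, UTF-32

    std::cout << std::strlen(utf8) << ' '                         // 7 bytes
              << (sizeof(utf16) / sizeof(char16_t)) - 1 << ' '    // 5 code units
              << (sizeof(utf32) / sizeof(char32_t)) - 1 << '\n';  // 5 code points
    (void)wide;
}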

Why was wchar_t invented?

Why is wchar_t needed? How is it superior to short (or __int16 or whatever)?

In the C++ world, wchar_t is a distinct built-in type (in C it is merely a typedef for an integer type), so you can overload functions on it. For example, this makes it possible to output wide characters as characters rather than as their numerical values. In VC6, where wchar_t was just a typedef for unsigned short, this code

wchar_t wch = L'A';
std::wcout << wch;

would output 65 because

std::basic_ostream<wchar_t>::operator<<(unsigned short)

was invoked. In newer VC versions wchar_t is a distinct type, so

operator<<(std::basic_ostream<wchar_t>&, wchar_t)

is called, and that outputs A.
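The same mechanism can be shown without the stream machinery; a small sketch with two hypothetical print overloads:

#include <iostream>

void print(unsigned short n) { std::wcout << L"number: "    << n << L'\n'; }
void print(wchar_t c)        { std::wcout << L"character: " << c << L'\n'; }

int main() {
    wchar_t wch = L'A';
    print(wch);                                // wchar_t is a distinct type: prints "character: A"
    print(static_cast<unsigned short>(wch));   // prints "number: 65"
}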

What is the use of wchar_t in general programming?

wchar_t is intended for representing text in fixed-width, multi-byte encodings; since wchar_t is usually 2 bytes in size it can be used to represent text in any 2-byte encoding. It can also be used for representing text in variable-width multi-byte encodings of which the most common is UTF-16.

On platforms where wchar_t is 4 bytes in size it can be used to represent any text using UCS-4 (Unicode), but since on most platforms it's only 2 bytes it can only represent Unicode in a variable-width encoding (usually UTF-16). It's more common to use char with a variable-width encoding e.g. UTF-8 or GB 18030.

About the only modern operating system to use wchar_t extensively is Windows; this is because Windows adopted Unicode before it was extended past U+FFFF and so a fixed-width 2-byte encoding (UCS-2) appeared sensible. Now UCS-2 is insufficient to represent the whole of Unicode and so Windows uses UTF-16, still with wchar_t 2-byte code units.
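A small sketch using char16_t, which has the same 16-bit code units as Windows' wchar_t, shows the consequence: a code point above U+FFFF occupies two code units (a surrogate pair):

#include <iostream>

int main() {
    // U+1F600 lies above U+FFFF, so UTF-16 needs a surrogate pair for it.
    const char16_t s[] = u"\U0001F600";
    std::cout << (sizeof(s) / sizeof(char16_t)) - 1 << '\n';   // 2 code units for 1 code point
    std::cout << std::hex << static_cast<unsigned>(s[0]) << ' '
              << static_cast<unsigned>(s[1]) << '\n';          // d83d de00
}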

Is wchar_t useful for the Windows API anymore?

It is definitely useful and the only way to correctly handle arbitrary path names (since they are allowed to contain characters outside the current code page). The choice of UTF-16 is often criticized (with good reason), but that's irrelevant. The OS uses it, so you have to use it, too. The best you can do is to always call the wide character versions of the WINAPI functions (e.g. CreateFileW) and use UTF-8 internally in your program. Yes, that means converting back and forth, but that usually isn't a performance bottleneck.

I strongly recommend the UTF-8 Manifesto which explains why objectively this is the best way to go.

Portability, cross-platform interoperability and simplicity are more
important than interoperability with existing platform APIs. So, the
best approach is to use UTF-8 narrow strings everywhere and convert
them back and forth when using platform APIs that don’t support UTF-8
and accept wide strings (e.g. Windows API). Performance is seldom an
issue of any relevance when dealing with string-accepting system APIs
(e.g. UI code and file system APIs), and there is a great advantage to
using the same encoding everywhere else in the application, so we see
no sufficient reason to do otherwise.
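A minimal sketch of that back-and-forth, using a hypothetical widen() helper built on MultiByteToWideChar (error handling is reduced to returning an empty string):

#include <windows.h>
#include <string>

// Convert an internal UTF-8 string to UTF-16 just before calling a "W" API.
std::wstring widen(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                utf8.data(), static_cast<int>(utf8.size()), nullptr, 0);
    if (n <= 0) return std::wstring();
    std::wstring wide(n, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), static_cast<int>(utf8.size()), &wide[0], n);
    return wide;
}

int main() {
    // A path kept as UTF-8 internally ("C:\temp\пример.txt"), widened only at the boundary.
    std::string path_utf8 = "C:\\temp\\\xD0\xBF\xD1\x80\xD0\xB8\xD0\xBC\xD0\xB5\xD1\x80.txt";
    HANDLE h = CreateFileW(widen(path_utf8).c_str(), GENERIC_READ, FILE_SHARE_READ,
                           nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h != INVALID_HANDLE_VALUE) CloseHandle(h);
    return 0;
}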

Does the C++ standard mandate an encoding for wchar_t?

wchar_t is just an integral type. It has a minimum value, a maximum value, etc.

Its size is not fixed by the standard.

If it is large enough, you can store UCS-2 or UCS-4 data in a buffer of wchar_t. This is true regardless of the system you are on, as UCS-2 and UCS-4 and UTF-16 and UTF-32 are just descriptions of integer values arranged in a sequence.

In C++11, there are std APIs that read or write data presuming it has those encodings. In C++03, there are APIs that read or write data using the current locale.
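One of those C++11 APIs is std::wstring_convert combined with the codecvt_utf8 facet quoted below (both deprecated since C++17, but they illustrate the point): the facet interprets the wchar_t values as UCS2 or UCS4 and converts them to and from UTF-8 bytes. A minimal sketch:

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main() {
    // codecvt_utf8<wchar_t> interprets wchar_t as UCS2 or UCS4, depending on
    // sizeof(wchar_t), and converts to/from UTF-8 multibyte sequences.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;

    std::wstring wide = L"\u00E9l\u00E8ve";       // "élève" as wide characters
    std::string  utf8 = conv.to_bytes(wide);      // UTF-8 byte sequence
    std::wstring back = conv.from_bytes(utf8);    // interpreted back into wchar_t

    std::cout << utf8.size() << ' ' << (back == wide) << '\n';   // 7 bytes, 1
}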

22.5 Standard code conversion facets [locale.stdcvt]

3 For each of the three code conversion facets codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16:

(3.1) — Elem is the wide-character type, such as wchar_t, char16_t, or char32_t.

4 For the facet codecvt_utf8:

(4.1) — The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.

So here codecvt_utf8 deals with UTF-8 on one side, and UCS2 or UCS4 (depending on how big Elem is) on the other. It does conversion.

The Elem (the wide character) is presumed to be encoded in UCS2 or UCS4 depending on how big it is.

This does not mean that wchar_t is encoded as such, it just means this operation interprets the wchar_t as being encoded as such.

How the UCS2 or UCS4 got into the Elem is not something this part of the standard cares about. Maybe you set it in there with hex constants. Maybe you read it from io. Maybe you calculated it on the fly. Maybe you used a high-quality random-number generator. Maybe you added together the bit-values of an ASCII string. Maybe you calculated a fixed-point approximation of the log* of the number of seconds it takes the moon to change the Earth's day by 1 second. Not these paragraphs' problem. These paragraphs simply mandate how bits are modified and interpreted.

Similar claims hold in other cases. This does not mandate what format a wchar_t has; it simply states how these facets interpret wchar_t, char16_t, char32_t, or char8_t when reading or writing.

Other ways of interacting with wchar_t use different methods to mandate how the value of the wchar_t is interpreted.

iswalpha uses the (global) locale to interpret the wchar_t, for example. In some locales, the wchar_t may be UCS2. In others, it might be some insane Cthulhian encoding whose details enable you to see a new color out of space.
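A small sketch of that dependence on the installed global locale (the exact results vary by platform and available locales):

#include <clocale>
#include <cwctype>
#include <iostream>

int main() {
    // In the default "C" locale, typically only the ASCII letters count as alphabetic.
    std::cout << std::boolalpha
              << static_cast<bool>(std::iswalpha(L'\u00E9')) << '\n';   // usually false

    // After installing the environment's locale (e.g. a UTF-8 one), the same
    // wchar_t value is usually classified as a letter.
    std::setlocale(LC_ALL, "");
    std::cout << static_cast<bool>(std::iswalpha(L'\u00E9')) << '\n';   // usually true
}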

To be explicit: encodings are not the property of data, or bits. Encodings are properties of interpretation of data. Quite often there is only one proper or reasonable interpretation of data that makes any sense, but the data itself is bits.

The C++ standard does not mandate what is stored in a wchar_t. It does mandate what certain operations interpret the contents of a wchar_t to be. That section describes how some facets interpret the data in a wchar_t.


