Should I Use wchar_t When Using UTF-8?

Should I use wchar_t when using UTF-8?

No, you should not! The Unicode 4.0 standard (ISO 10646:2003) notes that:

The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text.

Under most circumstances, the "character nature" of UTF-8 text will not be relevant to your program, so treating it as an array of char elements, just like any other string, will be sufficient. If you need to extract individual characters, though, those characters should be stored in a type at least 21 bits wide (in practice a 32-bit type such as char32_t or uint32_t), in order to accommodate all Unicode code points.
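
As an illustration, here is a minimal sketch of extracting the first code point of a UTF-8 string into a 32-bit value; the helper name is made up, and production code should use a vetted library and validate its input:

    #include <cstdint>
    #include <string>

    // Hypothetical helper: decode the first code point of a UTF-8 string into a
    // 32-bit value. Real code must also validate the input (overlong forms,
    // truncated sequences, surrogates, values above U+10FFFF).
    std::uint32_t first_code_point(const std::string& s) {
        const unsigned char b0 = static_cast<unsigned char>(s.at(0));
        if (b0 < 0x80) return b0;                                        // 1 byte (ASCII)
        if (b0 < 0xE0) return ((b0 & 0x1Fu) << 6) | (s.at(1) & 0x3Fu);   // 2-byte sequence
        if (b0 < 0xF0) return ((b0 & 0x0Fu) << 12)                       // 3-byte sequence
                            | ((s.at(1) & 0x3Fu) << 6)
                            |  (s.at(2) & 0x3Fu);
        return ((b0 & 0x07u) << 18) | ((s.at(1) & 0x3Fu) << 12)          // 4-byte sequence
             | ((s.at(2) & 0x3Fu) << 6) | (s.at(3) & 0x3Fu);
    }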

char vs wchar_t: when to use which data type

Fundamentally, use wchar_t when the encoding has more symbols than a char can contain.

Background
The char type has enough capacity to hold any character in the ASCII character set.

The issue is that many languages require more characters than ASCII accounts for. So, instead of 128 possible values (ASCII uses codes 0-127), more are needed. Some languages have more than 256 characters, and the char type does not guarantee a range greater than 256 values. Thus a new data type is required.

The wchar_t type, a.k.a. the wide character, provides more room for such encodings.

Summary
Use the char data type when the character set has 256 or fewer encodings, such as ASCII. Use wchar_t when you need capacity for more than 256.

Prefer Unicode to handle large character sets (such as emojis).
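
A small sketch of the difference, using explicit UTF-8 byte escapes so it does not depend on the source file's encoding:

    #include <string>

    int main() {
        std::string ascii = "hello";                   // plain ASCII fits in char

        // The emoji U+1F600 does not fit in a single char: encoded as UTF-8 it
        // occupies four char code units, while as one code point it needs char32_t.
        std::string utf8_emoji = "\xF0\x9F\x98\x80";   // U+1F600 as UTF-8 bytes
        char32_t    code_point = U'\U0001F600';        // the same character, one value
        return (utf8_emoji.size() == 4 && code_point == 0x1F600) ? 0 : 1;
    }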

In C++ when to use WCHAR and when to use CHAR

Use char and treat it as UTF-8. There are a great many reasons for this; this website summarises it much better than I can:

http://utf8everywhere.org/

It recommends converting from wchar_t to char (UTF-16 to UTF-8) as soon as you receive a string from any such library, and converting back only when you need to pass strings to it. So to answer your question: always use char, except at the point where an API requires you to pass or receive wchar_t.
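
On Windows, that boundary conversion is commonly done with the Win32 WideCharToMultiByte / MultiByteToWideChar calls; a minimal sketch (error handling omitted, helper names are my own):

    #include <string>
    #include <windows.h>

    // UTF-16 (as returned by wchar_t-based Win32 APIs) -> UTF-8 for internal use.
    std::string narrow(const std::wstring& w) {
        if (w.empty()) return {};
        int n = WideCharToMultiByte(CP_UTF8, 0, w.data(), (int)w.size(),
                                    nullptr, 0, nullptr, nullptr);
        std::string out(n, '\0');
        WideCharToMultiByte(CP_UTF8, 0, w.data(), (int)w.size(),
                            &out[0], n, nullptr, nullptr);
        return out;
    }

    // UTF-8 -> UTF-16 just before passing a string back to a wchar_t-based API.
    std::wstring widen(const std::string& s) {
        if (s.empty()) return {};
        int n = MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), nullptr, 0);
        std::wstring out(n, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), &out[0], n);
        return out;
    }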

Is wchar_t needed for unicode support?

No.

Technically, no. Unicode is a standard that defines code points and it does not require a particular encoding.

So, you could use Unicode with the UTF-8 encoding, and then everything would fit in one or a short sequence of char objects, and it would even still be null-terminated.

The problem with UTF-8 and UTF-16 is that s[i] is not necessarily a character any more; it might be just a piece of one. With sufficiently wide characters you can preserve the abstraction that s[i] is a single code point, though that still does not make strings fixed-length under various transformations.
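
For example, indexing a UTF-8-encoded std::string yields bytes, not characters:

    #include <iostream>
    #include <string>

    int main() {
        std::string s = "\xC3\xA9";        // "é" (U+00E9) encoded as UTF-8
        std::cout << s.size() << '\n';     // prints 2: two bytes, one character
        // s[0] and s[1] are each only a piece of the character, not the character itself.
    }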

32-bit integers are at least wide enough to solve the code point problem but they still don't handle corner cases, e.g., upcasing something can change the number of characters.

So it turns out that the s[i] problem is not completely solved even by char32_t, and those wider encodings make poor file formats.
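
Two of those corner cases, sketched with char32_t strings:

    #include <string>

    int main() {
        // U+0065 U+0301 ("e" + combining acute accent) is two code points but one
        // user-perceived character, so s[i] is still not "a character".
        std::u32string combined = U"e\u0301";

        // Uppercasing U+00DF ("ß") yields "SS": case mapping can change the length.
        std::u32string sharp_s = U"\u00DF";

        return (combined.size() == 2 && sharp_s.size() == 1) ? 0 : 1;
    }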

Your implied point, then, is quite valid: wchar_t is a failure, partly because Windows made it only 16 bits, and partly because it didn't solve every problem and was horribly incompatible with the byte stream abstraction.

Is handling unicode character with wchar_t good? Does it not cause any problems?

You are confusing two different things:

  1. Storage

    How you store the bytes that make up your text string. Will that be in an array of char (single-byte) values? Or will it be in the form of wchar_t (multi-byte) values?

  2. Encoding

    Your computer (and you!) needs to know what to do with the values in those bytes. What do they mean? Regardless of storage, they could be ASCII, some code page, UTF-8, UTF-16, UTF-32, Klingon, anything.

Usually, for historical reasons, we pick char for single-byte encodings (e.g. ASCII) and UTF-8, and wchar_t for UTF-16 (particularly on Windows, which has 16-bit wchar_ts and generally assumes this combination throughout its API — note that it inaccurately calls this simply "Unicode").
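
To make the storage side concrete, here is the same character stored both ways (the sizes noted assume a Windows-style 16-bit wchar_t; the UTF-8 bytes are written out explicitly):

    // "é" (U+00E9) stored as char (UTF-8) and as wchar_t (UTF-16 on Windows).
    const char    utf8_bytes[]  = "\xC3\xA9";   // 2 code units + terminating NUL = 3 bytes
    const wchar_t utf16_units[] = L"\u00E9";    // 1 code unit + terminating NUL
    static_assert(sizeof(utf8_bytes) == 3, "two UTF-8 bytes plus the terminator");
    // sizeof(utf16_units) is 4 with 2-byte wchar_t (Windows) but 8 with 4-byte wchar_t.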

Performance doesn't really come into it, though you'll save time and energy converting between different encodings if you pick one and stick to it (and use a storage mechanism that fits the string libraries you're using). Sometimes your OS will help determine that choice, but we can't tell you what it will be.

Similarly, your statements about what "works" and "doesn't work" are very vague, and likely false.

We can't say what's "ok" without knowing the requirements of your project, and what sort of computer it'll run on, and with what technologies. I will, though, make a tremendous generalisation: in the olden days, you might have used Mazovia encoding, an altered codepage that included Polish characters; nowadays, you probably want to make portability and interchange as easy as possible (because why not?!), so you'd be encouraged to stick with UTF-16 over wchar_t on Windows, and UTF-8 over char otherwise.

(From C++20 we'll also have char8_t, a storage mechanism specifically designed to signify that it stores UTF-8-encoded data; however, it's going to be some time before you see this in widespread use, if at all. You can read more about C++'s character types in cppreference.com's article on "Fundamental types".)
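
A minimal sketch of what that looks like, assuming a C++20 compiler:

    #include <string>

    int main() {
        // In C++20, u8 literals produce char8_t, and std::u8string documents in the
        // type system that the bytes are UTF-8.
        std::u8string s = u8"caf\u00E9";

        // Interfacing with char-based APIs still needs an explicit cast/copy.
        std::string bytes(reinterpret_cast<const char*>(s.data()), s.size());
        return bytes.size() == s.size() ? 0 : 1;
    }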

UTF8 vs Wide Char?

If by "wide char", you are referring to wchar_t, then you have to take into account that it is 16-bit (using UCS-2 or UTF-16) on some platforms, but is 32-bit (using UTF-32) on other platforms. So asking how to convert to/from "wide char", you first have to define what "wide char" actually means. Proper 16-bit/32-bit data types need to be used when dealing with UTF-16/32.

Pretty much any Unicode library, including utf8-cpp and ICU, has functions for converting between UTF-8 <-> UTF-16 and UTF-8 <-> UTF-32 using appropriate data types and not relying on wchar_t.
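
For example, with utf8-cpp the round trip looks roughly like this (the convenience overloads shown are from recent library versions, so treat the exact signatures as an assumption):

    #include <string>
    #include "utf8.h"   // utf8-cpp, https://github.com/nemtrif/utfcpp

    int main() {
        std::string    u8text  = "\xC3\xA9";                 // "é" in UTF-8
        std::u16string u16text = utf8::utf8to16(u8text);     // UTF-8  -> UTF-16
        std::string    back    = utf8::utf16to8(u16text);    // UTF-16 -> UTF-8
        return back == u8text ? 0 : 1;
    }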

What is the use of wchar_t in general programming?

wchar_t is intended for representing text in fixed-width, multi-byte encodings; on platforms where wchar_t is 2 bytes in size (notably Windows) it can be used to represent text in any 2-byte encoding. It can also be used for representing text in variable-width multi-byte encodings, of which the most common is UTF-16.

On platforms where wchar_t is 4 bytes in size (most Unix-like systems) it can be used to represent any text using UCS-4/UTF-32, but where it is only 2 bytes (as on Windows) it can only represent Unicode in a variable-width encoding (usually UTF-16). It's more common to use char with a variable-width encoding, e.g. UTF-8 or GB 18030.

About the only modern operating system to use wchar_t extensively is Windows; this is because Windows adopted Unicode before it was extended past U+FFFF, and so a fixed-width 2-byte encoding (UCS-2) appeared sensible. Now UCS-2 is insufficient to represent the whole of Unicode, and so Windows uses UTF-16, still with 2-byte wchar_t code units.
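
The practical consequence is that any code point outside the BMP takes two wchar_t code units on Windows; a quick sketch:

    #include <cstdio>
    #include <cwchar>

    int main() {
        // U+1F600 lies outside the BMP. With 2-byte wchar_t (Windows) this literal
        // holds a UTF-16 surrogate pair; with 4-byte wchar_t (most Unix-like
        // systems) it is a single code unit.
        const wchar_t s[] = L"\U0001F600";
        std::printf("wchar_t code units: %zu\n", std::wcslen(s));  // 2 on Windows, 1 elsewhere
    }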


