Cross-Platform C++: Use The Native String Encoding or Standardise Across Platforms

Cross-platform C++: Use the native string encoding or standardise across platforms?

… and UTF-8 in Linux.

That is mostly true for modern Linux. Actually, the encoding depends on which API or library is used. Some are hardcoded to use UTF-8, but others read the LC_ALL, LC_CTYPE or LANG environment variables to determine which encoding to use (the Qt library, for example). So be careful.
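
For illustration, here is a minimal sketch (assuming a POSIX system; setlocale and nl_langinfo are real C/POSIX calls) of how a program can ask which encoding the current locale actually uses:

    #include <clocale>     // std::setlocale
    #include <cstdio>
    #include <langinfo.h>  // nl_langinfo, CODESET (POSIX)

    int main() {
        std::setlocale(LC_ALL, "");  // adopt the locale from LC_ALL/LC_CTYPE/LANG
        // Prints e.g. "UTF-8" on most modern Linux systems, but may be
        // something else (e.g. "ISO-8859-1") depending on the environment.
        std::printf("codeset: %s\n", nl_langinfo(CODESET));
    }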

We can't decide whether the best approach …

As usual, it depends.

If 90% of the code deals with platform-specific APIs in a platform-specific way, it is obviously better to use platform-specific strings. Examples: a device driver or a native iOS application.

If 90% of the code is complex business logic shared across platforms, it is obviously better to use the same encoding on all platforms. Examples: a chat client or a browser.

In the second case you have a choice:

  • Use a cross-platform library that provides string support (Qt or ICU, for example)
  • Use bare pointers (I consider std::string a "bare pointer" too)

If working with strings is a significant part of your application, choosing a good string library is a smart move. For example, Qt has a very solid set of classes that covers 99% of common tasks. Unfortunately, I have no ICU experience, but it also looks very nice.

When using a library for strings, you need to care about encoding only when talking to external libraries or platform APIs, or when sending strings over the network (or to disk). For example, many Cocoa, C# or Qt programmers (all three have solid string support) know very little about encoding details, and that is a good thing, since they can focus on their main task.

My experience with strings is a little specific, so I personally prefer bare pointers. Code that uses them is very portable (in the sense that it can easily be reused in other projects and on other platforms) because it has fewer external dependencies. It is also extremely simple and fast (though one probably needs some experience and a Unicode background to appreciate that).

I agree that the bare-pointer approach is not for everyone. It is a good fit when:

  • You work with entire strings, and splitting, searching and comparing are rare tasks
  • You can use the same encoding in all components and need a conversion only when calling a platform API
  • All of your supported platforms have an API to:

    • Convert from your encoding to the one used by the API
    • Convert from the API encoding to the one used in your code
  • Pointers are not a problem for your team

From my (admittedly specific) experience, this is actually a very common case.

When working with bare pointers, it is good to choose an encoding that will be used across the entire project (or across all projects).

From my point of view, UTF-8 is the ultimate winner. If you can't use UTF-8, use a string library or the platform string API - it will save you a lot of time.

Advantages of UTF-8:

  • Fully ASCII compatible. Any ASCII string is a valid UTF-8 string.
  • C std library works great with UTF-8 strings. (*)
  • C++ std library works great with UTF-8 (std::string and friends). (*)
  • Legacy code works great with UTF-8.
  • Virtually every platform supports UTF-8.
  • Debugging is MUCH easier with UTF-8 (since it is ASCII compatible).
  • No Little-Endian/Big-Endian mess.
  • You will never hit the classic bug: "Oh, UTF-16 is not always 2 bytes?"

(*) Until you need to compare them lexically, transform case (toUpper/toLower), change normalization form, or something like that - if you do, use a string library or the platform API.

The disadvantages are questionable:

  • Less compact for Chinese (and other symbols with large code point values) than UTF-16.
  • A little harder to iterate over symbols (see the sketch below).
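
To illustrate the iteration point, here is a minimal hand-rolled sketch of walking a UTF-8 std::string code point by code point. It assumes the input is valid UTF-8; in a real project a string library would normally do this for you:

    #include <cstdint>
    #include <cstdio>
    #include <string>

    // Print the Unicode code points stored in a UTF-8 std::string.
    // Assumes valid UTF-8; no validation is performed.
    void print_code_points(const std::string& s) {
        for (std::size_t i = 0; i < s.size();) {
            unsigned char lead = static_cast<unsigned char>(s[i]);
            std::size_t len = 1;          // number of bytes in this code point
            std::uint32_t cp = lead;      // the decoded code point
            if      (lead >= 0xF0) { len = 4; cp = lead & 0x07; }
            else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }
            else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }
            for (std::size_t k = 1; k < len; ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
            std::printf("U+%04X\n", static_cast<unsigned>(cp));
            i += len;   // note: s.size() counts bytes, not characters
        }
    }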

So, I recommend using UTF-8 as the common encoding for any project that does not use a string library.

But encoding is not the only question you need to answer.

There is also such a thing as normalization. To put it simply, some letters can be represented in several ways - as a single code point or as a combination of code points (a base letter plus combining marks). The common problem is that most string comparison functions treat these representations as different. If you are working on a cross-platform project, choosing one normalization form as the standard is the right move. It will save you time.

For example, if a user's password contains "йёжиг", it will be represented differently (in both UTF-8 and UTF-16) when entered on a Mac (which mostly uses Normalization Form D) and on Windows (which mostly prefers Normalization Form C). So if the user registered under Windows with such a password, logging in under a Mac will be a problem.
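
A tiny illustration of the effect, using "é" instead of the Cyrillic example and C++11 u8 literals (in C++20 u8 literals change type, so this sketch assumes C++11/14/17). The two spellings look identical on screen but compare as different byte sequences:

    #include <iostream>
    #include <string>

    int main() {
        std::string nfc = u8"\u00E9";    // NFC: single code point U+00E9 (é)
        std::string nfd = u8"e\u0301";   // NFD: 'e' + U+0301 combining acute accent
        // Byte-wise comparison, as done by ==, strcmp, map keys, password checks...
        std::cout << (nfc == nfd ? "equal" : "different") << '\n';  // prints "different"
    }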

In addition, I would not recommend using wchar_t (or would use it only in Windows code, as a UCS-2/UTF-16 character type). The problem with wchar_t is that there is no encoding associated with it. It is just an abstract wide character that is larger than a normal char (16 bits on Windows, 32 bits on most *nix systems).

What's wrong with C++ wchar_t and wstrings? What are some alternatives to wide characters?

What is wchar_t?

wchar_t is defined such that any locale's char encoding can be converted to a wchar_t representation where every wchar_t represents exactly one codepoint:

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1).

                                                                               — C++ [basic.fundamental] 3.9.1/5

This does not require that wchar_t be large enough to represent any character from all locales simultaneously. That is, the encoding used for wchar_t may differ between locales, which means that you cannot necessarily convert a string to wchar_t using one locale and then convert it back to char using another locale.[1]
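
As a rough sketch of the round trip being described (using the standard C conversion functions; error handling omitted), note that both directions use whichever locale happens to be current at the time:

    #include <clocale>
    #include <cstdlib>   // std::mbstowcs, std::wcstombs, MB_CUR_MAX
    #include <string>
    #include <vector>

    // char -> wchar_t using the *current* locale's multibyte encoding.
    std::wstring to_wide(const std::string& s) {
        std::vector<wchar_t> buf(s.size() + 1);            // wide length <= byte length
        std::mbstowcs(buf.data(), s.c_str(), buf.size());
        return buf.data();
    }

    // wchar_t -> char using the *current* locale's multibyte encoding.
    std::string to_narrow(const std::wstring& w) {
        std::vector<char> buf(w.size() * MB_CUR_MAX + 1);  // worst case per wide char
        std::wcstombs(buf.data(), w.c_str(), buf.size());
        return buf.data();
    }

    // std::setlocale(LC_ALL, "A"); auto wide = to_wide(text);
    // std::setlocale(LC_ALL, "B"); auto back = to_narrow(wide);  // may not round-trip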

Since using wchar_t as a common representation between all locales seems to be the primary use of wchar_t in practice, you might wonder what it's good for if not that.

The original intent and purpose of wchar_t was to make text processing simple by defining it such that there is a one-to-one mapping from a string's code units to the text's characters, thus allowing the same simple algorithms used with ASCII strings to work with other languages.

Unfortunately, the wording of wchar_t's specification assumes a one-to-one mapping between characters and codepoints to achieve this. Unicode breaks that assumption[2], so you can't safely use wchar_t for simple text algorithms either.

This means that portable software cannot use wchar_t either as a common representation for text between locales, or to enable the use of simple text algorithms.

What use is wchar_t today?

Not much, for portable code anyway. If __STDC_ISO_10646__ is defined, then values of wchar_t directly represent Unicode codepoints with the same values in all locales. That makes it safe to do the inter-locale conversions mentioned earlier. However, you can't rely on this macro alone to decide that you can use wchar_t this way because, while most Unix platforms define it, Windows does not, even though Windows uses the same wchar_t encoding in all locales.

The reason Windows doesn't define __STDC_ISO_10646__ is that Windows uses UTF-16 as its wchar_t encoding, and UTF-16 uses surrogate pairs to represent codepoints greater than U+FFFF, which means that UTF-16 doesn't satisfy the requirements for __STDC_ISO_10646__.
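
In code this usually reduces to a compile-time check; a minimal sketch:

    #if defined(__STDC_ISO_10646__)
        // wchar_t values are Unicode code points in every locale:
        // locale-independent wchar_t processing is safe here.
    #else
        // No such guarantee (notably on Windows, where wchar_t holds
        // UTF-16 code units): treat wchar_t as platform/locale specific.
    #endif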

For platform-specific code wchar_t may be more useful. It's essentially required on Windows (e.g., some files simply cannot be opened without using wchar_t filenames), though as far as I know Windows is the only platform where this is true (so maybe we can think of wchar_t as 'Windows_char_t').
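
For example, on Windows a path containing characters outside the current "ANSI" code page can only be opened through the wide-character API; a minimal sketch using the real Win32 call CreateFileW:

    #if defined(_WIN32)
    #include <windows.h>

    // Open a file whose name cannot be expressed in the current ANSI code page.
    HANDLE open_for_reading(const wchar_t* path) {
        return CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    }
    #endif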

In hindsight wchar_t is clearly not useful for simplifying text handling, or as storage for locale independent text. Portable code should not attempt to use it for these purposes. Non-portable code may find it useful simply because some API requires it.

Alternatives

The alternative I like is to use UTF-8 encoded C strings, even on platforms not particularly friendly toward UTF-8.

This way one can write portable code using a common text representation across platforms, use standard datatypes for their intended purpose, get the language's support for those types (e.g. string literals, though some tricks are necessary to make it work for some compilers), some standard library support, debugger support (more tricks may be necessary), etc. With wide characters it's generally harder or impossible to get all of this, and you may get different pieces on different platforms.

One thing UTF-8 does not provide is the ability to use simple text algorithms such as are possible with ASCII. In this respect UTF-8 is no worse than any other Unicode encoding. In fact it may be considered better, because multi-code-unit representations are more common in UTF-8, so bugs in code handling such variable-width representations of characters are more likely to be noticed and fixed than if you try to stick to UTF-32 with NFC or NFKC.

Many platforms use UTF-8 as their native char encoding and many programs do not require any significant text processing, and so writing an internationalized program on those platforms is little different from writing code without considering internationalization. Writing more widely portable code, or writing on other platforms requires inserting conversions at the boundaries of APIs that use other encodings.

Another alternative used by some software is to choose a cross-platform representation, such as unsigned short arrays holding UTF-16 data, and then to supply all the library support and simply live with the costs in language support, etc.

C++11 adds new kinds of wide characters as alternatives to wchar_t: char16_t and char32_t, with attendant language and library features. These aren't actually guaranteed to be UTF-16 and UTF-32, but I don't imagine any major implementation will use anything else. C++11 also improves UTF-8 support, for example with UTF-8 string literals, so it won't be necessary to trick VC++ into producing UTF-8 encoded strings (although I may continue to do so rather than use the u8 prefix).
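
For reference, the C++11 literal forms look like this (u8 literals are guaranteed UTF-8; char16_t/char32_t are merely expected to be UTF-16/UTF-32 on mainstream implementations, as noted above):

    #include <string>

    std::string    s8  = u8"\u00E9";  // UTF-8 bytes in an ordinary char string (C++11..17)
    std::u16string s16 = u"\u00E9";   // char16_t code units
    std::u32string s32 = U"\u00E9";   // char32_t code units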

Alternatives to avoid

TCHAR: TCHAR is for migrating ancient Windows programs that assume legacy encodings from char to wchar_t, and is best forgotten unless your program was written in some previous millennium. It's not portable and is inherently unspecific about its encoding and even its data type, making it unusable with any non-TCHAR-based API. Since its purpose is migration to wchar_t, which we've seen above isn't a good idea, there is no value whatsoever in using TCHAR.



[1] Characters which are representable in wchar_t strings but which are not supported in any locale are not required to be represented with a single wchar_t value. This means that wchar_t could use a variable-width encoding for certain characters, another clear violation of the intent of wchar_t. Although it's arguable that a character being representable by wchar_t is enough to say that the locale 'supports' that character, in which case variable-width encodings aren't legal and Windows' use of UTF-16 is non-conformant.

[2] Unicode allows many characters to be represented with multiple code points, which creates the same problems for simple text algorithms as variable-width encodings. Even if one strictly maintains a composed normalization, some characters still require multiple code points. See: http://www.unicode.org/standard/where/

In C++ when to use WCHAR and when to use CHAR

Use char and treat it as UTF-8. There are a great many reasons for this; this website summarises it much better than I can:

http://utf8everywhere.org/

It recommends converting from wchar_t to char (UTF-16 to UTF-8) as soon as you receive a string from any library, and converting back when you need to pass strings to it. So to answer your question: always use char, except at the point where an API requires you to pass or receive wchar_t.
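
On Windows those boundary conversions typically end up as a couple of helper functions around the Win32 conversion APIs; a minimal sketch (error handling omitted):

    #if defined(_WIN32)
    #include <windows.h>
    #include <string>

    // UTF-16 (wchar_t) -> UTF-8 (char), for strings received from the API.
    std::string narrow(const std::wstring& w) {
        int n = WideCharToMultiByte(CP_UTF8, 0, w.c_str(), -1, nullptr, 0, nullptr, nullptr);
        std::string s(n, '\0');
        WideCharToMultiByte(CP_UTF8, 0, w.c_str(), -1, &s[0], n, nullptr, nullptr);
        s.resize(n - 1);                     // drop the terminating '\0' the API wrote
        return s;
    }

    // UTF-8 (char) -> UTF-16 (wchar_t), for strings passed to the API.
    std::wstring widen(const std::string& s) {
        int n = MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, nullptr, 0);
        std::wstring w(n, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, &w[0], n);
        w.resize(n - 1);
        return w;
    }
    #endif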

c++11 threading vs .Net threading?

My program is Windows-only and is built with .NET.

Considering that managed languages usually favour developer productivity over performance (when the two conflict), I'd say that managed threading is likely to be more developer-friendly. Also, garbage collection is a well-known productivity feature.

Do you have an extreme need for performance and/or control? If not, I recommend managed code and managed threading.

How can my program switch from ASCII to Unicode?

I want to write a program in C++ that should work on Unix and Windows.

First, make sure you understand the difference between how Unix supports Unicode and how Windows supports Unicode.

In the pre-Unicode days, both platforms were similar in that each locale had its own preferred character encodings. Strings were arrays of char: one char = one character, except in a few East Asian locales that used double-byte encodings (which were awkward to handle because they were not self-synchronizing).

But they approached Unicode in two different ways.

Windows NT adopted Unicode in the early days, when Unicode was intended to be a fixed-width 16-bit character encoding. Microsoft wrote an entirely new version of the Windows API using 16-bit characters (wchar_t) instead of 8-bit char. For backwards compatibility, they kept the old "ANSI" API around and defined a ton of macros so you could call either the "ANSI" or "Unicode" version depending on whether the UNICODE macro (and _UNICODE for the C runtime) was defined.
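
The mechanism is roughly this (a simplified illustration, not the actual SDK headers):

    // Both functions always exist; the un-suffixed name is just a macro.
    #ifdef UNICODE
        #define CreateFile  CreateFileW   // wchar_t / UTF-16 ("Unicode") version
    #else
        #define CreateFile  CreateFileA   // char / code-page ("ANSI") version
    #endif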

In the Unix world (specifically, Plan 9 from Bell Labs), developers decided it would be easier to expand Unix's existing East Asian multi-byte character support to handle multi-byte Unicode characters, and created the encoding now known as UTF-8. In recent years, Unix-like systems have been making UTF-8 the default encoding for most locales.

Windows could theoretically expand its ANSI support to include UTF-8, but it still hasn't, because of hard-coded assumptions about the maximum size of a character. So on Windows you're stuck with an OS API that doesn't support UTF-8 and a C++ runtime library that doesn't support UTF-8.

The upshot of this is that:

  • UTF-8 is the easiest encoding to work with on Unix.
  • UTF-16 is the easiest encoding to work with on Windows.

This creates just as much complication for cross-platform code as it sounds. It's easier if you just pick one Unicode encoding and stick to it.

Which encoding should that be?

See UTF-8 or UTF-16 or UTF-32 or UCS-2

In summary:

  • UTF-8 lets you keep the assumption of 8-bit code units.
  • UTF-32 lets you keep the assumption of fixed-width characters.
  • UTF-16 sucks, but it's still around because of Windows and Java.

wchar_t

is the standard C++ "wide character" type. But its encoding is not standardized: it's UTF-16 on Windows and UTF-32 on Unix, except on those platforms that use locale-dependent wchar_t encodings as a legacy from East Asian programming.

If you want to use UTF-32, use uint32_t or an equivalent typedef to store characters (or use wchar_t, but only if __STDC_ISO_10646__ is defined).

The new C++ standard will have char16_t and char32_t, which will hopefully clear up the confusion on how to represent UTF-16 and UTF-32.

TCHAR

is a Windows typedef for wchar_t (assumed to be UTF-16) when _UNICODE is defined, and for char (assumed to be "ANSI") otherwise. It was designed to deal with the overloaded Windows API mentioned above.

In my opinion, TCHAR sucks. It combines the disadvantages of having platform-dependent char with the disadvantages of platform-dependent wchar_t. Avoid it.

The most important consideration

Character encodings are about information interchange. That's what the "II" stands for in ASCII. Your program doesn't exist in a vacuum. You have to read and write files, which are more likely to be encoded in UTF-8 than in UTF-16.

On the other hand, you may be working with libraries that use UTF-16 (or more rarely, UTF-32) characters. This is especially true on Windows.

My recommendation is to use the encoding form that minimizes the amount of conversion you have to do.

This program should be able to use both the Unicode and non-Unicode environments

It would be much better to have your program work entirely in Unicode internally and only deal with legacy encodings for reading legacy data (or writing it, but only if explicitly asked to).

What's the purpose of QString?

My question here is, what's the purpose of QString if std::string is part of the standard library?

Reasons that I can think of:

  1. QString has been part of the Qt library since long before std::string came to life.

  2. It is used throughout the interfaces of Qt-specific classes. Hence, QString cannot easily be replaced by std::string.

  3. Its interface is a lot richer than that of std::string (see the sketch below).
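
To illustrate the third point, here is a small sketch of the kind of convenience QString offers out of the box. The member functions shown (fromUtf8, arg, split, toUpper, contains, toUtf8) are real QString API; the demo function itself is only an illustration:

    #include <QString>
    #include <QStringList>
    #include <QByteArray>

    void demo() {
        QString s = QString::fromUtf8("user=%1;role=%2").arg("Ann").arg("admin");
        QStringList parts = s.split(';');                             // splitting
        QString upper     = s.toUpper();                              // Unicode-aware case mapping
        bool hasRole      = s.contains("role", Qt::CaseInsensitive);  // case-insensitive search
        QByteArray utf8   = s.toUtf8();                               // explicit encoding at the boundary
        Q_UNUSED(parts); Q_UNUSED(upper); Q_UNUSED(hasRole); Q_UNUSED(utf8);
    }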
