WChars, Encodings, Standards and Portability

Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++?

No, and there is no way at all to fulfill all these properties, at least if you want your program to run on Windows. On Windows, you have to ignore the C and C++ standards almost everywhere and work exclusively with wchar_t (not necessarily internally, but at all interfaces to the system). For example, if you start with

int main(int argc, char** argv)

you have already lost Unicode support for command line arguments. You have to write

int wmain(int argc, wchar_t** argv)

instead, or use the GetCommandLineW function, neither of which is specified in the C standard.
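
For illustration, here is a minimal sketch of a Unicode-aware entry point on Windows; wmain and CommandLineToArgvW are Microsoft extensions, not standard C or C++, and whether the console displays non-ASCII arguments correctly still depends on its configuration:

#include <windows.h>
#include <shellapi.h>   // CommandLineToArgvW
#include <cwchar>

int wmain(int argc, wchar_t** argv)
{
    for (int i = 0; i < argc; ++i)
        std::wprintf(L"%ls\n", argv[i]);   // arguments arrive as UTF-16 wchar_t strings
    return 0;
}

// If the entry point must remain int main(int, char**), the wide arguments
// can still be recovered like this:
//
//     int nArgs = 0;
//     wchar_t** wargv = CommandLineToArgvW(GetCommandLineW(), &nArgs);
//     // ... use wargv[0] .. wargv[nArgs - 1] ...
//     LocalFree(wargv);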

More specifically,

  • any Unicode-capable program on Windows must actively ignore the C and C++ standard for things like command line arguments, file and console I/O, or file and directory manipulation. This is certainly not idiomatic. Use the Microsoft extensions or wrappers like Boost.Filesystem or Qt instead.
  • Portability is extremely hard to achieve, especially for Unicode support. You really have to be prepared that everything you think you know is possibly wrong. For example, you have to consider that the filenames you use to open files can be different from the filenames that are actually used, and that two seemingly different filenames may represent the same file. After you create two files a and b, you might end up with a single file c, or two files d and e, whose filenames are different from the file names you passed to the OS. Either you need an external wrapper library or lots of #ifdefs.
  • Encoding agnosticity usually just doesn't work in practice, especially if you want to be portable. You have to know that wchar_t is a UTF-16 code unit on Windows and that char is often (but not always) a UTF-8 code unit on Linux. Encoding-awareness is often the more desirable goal: make sure that you always know which encoding you are working with, or use a wrapper library that abstracts them away.

I think I have to conclude that it's completely impossible to build a portable Unicode-capable application in C or C++ unless you are willing to use additional libraries and system-specific extensions, and to put lots of effort in it. Unfortunately, most applications already fail at comparatively simple tasks such as "writing Greek characters to the console" or "supporting any filename allowed by the system in a correct manner", and such tasks are only the first tiny steps towards true Unicode support.

Unicode Portability

Multiplatform issues come from the fact that there are many encodings, and a wrong encoding choice will lead to encóding íssues. Once you tackle that problem, you should be able to use std::wstring throughout your program.

The usual workflow is:

raw_input_data = read_raw_data()
input_encoding = "???" // What is your file or terminal encoding?

unicode_data = convert_to_unicode(raw_input_data, input_encoding)

// Do something with the unicode_data, store in some var, etc.

output_encoding = "???" // Is your terminal output encoding the same as your input?
raw_output_data = convert_from_unicode(unicode_data, output_encoding)

print_raw_data(raw_output_data)

Most Unicode issues come from wrongly detecting the values of input_encoding and output_encoding. On a modern Linux distribution both are usually UTF-8. On Windows, YMMV.

Standard C++ doesn't know about encodings, so you should use a library such as ICU to do the conversion.
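
As a rough sketch of that workflow with ICU (the library choice and the helper name recode are only for illustration; here the output encoding is fixed to UTF-8 for simplicity):

#include <unicode/unistr.h>   // icu::UnicodeString
#include <string>

std::string recode(const std::string& raw_input_data, const char* input_encoding)
{
    // convert_to_unicode: interpret the raw bytes using the known input encoding;
    // ICU stores the result internally as UTF-16.
    icu::UnicodeString unicode_data(raw_input_data.c_str(), input_encoding);

    // convert_from_unicode: write the text back out, here as UTF-8.
    std::string raw_output_data;
    unicode_data.toUTF8String(raw_output_data);
    return raw_output_data;
}

For example, recode(latin1_bytes, "ISO-8859-1") would turn Latin-1 input into UTF-8 output; detecting the right input_encoding in the first place remains the hard part.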

Does the C++ standard mandate an encoding for wchar_t?

wchar_t is just an integral type. It has a minimum value, a maximum value, etc.

Its size is not fixed by the standard.

If it is large enough, you can store UCS-2 or UCS-4 data in a buffer of wchar_t. This is true regardless of the system you are on, as UCS-2 and UCS-4 and UTF-16 and UTF-32 are just descriptions of integer values arranged in a sequence.

In C++11, there are std APIs that read or write data presuming it has those encodings. In C++03, there are APIs that read or write data using the current locale.

22.5 Standard code conversion facets [locale.stdcvt]

3 For each of the three code conversion facets codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16:

(3.1) — Elem is the wide-character type, such as wchar_t, char16_t, or char32_t.

4 For the facet codecvt_utf8:

(4.1) — The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.

So here codecvt_utf8 deals with UTF-8 on one side, and UCS2 or UCS4 (depending on how big Elem is) on the other. It does conversion.

The Elem (the wide character) is presumed to be encoded in UCS2 or UCS4 depending on how big it is.

This does not mean that wchar_t is encoded as such, it just means this operation interprets the wchar_t as being encoded as such.

How the UCS2 or UCS4 got into the Elem is not something this part of the standard cares about. Maybe you set it in there with hex constants. Maybe you read it from I/O. Maybe you calculated it on the fly. Maybe you used a high-quality random-number generator. Maybe you added together the bit-values of an ASCII string. Maybe you calculated a fixed-point approximation of the log* of the number of seconds it takes the moon to change the Earth's day by 1 second. Not these paragraphs' problem. These paragraphs simply mandate how bits are modified and interpreted.

Similar claims hold in the other cases. This does not mandate what format a wchar_t has. It simply states how these facets interpret wchar_t or char16_t or char32_t (reading or writing).

Other ways of interacting with wchar_t use different methods to mandate how the value of the wchar_t is interpreted.

iswalpha uses the (global) locale to interpret the wchar_t, for example. In some locales, the wchar_t may be UCS2. In others, it might be some insane Cthulhian encoding whose details enable you to see a new color out of space.

To be explicit: encodings are not the property of data, or bits. Encodings are properties of interpretation of data. Quite often there is only one proper or reasonable interpretation of data that makes any sense, but the data itself is bits.

The C++ standard does not mandate what is stored in a wchar_t. It does mandate what certain operations interpret the contents of a wchar_t to be. That section describes how some facets interpret the data in a wchar_t.
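
As a concrete example of such an interpretation, std::wstring_convert (C++11, deprecated in C++17) applies one of these facets to a whole string; the sketch below simply assumes the wchar_t data really is UCS2/UCS4:

#include <codecvt>
#include <locale>
#include <string>

int main()
{
    // This facet *interprets* the wchar_t values as UCS2 or UCS4 (depending on
    // sizeof(wchar_t)) and converts them to a UTF-8 byte sequence, and back.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;

    std::wstring wide = L"\x00E4";            // the code point U+00E4 stored via a hex constant
    std::string  utf8 = conv.to_bytes(wide);  // "\xC3\xA4" if that interpretation was right
    std::wstring back = conv.from_bytes(utf8);
}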

How do I fix these encodings?

In the writeScannedKey function, the following call writes two bytes per wide character to the file:

 WriteFile(file, text, wcslen(text) * sizeof(wchar_t), NULL, NULL);

But in the writeLog4 function, you write only one byte per character:

WriteFile(file, (LPCVOID)text, sizeof(text) / sizeof(char) * sizeof(char), NULL, NULL);

When the file is then read back as UTF-16, every two bytes are interpreted as one character, so a lone single byte gets combined with the byte that follows it and a garbled character is displayed.
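
One way to avoid the mismatch is to pick a single encoding for the whole file. The sketch below (a hypothetical helper, assuming file is a HANDLE obtained from CreateFileW) converts the wide string to UTF-8 with WideCharToMultiByte before every write, so the file only ever contains one encoding:

#include <windows.h>
#include <string>

void write_as_utf8(HANDLE file, const wchar_t* text)
{
    // First call: ask how many bytes the UTF-8 form needs (including the NUL).
    int bytes = WideCharToMultiByte(CP_UTF8, 0, text, -1, NULL, 0, NULL, NULL);
    if (bytes <= 1)
        return; // empty string or conversion failure

    std::string utf8(bytes, '\0');
    WideCharToMultiByte(CP_UTF8, 0, text, -1, &utf8[0], bytes, NULL, NULL);
    utf8.resize(bytes - 1); // drop the trailing NUL before writing

    DWORD written = 0;
    WriteFile(file, utf8.data(), (DWORD)utf8.size(), &written, NULL);
}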

How to do string operations with Win32 WCHAR

Shorter Answer

The specific problem you’re having is that current[n] is the nth element in the array, not the nth byte of the array. Doing pointer arithmetic like current + n also gives you the nth element after the one current points to. The same is true if you declare an array of int, double, some struct or anything else.

So, when you declare an array wchar_t a[] = L"!", then take wcslen(a), you get back the count of wide characters in the array, 1. If you try to set i = wcslen(a) - 2; and then take a[i], i will be -1, which is a serious bug.
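
Spelled out as code (a hypothetical reproduction of that off-by-one):

#include <cwchar>
#include <cstdio>

int main()
{
    wchar_t a[] = L"!";
    int i = (int)std::wcslen(a) - 2;  // std::wcslen(a) is 1, so i is -1
    std::printf("%d\n", i);           // prints -1; reading a[i] would be out of bounds
}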

Longer Explanation

On Windows, WCHAR is an alias for the standard type wchar_t. You don’t say whether you’re writing in C or C++. There are a number of functions in the C standard library to manipulate wide-character strings, in <wchar.h> and <wctype.h>. The C++ standard library has all of these, as well as std::wstring in <string> and wide-character streams including std::wcout, std::wcin and std::wcerr (although Windows doesn’t fully support them). Most Windows API functions also can accept wide-character strings. The standard type of a wide character string is wchar_t*, but WCHAR*, LPWSTR and, by default on modern versions of Visual Studio, TCHAR* and LPTSTR also work.

On Windows, wide characters are little-endian UTF-16. This is not portable, but then, neither is WCHAR. On some other systems, wide characters are either big-endian UTF-16, or big- or little-endian UTF-32. In C, the standard types char16_t and char32_t are defined in <uchar.h>. In C++, they are built into the language. If you try to pass a char16_t* to a function that expects a wchar_t*, it won't work without a cast, and on targets other than Windows it won't work at all.

UTF-8 is a way of storing Unicode code points that's backwards-compatible with seven-bit ASCII. UTF-8 is an alternative representation to UTF-16 and UTF-32. A UTF-8 string is stored in an array of unsigned char or char, with one Unicode code point potentially needing several bytes to store it. Actually, because of surrogate pairs, a Unicode code point can need two UTF-16 code units to encode it, as well. There are times when it's convenient to use a different representation (UTF-16LE is what the Windows ABI expects and what some libraries such as ICU and Qt use internally, and UTF-32 is the only representation that guarantees all Unicode characters will fit into a single element), but my advice is to use UTF-8 whenever you can and some other encoding whenever you have to.

Possible solution

If you want to read backwards through a wide string, you might try this:

int i = wcslen(inStr); // Could be 0.

if (i > 0) { // Don't read one element past the start of the array.
    do {
        --i;
    } while ( i > 0 && inStr[i] != L'/' );
}

/* When we reach this line, i is either 0 or the index of the last slash
* in inStr, which could also be 0. We can test whether inStr[i] == L'/' or
* write an if() within our loop to do something more complicated.
*/

What are inconveniences of using UTF-8 instead of wchar_t with non-Western languages?

Assuming the library functions work for UTF-8 (this is not true for Windows generally), there's no real problem as long as you actually USE library functions. However, if you write code that manually interprets individual elements in a string array, you need to take into account that a code-point can be more than a single byte in UTF-8 - particularly when dealing with non-English characters (including for example German/Scandinavian characters such as 'ä', 'ö', 'ü'). And even with 16 bits per entry, you can find situations where one code-point takes up two 16-bit entries.

If you don't take this into account, the separate parts can "confuse" processing, e.g. recognise things in the middle of a code-point as having a different meaning than being the middle of something.

The variable length of a code-point leads to all sorts of interesting effects on for example string lengths and substrings - where the length in is in number of elements of the array holding the string, which can be quite different from the number of code-points.
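
A tiny illustration of that difference (the byte values are simply the UTF-8 encoding of 'ä'):

#include <cstring>
#include <cstdio>

int main()
{
    const char* s = "\xC3\xA4";            // "ä": one code-point, two UTF-8 string elements
    std::printf("%zu\n", std::strlen(s));  // prints 2 - strlen counts elements (bytes), not code-points
}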

Whichever encoding is used, there are further complications with for example Arabic languages, where individual characters need to be chained together. This is of course only important when actually drawing characters, but is worth at least bearing in mind.

Terminology (for my writings!):

Character = A letter/symbol that can be displayed on screen.

Code-point = representation of a character in a string, may be one or more elements in a string array.

String array = the storage for a string, consists of elements of a fixed size (e.g. 8 bits, 16 bits, 32 bits, 64 bits)

String Element = One unit of a string array.

How to implement C++ Asian characters for cross platform?

Why not wchar_t and wstring? Yes, it's 4 bytes on some platforms and 2 bytes on others; still, it has the advantage of having a bunch of string processing RTL routines built around it. Cocoa's NSString/CFString is 2 bytes per character (like wchar_t on Windows), but it's extremely unportable.

You'd have to be careful around persistence and wire formats - make sure they don't depend on the size of wchar_t.
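
For example, a minimal sketch of keeping a wire format independent of wchar_t (the helper name and the choice of little-endian UTF-16 are assumptions for illustration): serialize through a fixed-width type such as char16_t and convert to wchar_t only at the platform boundary.

#include <cstdint>
#include <string>
#include <vector>

// Serialize a UTF-16 string as explicit little-endian bytes, so the on-disk
// or on-the-wire layout never depends on sizeof(wchar_t) or host endianness.
std::vector<std::uint8_t> serialize_utf16le(const std::u16string& s)
{
    std::vector<std::uint8_t> out;
    out.reserve(s.size() * 2);
    for (char16_t c : s)
    {
        out.push_back(static_cast<std::uint8_t>(c & 0xFF));        // low byte first
        out.push_back(static_cast<std::uint8_t>((c >> 8) & 0xFF)); // then high byte
    }
    return out;
}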

It depends, really, on what your optimization priority is. If you have intense processing (parsing, etc.), go with wchar_t. If you'd rather interact smoothly with the host system, opt for whatever format matches the assumptions of the host OS.

Redefining wchar_t to be two bytes is an option, too. It's -fshort-wchar with GCC. You'll lose the whole body of the wcs* RTL and a good portion of the STL, but there will be less codepage translation when interacting with the host system. It so happens that both big-name mobile platforms out there (one fruit-themed, one robot-themed) have two-byte strings as their native format, but a 4-byte wchar_t by default. -fshort-wchar works on both; I've tried.

Here's a handy summary of desktop and mobile platforms:

  • Windows, Windows Phone, Windows RT, Windows CE: wchar_t is 2 bytes, OS uses UTF-16
  • Vanilla desktop Linux: wchar_t is 4 bytes, OS uses UTF-8, various frameworks may use who knows what (Qt, notably, uses UTF-16)
  • Mac OS X, iOS: wchar_t is 4 bytes, OS uses UTF-16, userland comes with an alternative 2-byte-based string RTL
  • Android: wchar_t is 4 bytes, OS uses UTF-8, but the layer of interaction with Java uses UTF-16
  • Samsung bada: wchar_t is 2 bytes, the userland API uses UTF-16, POSIX layer is severely crippled anyway so who cares

Compare std::wstring and std::string

Since you asked, here are my standard conversion functions from string to wide string, implemented using the C++ std::string and std::wstring classes.

First off, make sure to start your program by calling setlocale:

#include <clocale>

int main()
{
    std::setlocale(LC_CTYPE, ""); // before any string operations
}

Now for the functions. First off, getting a wide string from a narrow string:

#include <string>
#include <vector>
#include <iostream>
#include <cassert>
#include <cstdlib>
#include <cwchar>
#include <cerrno>

// Dummy overload
std::wstring get_wstring(const std::wstring & s)
{
    return s;
}

// Real worker
std::wstring get_wstring(const std::string & s)
{
    const char * cs = s.c_str();
    const size_t wn = std::mbsrtowcs(NULL, &cs, 0, NULL);

    if (wn == size_t(-1))
    {
        std::cout << "Error in mbsrtowcs(): " << errno << std::endl;
        return L"";
    }

    std::vector<wchar_t> buf(wn + 1);
    const size_t wn_again = std::mbsrtowcs(buf.data(), &cs, wn + 1, NULL);

    if (wn_again == size_t(-1))
    {
        std::cout << "Error in mbsrtowcs(): " << errno << std::endl;
        return L"";
    }

    assert(cs == NULL); // successful conversion

    return std::wstring(buf.data(), wn);
}

And going back, making a narrow string from a wide string. I call the narrow string "locale string", because it is in a platform-specific encoding that depends on the current locale:

// Dummy
std::string get_locale_string(const std::string & s)
{
    return s;
}

// Real worker
std::string get_locale_string(const std::wstring & s)
{
    const wchar_t * cs = s.c_str();
    const size_t wn = std::wcsrtombs(NULL, &cs, 0, NULL);

    if (wn == size_t(-1))
    {
        std::cout << "Error in wcsrtombs(): " << errno << std::endl;
        return "";
    }

    std::vector<char> buf(wn + 1);
    const size_t wn_again = std::wcsrtombs(buf.data(), &cs, wn + 1, NULL);

    if (wn_again == size_t(-1))
    {
        std::cout << "Error in wcsrtombs(): " << errno << std::endl;
        return "";
    }

    assert(cs == NULL); // successful conversion

    return std::string(buf.data(), wn);
}

Some notes:

  • If you don't have std::vector::data(), you can say &buf[0] instead.
  • I've found that the restartable (r-style) conversion functions mbsrtowcs and wcsrtombs don't work properly on Windows. There, you can use mbstowcs and wcstombs instead: mbstowcs(buf.data(), cs, wn + 1); and wcstombs(buf.data(), cs, wn + 1);


In response to your question: if you want to compare two strings, you can convert both of them to wide strings and then compare those. If you are reading a file from disk that has a known encoding, you can use iconv() to convert the file from your known encoding to wchar_t and then compare with the wide string.
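
A rough sketch of that iconv() route (POSIX, not standard C++; the "WCHAR_T" target name is a glibc convention and the helper below is only illustrative):

#include <iconv.h>
#include <string>
#include <vector>

// Hypothetical helper: decode bytes in a known encoding into a wide string.
std::wstring from_known_encoding(const std::string& bytes, const char* encoding)
{
    iconv_t cd = iconv_open("WCHAR_T", encoding); // e.g. encoding = "ISO-8859-1"
    if (cd == (iconv_t)-1)
        return L"";

    std::vector<wchar_t> out(bytes.size() + 1);   // worst case: one wchar_t per input byte
    char* inbuf = const_cast<char*>(bytes.data());
    size_t inleft = bytes.size();
    char* outbuf = reinterpret_cast<char*>(out.data());
    size_t outleft = out.size() * sizeof(wchar_t);

    size_t rc = iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return L"";

    size_t produced = out.size() * sizeof(wchar_t) - outleft;
    return std::wstring(out.data(), produced / sizeof(wchar_t));
}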

Beware, though, that complex Unicode text may have multiple different representations as code point sequences which you may want to consider equal. If that is a possibility, you need to use a higher-level Unicode processing library (such as ICU) and normalize your strings to some common, comparable form.


