How to Use std::imbue to Set the Locale for std::wcout

How can I use std::imbue to set the locale for std::wcout?

In this answer, I'm taking the questions in reverse order, and adding another (with answer) that came up along the way.

Is there way to use imbue rather than setting the global locale to do what I want?

Yes. By default, std::wcout is synchronized with the underlying stdout C stream, so imbue takes effect on std::wcout only once that synchronization is turned off, which lets the C++ stream operate independently. To make the original code work as intended with imbue, only a single line needs to be added: a call to std::ios_base::sync_with_stdio before the imbue call:

std::ios_base::sync_with_stdio(false);
std::wcout.imbue(ru);
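
As a fuller sketch of the same two lines: the helper below (the name imbue_wcout is mine, not from the question) turns off synchronization first and then imbues std::wcout, reporting failure instead of throwing when the named locale isn't installed. I'm assuming it is called before any other I/O on the standard streams.

```cpp
#include <iostream>
#include <locale>
#include <stdexcept>

// Hypothetical helper (not from the original question): detach std::wcout
// from stdout, then imbue it with the named locale.
// Returns false if the locale is not available on this system.
bool imbue_wcout(const char* locale_name)
{
    // Must come before any I/O on the standard streams.
    std::ios_base::sync_with_stdio(false);
    try {
        std::wcout.imbue(std::locale(locale_name));
        return true;
    } catch (const std::runtime_error&) {
        // std::locale's constructor throws std::runtime_error for unknown names.
        return false;
    }
}
```

Call it at the top of main, for example imbue_wcout("ru_RU.utf8"), and fall back to another locale name if it returns false.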

Why didn't the original version work?

The standard (I'm referring to INCITS/ISO/IEC 14882-2011[2012]) says very little about the tie to the underlying stdio stream, but in 27.4.3 it says

The object wcout controls output to a stream buffer associated with the object stdout, declared in <cstdio>

Further, unless a global locale is set explicitly, the locale is the "C" locale, which is US English ASCII, so this appears to imply that stdout will, by default, have an ASCII mapping. Since no Cyrillic characters are represented in ASCII, the underlying stdout is what converts the proper Russian into a series of ? characters.
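
To see the "not representable" part for yourself, here is a small sketch that asks the C library whether a wide character can be converted under the current C locale. The helper name is mine. It only shows that the conversion fails in the default "C" locale; the substitution with ? is done by the stdio layer and is not reproduced here.

```cpp
#include <climits>   // MB_LEN_MAX
#include <cstddef>
#include <cwchar>

// Hypothetical helper: true if wc has a multibyte encoding in the
// C locale that is currently in effect.
bool representable_in_current_locale(wchar_t wc)
{
    char buf[MB_LEN_MAX];
    std::mbstate_t state = std::mbstate_t();
    // wcrtomb returns (size_t)-1 when the character cannot be encoded.
    return std::wcrtomb(buf, wc, &state) != static_cast<std::size_t>(-1);
}
```

With no setlocale call, a plain ASCII letter passes this check and a Cyrillic letter such as L'\u041F' does not.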

Why must the sync_with_stdio call precede imbue?

According to 27.5.3.4 of the standard:

If any input or output operation has occurred using the standard streams prior to the call,
the effect is implementation-defined. Otherwise, called with a false argument, it allows the standard streams to operate independently of the standard C streams.
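
One way to keep yourself honest about the ordering is to look at the return value: sync_with_stdio returns the previous synchronization state, so the first call in a program returns true. A minimal sketch, with a helper name of my own choosing:

```cpp
#include <iostream>

// Hypothetical helper: call once at the very top of main(), before any I/O.
// Returns true if the standard streams were still synchronized with stdio
// when it was called.
bool detach_from_stdio()
{
    return std::ios_base::sync_with_stdio(false);
}
```

If this returns false, something has already turned synchronization off earlier in the program. It does not tell you whether I/O has already happened, so the call still belongs first in main.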

How to handle multiple locales for ifstream, cout, etc, in c++

This works for me as expected on my Linux machine, but not on my Windows machine under Cygwin (the set of available locales is apparently the same on both machines, but std::locale::locale just fails with every imaginable locale string).

#include <iostream>
#include <fstream>
#include <locale>
#include <string>

void printFile(const char* name, const char* loc)
{
    try {
        std::wifstream inFile;
        inFile.imbue(std::locale(loc));
        inFile.open(name);
        std::wstring line;
        while (std::getline(inFile, line))
            std::wcout << line << '\n';
    } catch (std::exception& e) {
        std::cerr << e.what() << std::endl;
    }
}

int main()
{
    std::locale::global(std::locale("en_US.utf8"));

    printFile("gtext-u8.txt", "de_DE.utf8");      // utf-8 text: grüßen
    printFile("gtext-legacy.txt", "de_DE@euro");  // iso8859-15 text: grüßen
}

Output:

grüßen
grüßen
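
If named locales are not available (as on my Cygwin setup), the per-stream behaviour can still be checked with a hand-made facet. The sketch below is only an illustration: CommaDecimal and format_with are names I made up, and the facet stands in for a real locale such as de_DE.utf8. Each stream formats with the locale it was imbued with, independent of the others.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet standing in for a real named locale:
// it only changes the decimal point to a comma.
struct CommaDecimal : std::numpunct<char> {
    char do_decimal_point() const override { return ','; }
};

// Formats value on a private stream that uses the given locale.
std::string format_with(const std::locale& loc, double value)
{
    std::ostringstream out;
    out.imbue(loc);   // affects this stream only
    out << value;
    return out.str();
}
```

For example, build std::locale german_like(std::locale::classic(), new CommaDecimal) and format the same number once with german_like and once with std::locale::classic(); the locale object takes ownership of the facet.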

std::wcin.eof(), UTF-8 and locales on different systems

This is a libc++ bug.

Note the bug report says that it only affects std::wcin and not file streams, but in my experiments this is not the case. All wchar_t streams seem to be affected.

The other major open source implementation, libstdc++, doesn't have this bug. It is possible to sidestep the libc++ bug by building the entire application (including all dynamic libraries, if any) against libstdc++.

If this is not an option, then one way to cope with the bug is to use narrow char streams, and then, when needed, recode the characters (presumably arriving encoded as UTF-8) to wchar_t (presumably UCS-4) separately. Another way is to get rid of wchar_t altogether and work in UTF-8 throughout the program, which is probably better in the long run.
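
For the "recode separately" route, a minimal sketch of the decoding step might look like this. It is my own illustration, not part of any library: it produces char32_t code points (rather than wchar_t, which is only 16 bits on some platforms), rejects truncated or malformed sequences, and does not check for overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 bytes (read from a narrow stream) into UTF-32 code points.
std::u32string decode_utf8(const std::string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (lead < 0x80)              { cp = lead;        extra = 0; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; extra = 2; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Read lines with std::getline from a narrow std::cin or std::ifstream and pass each one through decode_utf8 when you need code points.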

Multiple calls to setlocale

Is there something that setlocale(LC_ALL, NULL) does that needs to be taken care of in future setlocale calls?

No, setlocale(..., NULL) does not modify the current locale. The following code is fine:

setlocale(LC_ALL, NULL);
setlocale(LC_ALL, "ru_RU.UTF8");
wprintf(L"Привет!\n");

However the following code will fail:

wprintf(L"anything"); // or even just `fwide(stdout, 1);`
setlocale(LC_ALL, "ru_RU.UTF8");
wprintf(L"Привет!\n");

The problem is that the stream has its own locale, which is determined at the point where the stream orientation is changed to wide.

// here stdout has no orientation and no locale associated with it
wprintf(L"anything");
// `stdout` stream orientation switches to wide stream
// current locale is used - `stdout` has C locale

setlocale(LC_ALL, "ru_RU.UTF8");
wprintf(L"Привет!\n");
// `stdout` is wide oriented
// current locale is ru_RU.UTF-8
// __but__ the locale of `stdout` is still C and cannot be changed!

The only documentation I found on this is the "Streams and I18N" section of the glibc manual on gnu.org (emphasis mine):

Since a stream is created in the unoriented state it has at that point no conversion associated with it. The conversion which will be used is determined by the LC_CTYPE category selected at the time the stream is oriented. If the locales are changed at the runtime this might produce surprising results unless one pays attention. This is just another good reason to orient the stream explicitly as soon as possible, perhaps with a call to fwide.
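
Following that advice, a sketch of "set the locale, then orient the stream right away" could look like this. The helper name is mine, and I'm assuming the stream has not been used yet.

```cpp
#include <clocale>
#include <cstdio>
#include <cwchar>

// Hypothetical helper: select the locale first, then fix the stream's
// orientation so its conversion is bound to that locale.
bool set_locale_then_orient(std::FILE* stream, const char* locale_name)
{
    // setlocale returns a null pointer if the locale cannot be selected.
    const bool locale_ok = std::setlocale(LC_ALL, locale_name) != nullptr;
    // A positive mode requests wide orientation; a positive result means
    // the stream is wide-oriented after the call.
    const bool wide = std::fwide(stream, 1) > 0;
    return locale_ok && wide;
}
```

Once the stream is wide-oriented, a later fwide call with a negative mode does not change it back; it only reports the existing orientation.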

You can:

  • Use a separate locale for the C++ stream and the C FILE (see here):

std::ios_base::sync_with_stdio(false);
std::wcout.imbue(std::locale("ru_RU.utf8"));
  • Reopen stdout:

wprintf(L""); // stdout has C locale
char* new_locale = setlocale(LC_ALL, "ru_RU.UTF8");
freopen("/dev/stdout", "w", stdout); // stdout has no stream orientation
wprintf(L"Привет!\n"); // stdout is wide and ru_RU locale
  • I think (untested) that in glibc you can even reopen stdout with an explicit character set, using the ccs= option of the mode string (see GNU opening streams):

freopen("/dev/stdout", "w,ccs=UTF-8", stdout);
std::wcout << L"Привет!\n"; // fine
  • In any case, try to set locale as soon as possible before doing anything else.
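
To confirm which locale is actually in effect at a given point, the query form of setlocale from the original question is enough. A small sketch, with a helper name of my own:

```cpp
#include <clocale>
#include <string>

// Returns the name of the current C locale without modifying it.
std::string current_c_locale()
{
    // A null second argument makes setlocale a pure query.
    const char* name = std::setlocale(LC_ALL, nullptr);
    // Copy the result: later setlocale calls may overwrite the buffer.
    return name ? std::string(name) : std::string();
}
```

Calling it before and after your own setlocale call shows whether the change took effect, and calling it twice in a row shows that the query itself changes nothing.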

wcin.imbue and UTF-8

First of all you should use wcout with wcin.

Now you have two possible solutions to that:

1) Deactivate synchronization of iostream and cstdio streams by using

   ios_base::sync_with_stdio(false);

Note that this call must come before any I/O on the standard streams; otherwise the behavior is implementation-defined.

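#include <iostream>
#include <locale>
#include <string>
using namespace std;

int main() {
    // Must precede any I/O on the standard streams
    ios_base::sync_with_stdio(false);
    wcin.imbue(locale("C.UTF-8"));

    wstring s;
    wcin >> s;
    wcout << s.length() << " " << (s == L"áéú");
    return 0;
}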

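2) Set both the C locale and the locale of wcout:

#include <clocale>
#include <iostream>
#include <locale>
#include <string>
using namespace std;

int main() {
    // Streams stay synchronized here, so wcin reads through stdin and the C locale
    std::setlocale(LC_ALL, "C.UTF-8");
    wcout.imbue(locale("C.UTF-8"));

    wstring s;
    wcin >> s;
    wcout << s.length() << " " << (s == L"áéú");
    return 0;
}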

Tested both of them using ideone, works fine. I don't have clang++/libc++ with me, so wasn't able to test this behavior, sorry.

wcout does not output as desired

The following code works for me, using MinGW-w64 7.3.0 in both MSYS2 Bash, and Windows CMD; and with the source encoded as UTF-8:

#include <iostream>
#include <locale>
#include <string>
#include <codecvt>

int main()
{
    std::ios_base::sync_with_stdio(false);

    std::locale utf8( std::locale(), new std::codecvt_utf8_utf16<wchar_t> );
    std::wcout.imbue(utf8);

    std::wstring w(L"Bilişim Sistemleri Mühendisliğine Giriş");
    std::wcout << w << '\n';
}

Explanation:

  • The Windows console doesn't support any sort of 16-bit output; it supports only ANSI and, partially, UTF-8. ANSI is the default for backwards compatibility, though Windows 10 1803 does add an option to set that to UTF-8 (ref). So you need to configure wcout to convert the output to UTF-8.
  • imbue with a codecvt_utf8_utf16 achieves this; however you also need to disable sync_with_stdio otherwise the stream doesn't even use the facet, it just defers to stdout which has a similar problem.

For writing to other files, I found the same technique works to write UTF-8. For writing a UTF-16 file you need to imbue the wofstream with a UTF-16 facet, see example here, and manually write a BOM.


Commentary: Many people just avoid trying to use wide iostreams completely, due to these issues.

You can write a UTF-8 file using a narrow stream and, if you are using wstring internally, call conversion functions in your code to turn wstring into UTF-8. You can of course also use UTF-8 internally.

Of course you can also write a UTF-16 file using a narrow stream, just not with operator<< from a wstring.
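
As an illustration of the "convert, then write through a narrow stream" approach, here is a minimal encoder sketch. It is my own code, not a library API. It takes UTF-32 code points so that it does not have to deal with UTF-16 surrogate pairs (which a wstring contains on Windows), and it does not validate its input, so surrogates and values above U+10FFFF are not rejected.

```cpp
#include <string>

// Encodes UTF-32 code points as UTF-8 bytes, suitable for writing
// with a narrow std::ofstream or std::cout.
std::string encode_utf8(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Writing the result is then an ordinary narrow-stream operation, for example file << encode_utf8(text) on a std::ofstream opened in binary mode.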


