How can I use std::imbue to set the locale for std::wcout?
In this answer, I'm taking the questions in reverse order, and adding another (with answer) that came up along the way.
Is there a way to use imbue rather than setting the global locale to do what I want?
Yes. By default, std::wcout is synchronized to the underlying stdout C stream. So std::wcout can use imbue if that synchronization is turned off, allowing the C++ stream to operate independently. To modify the original code to use imbue and work as intended, only a single line need be added, calling std::ios_base::sync_with_stdio:
std::ios_base::sync_with_stdio(false);
std::wcout.imbue(ru);
Why didn't the original version work?
The standard (I'm referring to INCITS/ISO/IEC 14882-2011[2012]) says very little about the tie to the underlying stdio stream, but in 27.4.3 it says:
The object wcout controls output to a stream buffer associated with the object stdout, declared in <cstdio>.
Further, without explicitly setting a global locale, the locale is the "C" locale, which is effectively US English ASCII, so this appears to imply that stdout will, by default, have an ASCII mapping. Since no Cyrillic characters are represented in ASCII, the underlying stdout is what converts the proper Russian text into a series of ? characters.
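As a rough illustration of that limitation, the sketch below asks the classic "C" locale to narrow a wide character to a single byte. It is not necessarily the exact code path that std::wcout or stdout takes, and the helper name narrow_in_classic is made up for this example; it only shows that the "C" locale has no single-byte form for a Cyrillic letter and so falls back to the substitute character it is given.

```cpp
#include <locale>

// Hypothetical helper (not from the original answer): narrow one wide
// character using the classic "C" locale, substituting '?' when the
// locale has no single-byte equivalent for it.
char narrow_in_classic(wchar_t wc)
{
    const std::ctype<wchar_t>& ct =
        std::use_facet<std::ctype<wchar_t>>(std::locale::classic());
    return ct.narrow(wc, '?');
}
```

An ASCII letter such as L'A' comes back unchanged, while a Cyrillic letter such as L'\u041F' has no representation and is replaced by the fallback.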
Why must the sync_with_stdio call precede imbue?
According to 27.5.3.4 of the standard:
If any input or output operation has occurred using the standard streams prior to the call, the effect is implementation-defined. Otherwise, called with a false argument, it allows the standard streams to operate independently of the standard C streams.
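One way to keep the two calls in the right order is to wrap them in a single function that is called once, before any stream I/O. This is only a sketch; the name detach_and_imbue_wcout is hypothetical and not part of any library.

```cpp
#include <iostream>
#include <locale>

// Hypothetical helper: detach the C++ standard streams from the C stdio
// streams first, then imbue std::wcout with the given locale.
// Returns the locale that std::wcout had before the call.
// Intended to be called before any input or output has taken place.
std::locale detach_and_imbue_wcout(const std::locale& loc)
{
    std::ios_base::sync_with_stdio(false);
    return std::wcout.imbue(loc);
}
```

At the top of main this could be used as detach_and_imbue_wcout(std::locale("ru_RU.utf8")), assuming that locale is installed on the system; the std::locale constructor throws std::runtime_error if it is not.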
How to handle multiple locales for ifstream, cout, etc, in c++
This works for me as expected on my Linux machine, but not on my Windows machine under Cygwin (the set of available locales is apparently the same on both machines, but std::locale::locale just fails with every imaginable locale string).
#include <iostream>
#include <fstream>
#include <locale>
#include <string>

void printFile(const char* name, const char* loc)
{
    try {
        std::wifstream inFile;
        inFile.imbue(std::locale(loc));
        inFile.open(name);
        std::wstring line;
        while (getline(inFile, line))
            std::wcout << line << '\n';
    } catch (std::exception& e) {
        std::cerr << e.what() << std::endl;
    }
}

int main()
{
    std::locale::global(std::locale("en_US.utf8"));
    printFile("gtext-u8.txt", "de_DE.utf8");      // utf-8 text: grüßen
    printFile("gtext-legacy.txt", "de_DE@euro");  // iso8859-15 text: grüßen
}
Output:
grüßen
grüßen
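Since the std::locale constructor throws when a named locale is not available (as apparently happens under Cygwin here), a program that must keep running can fall back to the classic locale. The following is a minimal sketch of that idea; locale_or_classic is a hypothetical name, and falling back silently may or may not be what a given program wants.

```cpp
#include <locale>
#include <stdexcept>

// Hypothetical helper: try to construct the named locale and fall back
// to the classic "C" locale if the system does not provide it.
std::locale locale_or_classic(const char* name)
{
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}
```

With this, inFile.imbue(locale_or_classic(loc)) would no longer throw, but text in a non-ASCII encoding would then be read with the "C" locale's conversion, so the result may still be wrong.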
std::wcin.eof(), UTF-8 and locales on different systems
This is a libc++ bug.
Note the bug report says that it only affects std::wcin and not file streams, but in my experiments this is not the case. All wchar_t streams seem to be affected.
The other major open source implementation, libstdc++, doesn't have this bug. It is possible to sidestep the libc++ bug by building the entire application (including all dynamic libraries, if any) against libstdc++.
If this is not an option, then one way to cope with the bug is to use narrow char streams, and then, when needed, recode the characters (presumably arriving encoded as UTF-8) to wchar_t (presumably UCS-4) separately. Another way is to get rid of wchar_t altogether and work in UTF-8 throughout the program, which is probably better in the long run.
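The "recode separately" approach can be sketched as a small hand-written decoder that takes bytes read from a narrow stream and produces UCS-4 code points. This is an illustration only: decode_utf8 is a hypothetical helper, it returns std::u32string rather than std::wstring to stay independent of the platform's wchar_t width, and it does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UCS-4 code points.
// Invalid lead bytes and truncated or malformed sequences become U+FFFD.
std::u32string decode_utf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else {                          // stray continuation or invalid lead
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;             // sequence ended early
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}
```

A line read with std::getline from std::cin or a std::ifstream can be passed straight to this function, so no wchar_t stream is involved at all.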
Multiple calls to setlocale
Is there something that setlocale(LC_ALL, NULL) does that needs to be taken care of in future setlocale calls?
No, setlocale(..., NULL) does not modify the current locale. The following code is fine:
setlocale(LC_ALL, NULL);
setlocale(LC_ALL, "ru_RU.UTF8");
wprintf(L"Привет!\n");
However the following code will fail:
wprintf(L"anything"); // or even just `fwide(stdout, 1);`
setlocale(LC_ALL, "ru_RU.UTF8");
wprintf(L"Привет!\n");
The problem is that a stream has its own locale, which is determined at the point the stream orientation is changed to wide.
// here stdout has no orientation and no locale associated with it
wprintf(L"anything");
// `stdout` stream orientation switches to wide stream
// current locale is used - `stdout` has C locale
setlocale(LC_ALL, "ru_RU.UTF8");
wprintf(L"Привет!\n");
// `stdout` is wide oriented
// current locale is ru_RU.UTF-8
// __but__ the locale of `stdout` is still C and cannot be changed!
The only documentation I found of this is on gnu.org, Stream and I18N (emphasis mine):
Since a stream is created in the unoriented state it has at that point no conversion associated with it. The conversion which will be used is determined by the LC_CTYPE category selected at the time the stream is oriented. If the locales are changed at the runtime this might produce surprising results unless one pays attention. This is just another good reason to orient the stream explicitly as soon as possible, perhaps with a call to fwide.
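The fact that a stream's orientation is fixed once it has been set can be checked on a scratch stream without touching stdout. The sketch below uses a temporary file; wide_orientation_is_sticky is a hypothetical name, and the function also returns false if no temporary file can be created.

```cpp
#include <cstdio>
#include <cwchar>

// Hypothetical check: a fresh stream starts unoriented, becomes wide on
// request, and then ignores a later request to become byte-oriented.
bool wide_orientation_is_sticky()
{
    std::FILE* f = std::tmpfile();
    if (!f)
        return false;  // could not create a scratch stream

    const bool unoriented_at_first = (std::fwide(f, 0) == 0);
    const bool now_wide            = (std::fwide(f, 1) > 0);
    const bool still_wide          = (std::fwide(f, -1) > 0);

    std::fclose(f);
    return unoriented_at_first && now_wide && still_wide;
}
```

In a real program, the advice from the quote would translate to calling setlocale(LC_ALL, ...) first and fwide(stdout, 1) immediately afterwards, before any output.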
You can:
- Use a separate locale for the C++ stream and the C FILE (see here):
std::ios_base::sync_with_stdio(false);
std::wcout.imbue(std::locale("ru_RU.utf8"));
- Reopen stdout:
wprintf(L""); // stdout has C locale
char* new_locale = setlocale(LC_ALL, "ru_RU.UTF8");
freopen("/dev/stdout", "w", stdout); // stdout has no stream orientation
wprintf(L"Привет!\n"); // stdout is wide and ru_RU locale
- I think (untested) that in glibc you can even reopen stdout with an explicit character set (see GNU opening streams):
freopen("/dev/stdout", "w,ccs=UTF-8", stdout);
std::wcout << L"Привет!\n"; // fine
- In any case, try to set the locale as early as possible, before doing anything else.
wcin.imbue and UTF-8
First of all you should use wcout with wcin.
Now you have two possible solutions to that:
1) Deactivate synchronization of the iostream and cstdio streams by using ios_base::sync_with_stdio(false);
Note that this should be the first call; otherwise the behavior depends on the implementation.
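#include <iostream>
#include <locale>
#include <string>
using namespace std;

int main() {
    ios_base::sync_with_stdio(false);  // before any stream I/O
    wcin.imbue(locale("C.UTF-8"));
    wstring s;
    wcin >> s;
    wcout << s.length() << " " << (s == L"áéú");
    return 0;
}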
2) Set the C locale and imbue wcout as well:
#include <clocale>
#include <iostream>
#include <locale>
#include <string>
using namespace std;

int main() {
    std::setlocale(LC_ALL, "C.UTF-8");
    wcout.imbue(locale("C.UTF-8"));
    wstring s;
    wcin >> s;
    wcout << s.length() << " " << (s == L"áéú");
    return 0;
}
I tested both of them using ideone and they work fine. I don't have clang++/libc++ with me, so I wasn't able to test this behavior there, sorry.
wcout does not output as desired
The following code works for me using MinGW-w64 7.3.0, in both MSYS2 Bash and Windows CMD, with the source encoded as UTF-8:
#include <iostream>
#include <locale>
#include <string>
#include <codecvt>

int main()
{
    std::ios_base::sync_with_stdio(false);
    std::locale utf8( std::locale(), new std::codecvt_utf8_utf16<wchar_t> );
    std::wcout.imbue(utf8);
    std::wstring w(L"Bilişim Sistemleri Mühendisliğine Giriş");
    std::wcout << w << '\n';
}
Explanation:
- The Windows console doesn't support any sort of 16-bit output; it supports only ANSI and, partially, UTF-8. ANSI is the default for backwards compatibility purposes, though Windows 10 1803 does add an option to set that to UTF-8 (ref). So you need to configure wcout to convert the output to UTF-8. imbue with a codecvt_utf8_utf16 achieves this; however, you also need to disable sync_with_stdio, otherwise the stream doesn't even use the facet, it just defers to stdout, which has a similar problem.
For writing to other files, I found the same technique works to write UTF-8. For writing a UTF-16 file you need to imbue the wofstream with a UTF-16 facet, see example here, and manually write a BOM.
Commentary: Many people just avoid trying to use wide iostreams completely, due to these issues.
You can write a UTF-8 file using a narrow stream and, if you are using wstring internally, call a function in your code to convert each wstring to UTF-8 first; you can of course also use UTF-8 internally.
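Such a conversion function might look like the sketch below. It is hand-written rather than based on the <codecvt> facets, the name to_utf8 is hypothetical, and it assumes the wstring holds either UTF-32 (as on Linux) or UTF-16 (as on Windows), combining surrogate pairs in the latter case. Unpaired surrogates are not handled specially.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encode a wide string as UTF-8 bytes.
std::string to_utf8(const std::wstring& in)
{
    std::string out;
    auto put = [&out](char32_t v) { out.push_back(static_cast<char>(v)); };

    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = static_cast<char32_t>(in[i]);

        // Combine a UTF-16 surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            const char32_t lo = static_cast<char32_t>(in[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }

        if (cp < 0x80) {
            put(cp);
        } else if (cp < 0x800) {
            put(0xC0 | (cp >> 6));
            put(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            put(0xE0 | (cp >> 12));
            put(0x80 | ((cp >> 6) & 0x3F));
            put(0x80 | (cp & 0x3F));
        } else {
            put(0xF0 | (cp >> 18));
            put(0x80 | ((cp >> 12) & 0x3F));
            put(0x80 | ((cp >> 6) & 0x3F));
            put(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The result can be written with an ordinary narrow stream, for example a std::ofstream opened with std::ios::binary, using operator<< on the returned std::string.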
Of course you can also write a UTF-16 file using a narrow stream, just not with operator<< from a wstring.
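A minimal sketch of that last point: build the bytes of a little-endian UTF-16 file, BOM included, as a std::string that a narrow stream can write verbatim. The helper name utf16le_bytes is hypothetical, and it takes a std::u16string on the assumption that the text is already available as UTF-16 code units.

```cpp
#include <string>

// Hypothetical helper: produce the raw bytes of a UTF-16LE file,
// starting with the byte order mark FF FE.
std::string utf16le_bytes(const std::u16string& text)
{
    std::string out;
    out.push_back('\xFF');  // BOM, little-endian
    out.push_back('\xFE');
    for (char16_t unit : text) {
        out.push_back(static_cast<char>(unit & 0xFF));         // low byte
        out.push_back(static_cast<char>((unit >> 8) & 0xFF));  // high byte
    }
    return out;
}
```

The narrow stream should be opened in binary mode (std::ios::binary) so that no newline translation alters the bytes.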