How to Print UTF-8 Strings to std::cout on Windows

How to print UTF-8 strings to std::cout on Windows?

The problem is not std::cout but the Windows console. Using C stdio, you will get the ü with fputs( "\xc3\xbc", stdout ); after setting the UTF-8 codepage (either via SetConsoleOutputCP or chcp) and selecting a Unicode-capable font in cmd's settings (Consolas supports over 2000 characters, and there are registry hacks to add more capable fonts to cmd).

If you output one byte after the other with putchar('\xc3'); putchar('\xbc'); you will get two replacement characters ("double tofu"), because the console interprets each byte separately as an illegal sequence. This is probably what the C++ streams do.

See UTF-8 output on Windows console for a lengthy discussion.

For my own project, I finally implemented a std::stringbuf doing the conversion to Windows-1252. If you really need full Unicode output, however, this will not really help you.

An alternative approach would be replacing cout's streambuf, using fputs for the actual output:

#include <cstdio>
#include <iostream>
#include <sstream>

#include <Windows.h>

class MBuf : public std::stringbuf {
public:
    int sync() override {
        fputs( str().c_str(), stdout );
        str( "" );
        return 0;
    }
};

int main() {
    SetConsoleOutputCP( CP_UTF8 );
    setvbuf( stdout, nullptr, _IONBF, 0 );
    MBuf buf;
    std::cout.rdbuf( &buf );
    std::cout << u8"Greek: αβγδ\n" << std::flush;
}

I turned off output buffering here to prevent it from interfering with unfinished UTF-8 byte sequences.

C++11 std::cout string literal in UTF-8 to Windows cmd console? (Visual Studio 2015)

This is a partial answer, found by following the link posted by luk32 and confirming Melebius's comments (see below the question). This is not the complete answer, and I will be happy to accept your follow-up comment.

I have just found the UTF-8 Everywhere Manifesto, which touches on the problem. Point 17, "Q: How do I write a UTF-8 string literal in my C++ code?", says (explicitly also for the Microsoft C++ compiler):

However the most straightforward way is to just write the string as-is and save the source file encoded in UTF-8:

                                "∃y ∀x ¬(x ≺ y)"

Unfortunately, MSVC converts it to some ANSI codepage, corrupting the string. To work around this, save the file in UTF-8 without BOM. MSVC will assume that it is in the correct codepage and will not touch your strings. However, it renders it impossible to use Unicode identifiers and wide string literals (that you will not be using anyway).

I really like the manifesto. To put it briefly, bluntly, and possibly oversimplified, it says:

Ignore wstring, wchar_t, and the like. Ignore the codepages. Ignore the string literal prefixes like L, u, U, u8. Use UTF-8 everywhere. Write all literals "naturally". Ensure the text is also stored as UTF-8 in the compiled binary.

If the following code is saved as UTF-8 without BOM...

#include <iomanip>
#include <iostream>
#include "windows.h"

using namespace std;

int main()
{
    SetConsoleOutputCP(CP_UTF8);
    cout << "Příšerně žluťoučký kůň úpěl ďábelské ódy!" << endl;

    int cnt = 0;
    for (unsigned int c : "Příšerně žluťoučký kůň úpěl ďábelské ódy!")
    {
        cout << hex << setw(2) << setfill('0') << (c & 0xff);
        ++cnt;
        if (cnt % 16 == 0) cout << endl;
        else if (cnt % 8 == 0) cout << " | ";
        else if (cnt % 4 == 0) cout << " ";
        else cout << ' ';
    }
    cout << endl;
}

It prints (should be UTF-8 encoded)...

Sample Image

When saving the source as UTF-8 with BOM, it prints a different result...

Sample Image

However, the problem remains -- how to set the console encoding programmatically so that the UTF-8 string is printed correctly.

I gave up. The cmd console is simply crippled, and it is not worth fixing from the outside. I am accepting my own comment only to close the question. If anyone finds a decent solution related to the Catch unit test framework (it could be completely different), I will be glad to accept it as the answer.

Can std::cout work with UTF-8 on Windows?

Here is what I'd do:

  1. make sure your source files are UTF-8 encoded and have the correct content (open them in another editor; check the glyphs and the file encoding)

  2. remove the console from the equation -- redirect output to a file and check its content with a UTF-8-aware editor (just like with the source code)

  3. use the /utf-8 command-line option with MSVC 2015+ -- this forces the compiler to treat all source files as UTF-8 encoded ones, and the string literals stored in the resulting binary will be UTF-8 encoded

  4. remove iostreams from the equation (can't wait for this library to die, tbh) -- use cstdio

  5. at this point output should work (it does for me)

  6. to get console output to work -- use SetConsoleOutputCP(CP_UTF8) and get the console to use a TrueType font that supports your Unicode plane (I suspect that for Chinese characters to work in the console you need a font installed on your system that supports the related Unicode plane, and your console should be configured to use it)

  7. not sure about console input (never had to deal with that), but I suspect that SetConsoleCP(CP_UTF8) should make it work with non-wide i/o

  8. discard the idea of using wide i/o (wcout/etc) -- why would you do it anyway? Unicode works just fine with utf-8 encoded char const*

  9. once you have reached this stage -- time to deal with iostreams (if you insist on using them). I'd disregard wcin/wcout for now. If they don't already work -- try imbuing the related cin/cout with a UTF-8 locale.

  10. the idea promoted by http://utf8everywhere.org/ is to convert to UCS-2 only when you make Windows API call. This makes your OutputForwarderBuffer unnecessary.

  11. I guess (if you REALLY insist) now you can try getting wide iostreams to work. Good luck, I guess you'll have to reconfigure console (which will break non-wide i/o) or somehow get your wcout/wcin performing UCS2-to-UTF8 conversion on the fly (and only if it is connected to console).

Edit:
Starting from Windows 10 you also need this:

setvbuf(stderr, NULL, _IOFBF, 1024);    // on Windows 10+ we need buffering or console will get 1 byte at a time (screwing up utf-8 encoding)
setvbuf(stdout, NULL, _IOFBF, 1024);

Unfortunately, this also means there is still a chance of corrupting your output if you fill the buffer completely before the next flush. The proper solution is to flush manually (endl or fflush()) after every string sent to the output (assuming each string is shorter than 1024 bytes). If only MS supported line buffering...

Is it possible to print UTF-8 string with Boost and STL in windows console?

It looks like std::cout is much too clever here: it tries to interpret your UTF-8 encoded string as an ASCII one and finds 21 non-ASCII characters, which it outputs as the unmapped character. AFAIK the Windows C++ console driver insists on each character of a narrow char string being mapped to a position on screen and does not support multi-byte character sets.

Here is what happens under the hood:

utf8_string is the following char array (just look at a Unicode table and do the UTF-8 conversion):

utf8_string = { '\xe2', '\x99', '\xa3', '\xe2', '\x98', '\xbb', '\xe2', '\x96',
                '\xbc', '\xe2', '\x96', '\xba', '\xe2', '\x99', '\x80', '\xe2', '\x99',
                '\x82', '\xe2', '\x98', '\xbc', '\0' };

that is, 21 characters, none of which is in the ASCII range 0-0x7f.

On the opposite side, printf just outputs the bytes without any conversion, giving the correct output.

I'm sorry, but even after many searches I could not find an easy way to correctly display UTF-8 output on a Windows console using a narrow stream such as std::cout.

But you should notice that your code fails to imbue the Boost locale into cout.

Display large UTF-8-encoded strings for standard output decently, despite Windows or MinGW bugs

I worked out a fairly simple workaround by experimenting, and I am surprised that nobody seems to have known it (I found nothing like it online).

N.m.'s attempted answer gave a good hint by mentioning the platform-specific function _setmode. What it does "by design" (according to this answer and this article) is set the file translation mode, i.e., how the process's input and output streams are handled. But at the same time, it makes using std::ostream / std::istream invalid and dictates using std::wostream / std::wistream for decently formatted input and output streams.

For instance, using _setmode(_fileno(stdout), _O_U8TEXT) leads to std::wcout now working well with outputting std::wstring as UTF-8, while std::cout prints garbage characters, even for ASCII arguments. But I want to be able to mainly use std::string, especially std::cout, for output. As I have mentioned, it is a rare case that the formatting for std::cout fails, so only in cases where I print strings that may trigger this issue (potential multi-byte-encoded characters at indices of at least 1024) do I want to use a special output function, say coutUtf8String(string s).

The default (untranslated) mode of _setmode is _O_BINARY. We can temporarily switch modes. So why not just switch to _O_U8TEXT, convert the UTF-8 encoded std::string object to std::wstring, use std::wcout on it, and then switch back to _O_BINARY? To stay platform-independent, one can simply fall back to the usual std::cout call when not on Windows. Here is the code:

#if defined(_WIN32) || defined(WIN32) || defined(__CYGWIN__)
#include <fcntl.h>  // Also includes the non-standard file <io.h>
                    // (POSIX compatibility layer) to use _setmode on Windows NT.
#ifndef _O_U8TEXT   // Some GCC distributions such as TDM-GCC 9.2.0 require this
                    // explicit definition since, depending on __MSVCRT_VERSION__,
                    // they might not define it.
#define _O_U8TEXT 0x40000
#endif
#endif

void coutUtf8String(string s) {
#if defined(_WIN32) || defined(WIN32) || defined(__CYGWIN__)
    if (s.length() > 1024) {
        // Set translation mode of wcout to UTF-8; renders cout unusable "by design"
        // (see https://developercommunity.visualstudio.com/t/_setmode_filenostdout-_O_U8TEXT;--/394790#T-N411680).
        if (_setmode(STDOUT_FILENO, _O_U8TEXT) != -1) {
            wcout << utf8toWide(s) << flush; // We must flush before resetting the mode.
            // Set translation mode of wcout to untranslated; renders cout usable again.
            _setmode(STDOUT_FILENO, _O_BINARY);
        } else
            // Let's use wcout anyway. No sink (such as Eclipse's console
            // window) is attached when _setmode fails, and such sinks seem to be
            // the cause of wcout failing in default mode. The UI console view
            // is filled properly like this, regardless of translation modes.
            wcout << utf8toWide(s) << flush;
    } else
        cout << s << flush;
#else
    cout << s << flush;
#endif
}

wstring utf8toWide(const char* in) {
    wstring out;
    if (in == nullptr)
        return out;
    uint32_t codepoint = 0;
    while (*in != 0) {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;
        else if (ch <= 0xbf) // Continuation byte: shift in its 6 payload bits.
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf) // Lead byte of a 2-byte sequence.
            codepoint = ch & 0x1f;
        else if (ch <= 0xef) // Lead byte of a 3-byte sequence.
            codepoint = ch & 0x0f;
        else                 // Lead byte of a 4-byte sequence.
            codepoint = ch & 0x07;
        ++in;
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff)) {
            if (codepoint > 0xffff) {
                // Encode as a UTF-16 surrogate pair; the 0x10000 offset
                // must be subtracted before splitting into 10-bit halves.
                codepoint -= 0x10000;
                out.append(1, static_cast<wchar_t>(0xd800 + (codepoint >> 10)));
                out.append(1, static_cast<wchar_t>(0xdc00 + (codepoint & 0x03ff)));
            } else if (codepoint < 0xd800 || codepoint >= 0xe000)
                out.append(1, static_cast<wchar_t>(codepoint));
        }
    }
    return out;
}

This solution is especially convenient since it does not actually abandon UTF-8, std::string, or std::cout, which are mainly used for good reasons, but simply keeps using std::string itself and sustains platform independence. I rather agree with this answer that adding wchar_t (and all the redundant rubbish that comes with it, such as std::wstring, std::wstringstream, std::wostream, std::wistream, std::wstreambuf) to C++ was a mistake. Just because Microsoft makes bad design decisions, one should not adopt their mistakes but rather circumvent them.

Visual confirmation:
Sample Image

c++, cout and UTF-8

This is really no surprise. Unless your terminal is set to UTF-8 encoding, how does it know that s2 isn't supposed to be "(Latin small letter a with circumflex)(Euro sign)(Pipe)", supposing that your terminal is set to ISO-8859-1 according to http://www.ascii-code.com/?

By the way, cout is not "dying", as it clearly continues to produce output after your test string.

Is there a proper way to receive input from console in UTF-8 encoding?

This is the closest to the solution I've found so far:

int main(int argc, char* argv[])
{
    _setmode(_fileno(stdout), _O_WTEXT);
    _setmode(_fileno(stdin), _O_WTEXT);

    std::wcout << L"ñeñeñe";
    std::wstring in;
    std::getline(std::wcin, in);
    std::wcout << in;

    return 0;
}

The solution depicted here went in the right direction. Problem: both stdin and stdout should be in the same configuration, because the console's echo rewrites the input. The remaining problem is writing the strings with \uXXXX codes.... I am still figuring out how to overcome that, perhaps using #define's to keep the text literals readable.


