How to Cin and Cout Some Unicode Text

How can I cin and cout some unicode text?

Here is an example that shows four different methods, of which only the third (C conio) and the fourth (native Windows API) work, and only if stdin/stdout are not redirected. Note that you still need a font that contains the characters you want to show (Lucida Console supports at least Greek and Cyrillic), and that everything here is completely non-portable: there is just no portable way to input or output Unicode strings on the terminal.

#ifndef UNICODE
#define UNICODE
#endif

#ifndef _UNICODE
#define _UNICODE
#endif

#define STRICT
#define NOMINMAX
#define WIN32_LEAN_AND_MEAN

#include <iostream>
#include <string>
#include <cstdlib>
#include <cstdio>

#include <conio.h>
#include <windows.h>

void testIostream();
void testStdio();
void testConio();
void testWindows();

int wmain() {
    testIostream();
    testStdio();
    testConio();
    testWindows();
    std::system("pause");
}

void testIostream() {
    std::wstring first, second;
    std::getline(std::wcin, first);
    if (!std::wcin.good()) return;
    std::getline(std::wcin, second);
    if (!std::wcin.good()) return;
    std::wcout << first << second << std::endl;
}

void testStdio() {
    wchar_t buffer[0x1000];
    if (!_getws_s(buffer)) return;
    const std::wstring first = buffer;
    if (!_getws_s(buffer)) return;
    const std::wstring second = buffer;
    const std::wstring result = first + second;
    _putws(result.c_str());
}

void testConio() {
    wchar_t buffer[0x1000];
    std::size_t numRead = 0;
    if (_cgetws_s(buffer, &numRead)) return;
    const std::wstring first(buffer, numRead);
    if (_cgetws_s(buffer, &numRead)) return;
    const std::wstring second(buffer, numRead);
    const std::wstring result = first + second + L'\n';
    _cputws(result.c_str());
}

void testWindows() {
    const HANDLE stdIn = GetStdHandle(STD_INPUT_HANDLE);
    WCHAR buffer[0x1000];
    DWORD numRead = 0;
    // ReadConsoleW expects the buffer size in characters, not bytes
    if (!ReadConsoleW(stdIn, buffer, ARRAYSIZE(buffer), &numRead, NULL)) return;
    const std::wstring first(buffer, numRead - 2);   // strip the trailing CR LF
    if (!ReadConsoleW(stdIn, buffer, ARRAYSIZE(buffer), &numRead, NULL)) return;
    const std::wstring second(buffer, numRead);
    const std::wstring result = first + second;
    const HANDLE stdOut = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD numWritten = 0;
    WriteConsoleW(stdOut, result.c_str(), result.size(), &numWritten, NULL);
}
  • Edit 1: I've added a method based on conio.
  • Edit 2: I've messed around with _O_U16TEXT a bit as described in Michael Kaplan's blog, but that seemingly only had _getws interpret the (8-bit) data from ReadFile as UTF-16 (a minimal sketch of that approach follows below). I'll investigate this a bit further during the weekend.
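
For reference, the _O_U16TEXT approach mentioned in Edit 2 is usually set up like the minimal sketch below (essentially the same technique a later answer on this page uses); note that once a stream is switched to _O_U16TEXT, narrow output such as printf on that stream triggers an assertion in the MSVC debug runtime:

#include <io.h>
#include <fcntl.h>
#include <cstdio>
#include <iostream>
#include <string>

int main() {
    // switch the CRT streams to UTF-16 translation mode
    _setmode(_fileno(stdin), _O_U16TEXT);
    _setmode(_fileno(stdout), _O_U16TEXT);

    std::wstring line;
    std::getline(std::wcin, line);   // reads UTF-16 from the console
    std::wcout << line << L'\n';     // writes UTF-16 back
}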

How to read a user's input from the console into a Unicode string?

Shouldn't you be using the wcin stream if you expect Unicode input?

#include <iostream>
#include <string>
#include <locale>

int main()
{
    using namespace std;

    std::locale::global(locale("en_US.utf8"));

    std::wstring s;
    std::wcin >> s;
    std::wcout << s;
}

Basic issue regarding full unicode in C++

You are in the gray zone of C++ Unicode. Unicode initially started as an extension of the 7-bit ASCII characters (and of multi-byte character sets) to plain 16-bit characters, which later became the BMP. Those 16-bit characters were adopted natively by languages like Java and systems like Windows. C and C++, being more conservative from a standards point of view, decided that wchar_t would be an implementation-dependent wide character type that could be 16 or 32 bits wide (or even more), depending on requirements. The good side was that it was extensible; the dark side was that it was never made clear how non-BMP Unicode characters should be represented when wchar_t is only 16 bits.

UTF-16 was then created to give a standard representation of those non-BMP characters, with the downside that each of them needs two 16-bit code units, and that std::char_traits<wchar_t>::length is again wrong (it counts code units, not characters) whenever any of them are present in a wstring.

That is the reason why most C++ implementations chose to guarantee correct wchar_t basic I/O only for BMP Unicode characters, so that length still returns a true number of characters.

The C++-ish way is to use char32_t-based strings when full Unicode support is required. In fact, wstring and wchar_t (prefix L for literals) are implementation-dependent types, while since C++11 you also have char16_t and u16string (prefix u), which explicitly use UTF-16, and char32_t and u32string (prefix U), which give full Unicode support through UTF-32. The problem with storing characters outside the BMP in a u16string is that you lose the property string size == number of characters, which was a key reason for using wide characters instead of multi-byte characters in the first place.

One problem with u32string is that the iostreams library still has no direct specialization for 32-bit characters, but since the codecvt converters exist, you can probably use them fairly easily when you process files with a std::basic_fstream<char32_t> (untested, but according to the standard it should work). You will have no standard streams for cin, cout, and cerr, though, and will probably have to process the native form in string or u16string and then convert everything to u32string with the help of the standard converters (std::wstring_convert and the codecvt facets, available since C++11), or do it the hard way.
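
For illustration, here is a minimal sketch of that conversion step, assuming the std::wstring_convert/std::codecvt_utf8 pair (added in C++11, deprecated in C++17 but still widely available) and UTF-8 data on the narrow stream:

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main() {
    std::string line;
    std::getline(std::cin, line);               // native 8-bit form, assumed UTF-8 here

    // convert UTF-8 to UTF-32
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::u32string u32 = conv.from_bytes(line);

    std::cout << "code points: " << u32.size() << '\n';
}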

The really dark side is that since that native part currently depends on the OS, you will not be able to set up a fully portable way to process full Unicode, or at least I know of none.

How to print UTF-8 strings to std::cout on Windows?

The problem is not std::cout but the Windows console. Using C stdio you will get the ü with fputs( "\xc3\xbc", stdout ); after setting the UTF-8 code page (either using SetConsoleOutputCP or chcp) and setting a Unicode-supporting font in cmd's settings (Consolas should support over 2000 characters, and there are registry hacks to add more capable fonts to cmd).

If you output one byte after the other with putc('\xc3', stdout); putc('\xbc', stdout); you will get two replacement glyphs ("tofu") instead, because the console interprets the bytes separately as illegal characters. This is probably what the C++ streams do.
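
To make the contrast concrete, a minimal sketch (assuming the console output code page has already been switched to UTF-8, e.g. with chcp 65001):

#include <cstdio>

int main() {
    std::fputs("\xc3\xbc\n", stdout);   // whole UTF-8 sequence at once: prints ü

    std::putc('\xc3', stdout);          // byte by byte: typically rendered as
    std::putc('\xbc', stdout);          // two separate replacement glyphs
    std::putc('\n', stdout);
}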

See UTF-8 output on Windows console for a lengthy discussion.

For my own project, I finally implemented a std::stringbuf that does the conversion to Windows-1252. If you really need full Unicode output, this will not really help you, however.

An alternative approach would be replacing cout's streambuf, using fputs for the actual output:

#include <cstdio>
#include <iostream>
#include <sstream>

#include <Windows.h>

class MBuf : public std::stringbuf {
public:
    int sync() override {
        // forward the accumulated UTF-8 bytes to the C runtime in one call
        fputs( str().c_str(), stdout );
        str( "" );
        return 0;
    }
};

int main() {
    SetConsoleOutputCP( CP_UTF8 );
    setvbuf( stdout, nullptr, _IONBF, 0 );   // unbuffered, see the note below
    MBuf buf;
    std::cout.rdbuf( &buf );
    std::cout << u8"Greek: αβγδ\n" << std::flush;
}

I turned off output buffering here to prevent it from interfering with unfinished UTF-8 byte sequences.

How to output and input UTF8 or UTF16 Unicode text in Windows using C++?

As @Eryk Sun mentioned in the comments, I had to use _setmode(_fileno(stdin), _O_U16TEXT);

Windows UTF-8 console input is still (as of 2019) somewhat broken.

EDIT:

The above modification wasn't enough. I now do the following whenever I want to support the UTF-8 code page and Unicode input/output on Windows (read the code comments for more info).

#include <cstdio>
#include <iostream>
#if defined _WIN32
#include <windows.h>
#include <io.h>
#include <fcntl.h>
#endif

int main()
{
    fflush( stdout );
#if defined _MSC_VER
# pragma region WIN_UNICODE_SUPPORT_MAIN
#endif
#if defined _WIN32
    // change code page to UTF-8 UNICODE
    if ( !IsValidCodePage( CP_UTF8 ) )
    {
        return GetLastError();
    }
    if ( !SetConsoleCP( CP_UTF8 ) )
    {
        return GetLastError();
    }
    if ( !SetConsoleOutputCP( CP_UTF8 ) )
    {
        return GetLastError();
    }

    // change console font - post Windows Vista only
    HANDLE hStdOut = GetStdHandle( STD_OUTPUT_HANDLE );
    CONSOLE_FONT_INFOEX cfie;
    const auto sz = sizeof( CONSOLE_FONT_INFOEX );
    ZeroMemory( &cfie, sz );
    cfie.cbSize = sz;
    cfie.dwFontSize.Y = 14;
    wcscpy_s( cfie.FaceName, L"Lucida Console" );
    SetCurrentConsoleFontEx( hStdOut, false, &cfie );

    // change file stream translation mode
    _setmode( _fileno( stdout ), _O_U16TEXT );
    _setmode( _fileno( stderr ), _O_U16TEXT );
    _setmode( _fileno( stdin ), _O_U16TEXT );
#endif
#if defined _MSC_VER
# pragma endregion
#endif
    std::ios_base::sync_with_stdio( false );
    // program: ...

    return 0;
}

Guidelines:

  • Use "Use Windows Character Set" in Project Properties -> General -> Character Set
  • Make sure you use a terminal font that supports unicode utf-8 (Open a Console -> Properties -> Font -> "Lucida console" is ideal on Windows). The code above sets that automatically.
  • Use string and 8 bit chars.
  • Use 16 bit chars (wchar_t, wstring etc.) to interact with the Windows console
  • Use 8bit chars/string at application boundary (eg write to files, interact with other OSs etc.)
  • Convert string|char to wstring|wchar_t for interacting with the Windows APIs
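
A minimal sketch of that last conversion, using the Win32 MultiByteToWideChar API (the helper name utf8_to_wide is just for illustration); the reverse direction works the same way with WideCharToMultiByte:

#include <string>
#include <windows.h>

// Convert a UTF-8 std::string to a UTF-16 std::wstring for Windows API calls.
std::wstring utf8_to_wide( const std::string& utf8 )
{
    if ( utf8.empty() ) return std::wstring();
    // first call: ask how many wide characters are needed
    const int count = MultiByteToWideChar( CP_UTF8, 0, utf8.data(),
                                           static_cast<int>( utf8.size() ), nullptr, 0 );
    std::wstring wide( count, L'\0' );
    // second call: perform the actual conversion
    MultiByteToWideChar( CP_UTF8, 0, utf8.data(),
                         static_cast<int>( utf8.size() ), &wide[0], count );
    return wide;
}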

For some reason after inputting cin text, the cout comes out blank. Any ideas?

What threw me off is when you said it allows me to type the name of the character I want to choose

In that case, go ahead with comparing the strings:

EDIT: As Mohammed suggested, comparing strings can be done directly:

string input;

cout << "Choose your Character- 1.Sven or 2.Macy: ";
cin >> input;
cin.ignore();

if ( input == "Sven" ) {
    cout << "Welcome to CRPG, my good Sir!";
}
else if ( input == "Macy" ) {
    cout << "Girls can't fight, go back: ";
}

How to print Unicode character in C++?

To represent the character you can use Universal Character Names (UCNs). The character 'ф' has the Unicode value U+0444 and so in C++ you could write it '\u0444' or '\U00000444'. Also if the source code encoding supports this character then you can just write it literally in your source code.

// both of these assume that the character can be represented with
// a single char in the execution encoding
char b = '\u0444';
char a = 'ф'; // this line additionally assumes that the source character encoding supports this character

Printing such characters out depends on what you're printing to. If you're printing to a Unix terminal emulator, the terminal emulator is using an encoding that supports this character, and that encoding matches the compiler's execution encoding, then you can do the following:

#include <iostream>

int main() {
    std::cout << "Hello, ф or \u0444!\n";
}

This program does not require that 'ф' can be represented in a single char. On OS X and most any modern Linux install this will work just fine, because the source, execution, and console encodings will all be UTF-8 (which supports all Unicode characters).

Things are harder with Windows and there are different possibilities with different tradeoffs.

Probably the best, if you don't need portable code (you'll be using wchar_t, which should really be avoided on every other platform), is to set the mode of the output file handle to take only UTF-16 data.

#include <iostream>
#include <cstdio>
#include <io.h>
#include <fcntl.h>

int main() {
    _setmode(_fileno(stdout), _O_U16TEXT);
    std::wcout << L"Hello, \u0444!\n";
}

Portable code is more difficult.

Can std::cout work with UTF-8 on Windows?

Here is what I'd do:

  1. make sure your source files are utf-8 encoded and have correct content (open them in another editor, check glyphs and file encoding)

  2. remove the console from the equation -- redirect output to a file and check its content with a UTF-8-aware editor (just like with the source code)

  3. use the /utf-8 command-line option with MSVC 2015+ -- this will force the compiler to treat all source files as UTF-8-encoded ones, and your string literals stored in the resulting binary will be UTF-8 encoded.

  4. remove iostreams from the equation (can't wait for this library to die, tbh) -- use cstdio

  5. at this point output should work (it does for me)

  6. to get console output to work -- use SetConsoleOutputCP(CP_UTF8) and get the console to use a TrueType font that supports your Unicode plane (I suspect that for Chinese characters to work in the console you need a font installed on your system that supports the related Unicode plane, and your console should be configured to use it); a combined sketch appears at the end of this answer

  7. not sure about console input (never had to deal with that), but I suspect that SetConsoleCP(CP_UTF8) should make it work with non-wide i/o

  8. discard the idea of using wide i/o (wcout/etc) -- why would you do it anyway? Unicode works just fine with utf-8 encoded char const*

  9. once you've reached this stage -- time to deal with iostreams (if you insist on using them). I'd disregard wcin/wcout for now. If they don't already work -- try imbuing the related cin/cout with a UTF-8 locale.

  10. the idea promoted by http://utf8everywhere.org/ is to convert to UCS-2 only when you make a Windows API call. This makes your OutputForwarderBuffer unnecessary.

  11. I guess (if you REALLY insist) now you can try getting wide iostreams to work. Good luck; I guess you'll have to reconfigure the console (which will break non-wide i/o) or somehow get your wcout/wcin to perform UCS2-to-UTF8 conversion on the fly (and only when they are connected to a console).

Edit:
Starting from Windows 10 you also need this:

setvbuf(stderr, NULL, _IOFBF, 1024);    // on Windows 10+ we need buffering or console will get 1 byte at a time (screwing up utf-8 encoding)
setvbuf(stdout, NULL, _IOFBF, 1024);

Unfortunately this also means that there is still a chance of screwing up your output if you fill the buffer completely before the next flush. The proper solution -- flush it manually (endl or fflush()) after every string sent to output (assuming each string is shorter than 1024 bytes). If only MS supported line buffering...
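
Putting points 3, 4, 6 and the buffering note above together, a minimal sketch (compile with /utf-8 so the literal is stored as UTF-8 bytes):

#include <cstdio>
#include <windows.h>

int main() {
    SetConsoleOutputCP( CP_UTF8 );           // point 6: tell the console to expect UTF-8
    setvbuf( stdout, NULL, _IOFBF, 1024 );   // Windows 10+ buffering workaround from the edit above

    std::fputs( "Greek: αβγδ, Cyrillic: абвг\n", stdout );   // point 4: plain cstdio, UTF-8 bytes
    std::fflush( stdout );                   // flush manually after each string
}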


