How to Create a UTF-8 String Literal in Visual C++ 2008

How to create a UTF-8 string literal in Visual C++ 2008

Update:

I've decided that there is no guaranteed way to do this. The solution that I present below works for the English version of VC2003, but fails when compiling with the Japanese version of VC2003 (or perhaps it is the Japanese OS). In any case, it cannot be depended on to work. Note that even declaring everything as L"" strings didn't work (and is painful in gcc, as described below).

Instead, I believe that you just need to bite the bullet and move all text into a data file and load it from there. I am now storing and accessing the text in INI files via SimpleIni (a cross-platform INI-file library). At least this is guaranteed to work, since all of the text is kept out of the program.

Original:

I'm answering this myself since only Evan appeared to understand the problem. The answers regarding what Unicode is and how to use wchar_t are not relevant here, as this is not about internationalization, nor a misunderstanding of Unicode or character encodings. I appreciate your attempts to help, though; apologies if I wasn't clear enough.

The problem is that I have source files that need to be cross-compiled under a variety of platforms and compilers. The program does UTF-8 processing; it doesn't care about any other encodings. I want to have string literals in UTF-8, as currently works with gcc and vc2003. How do I do it with VC2008 (i.e. a backward-compatible solution)?

This is what I have found:

gcc (v4.3.2 20081105):

  • string literals are used as is (raw strings)
  • supports UTF-8 encoded source files
  • source files must not have a UTF-8 BOM

vc2003:

  • string literals are used as is (raw strings)
  • supports UTF-8 encoded source files
  • source files may or may not have a UTF-8 BOM (it doesn't matter)

vc2005+:

  • string literals are massaged by the compiler (no raw strings)
  • char string literals are re-encoded to a specified locale
  • UTF-8 is not supported as a target locale
  • source files must have a UTF-8 BOM

So, the simple answer is that for this particular purpose, VC2005+ is broken and does not supply a backward-compatible compile path. The only way to get Unicode strings into the compiled program is via UTF-8 + BOM + wchar_t, which means that I need to convert all strings back to UTF-8 at the time of use.

There isn't any simple cross-platform method of converting wchar_t to UTF-8; for instance, what size and encoding does the wchar_t use? On Windows, UTF-16. On other platforms? It varies. See the ICU project for some details.

In the end I decided that I will avoid the conversion cost on all compilers other than vc2005+ with source like the following.

#if defined(_MSC_VER) && _MSC_VER > 1310
// Visual C++ 2005 and later require the source files to be in UTF-8, and all
// strings to be encoded as wchar_t, otherwise the strings will be converted
// into the local multibyte encoding and cause errors. To use a wchar_t string
// as UTF-8, it then needs to be converted back to UTF-8. This function is just
// a rough example of how to do this.
# include <windows.h>  // WideCharToMultiByte, CP_UTF8
# define utf8(str) ConvertToUTF8(L##str)
const char * ConvertToUTF8(const wchar_t * pStr) {
    static char szBuf[1024];
    WideCharToMultiByte(CP_UTF8, 0, pStr, -1, szBuf, sizeof(szBuf), NULL, NULL);
    return szBuf;
}
#else
// Visual C++ 2003 and gcc will use the string literals as is, so the files
// should be saved as UTF-8. gcc requires the files to not have a UTF-8 BOM.
# define utf8(str) str
#endif

Note that this code is just a simplified example. Production use would need to clean it up in a variety of ways (thread-safety, error checking, buffer size checks, etc).

This is used like the following code. It compiles cleanly and works correctly in my tests on gcc, vc2003, and vc2008:

std::string mText;
mText = utf8("Chinese (Traditional)");
mText = utf8("中国語 (繁体)");
mText = utf8("중국어 (번체)");
mText = utf8("Chinês (Tradicional)");

Is it possible to use UTF-8 encoding by default in Visual Studio 2008?

This blog article looks promising: UTF-8 strings and Visual C++

Most of the important content is still there, even though some of the pictures are broken. In short:

  1. First step, you have to make sure the source file is UTF-8 encoded with the byte order mark (BOM). The BOM is extremely important; without it, the C++ compiler will not behave correctly.

    In Visual Studio 2008, this can be done directly from the IDE with the Advanced Save Options command located in the File menu. A dialog box will pop up; select UTF-8 with signature.

  2. If you compile and run a test program, you are not going to get the expected result. What happens is that, although your text is properly encoded in UTF-8, for compatibility reasons the C/C++ runtime is by default set to the "C" locale. This locale assumes that all chars are 1 byte, which is not quite the case with UTF-8.

    You need to change the locale with the setlocale function to have the string properly interpreted by the input/output stream processors.

    In our case, whatever locale the system is using is fine; this is done by passing "" as the second parameter.

  3. To be rigorous, you must check the return value of setlocale; if it returns 0, an error occurred. In multi-language applications you will need to use setlocale with more precision, explicitly supplying the locale you want to use (for example, you may want your application to display Russian text on a Japanese computer). A minimal sketch of this follows the list.
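
Here is a minimal sketch of that step (assumed usage, not taken from the original article): switch the runtime to the system locale and verify that the call succeeded.

#include <clocale>
#include <iostream>

int main()
{
    // Illustrative sketch; error handling kept minimal.
    // "" selects the user's default locale; std::setlocale returns NULL on failure.
    if (std::setlocale(LC_ALL, "") == NULL)
    {
        std::cerr << "setlocale failed" << std::endl;
        return 1;
    }
    // Querying with NULL returns the name of the locale now in effect.
    std::cout << "Locale in use: " << std::setlocale(LC_ALL, NULL) << std::endl;
    return 0;
}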

I don't know of any good way to make this the default. I'm pretty sure it's not possible. Windows applications strongly prefer UTF-16, if you're compiling for Unicode. If at all possible, you should convert to that format.

Otherwise, the best possible option I can come up with is to define a simple macro (something akin to _T("string") defined in the Windows headers) that converts to UTF-8 using the above logic.

C++11 std::cout string literal in UTF-8 to Windows cmd console? (Visual Studio 2015)

This is a partial answer, found by following the link posted by luk32 and confirming the comments by Melebius (see below the question). It is not the complete answer, and I will be happy to accept your follow-up comment.

I have just found the UTF-8 Everywhere Manifesto, which touches on the problem. Point 17, "Q: How do I write a UTF-8 string literal in my C++ code?", says (explicitly also for the Microsoft C++ compiler):

However the most straightforward way is to just write the string as-is and save the source file encoded in UTF-8:

                                "∃y ∀x ¬(x ≺ y)"

Unfortunately, MSVC converts it to some ANSI codepage, corrupting the string. To work around this, save the file in UTF-8 without BOM. MSVC will assume that it is in the correct codepage and will not touch your strings. However, it renders it impossible to use Unicode identifiers and wide string literals (that you will not be using anyway).

I really like the manifesto. To put it briefly, bluntly, and possibly oversimplified, it says:

Ignore wstring, wchar_t, and the like. Ignore codepages. Ignore string literal prefixes like L, u, U, u8. Use UTF-8 everywhere. Write all literals "naturally". Ensure they are also stored that way in the compiled binary.

If the following code is saved as UTF-8 without a BOM...

#include <iomanip>
#include <iostream>
#include "windows.h"

using namespace std;

int main()
{
    // Tell the console to interpret the program's output as UTF-8.
    SetConsoleOutputCP(CP_UTF8);
    cout << "Příšerně žluťoučký kůň úpěl ďábelské ódy!" << endl;

    // Dump the raw bytes of the literal to see how the compiler stored it.
    int cnt = 0;
    for (unsigned int c : "Příšerně žluťoučký kůň úpěl ďábelské ódy!")
    {
        cout << hex << setw(2) << setfill('0') << (c & 0xff);
        ++cnt;
        if (cnt % 16 == 0) cout << endl;
        else if (cnt % 8 == 0) cout << " | ";
        else if (cnt % 4 == 0) cout << " ";
        else cout << ' ';
    }
    cout << endl;
}

It prints the following (the bytes should be UTF-8 encoded)...

[screenshot of the console output]

When saving the source as UTF-8 with a BOM, it prints a different result...

[screenshot of the console output]

However, the problem remains: how to set the console encoding programmatically so that the UTF-8 string is printed correctly.

I gave up. The cmd console is simply crippled, and it is not worth trying to fix it from the outside. I am accepting my own comment only to close the question. If anyone finds a decent solution related to the Catch unit test framework (it could be completely different), I will be glad to accept their comment as the answer.

Is there an easy way to write UTF-8 octets in Visual Studio?

You can use the still-undocumented pragma directive execution_character_set("utf-8"). With it, your char strings will be stored as UTF-8 in your binary. Note that this pragma is available only in Visual C++ compilers.

#include <iostream>
#include <cstring>
#include <tchar.h>

// Ask the compiler to store narrow string literals as UTF-8 in the binary.
#pragma execution_character_set("utf-8")

using namespace std;

const char *five_chars = "ĄĘĆŻ!";

int _tmain(int argc, _TCHAR* argv[])
{
    cout << "This is a UTF-8 string: " << five_chars << endl;
    cout << "...it's 5 characters long" << endl;
    cout << "...but it's " << strlen(five_chars) << " bytes long" << endl;
    return 0;
}
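
Assuming the pragma takes effect, strlen(five_chars) should report 9 bytes for the 5-character string, since each of the four Polish letters takes two bytes in UTF-8 and the exclamation mark takes one.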

C++ Visual Studio character encoding issues

Before I go any further, I should mention that what you are doing is not C/C++ compliant. The specification states in 2.2 which character sets are valid in source code. There is not much in there, and all of the characters allowed are ASCII. So... everything below is about a specific implementation (as it happens, VC2008 on a US-locale machine).

To start with, you have 4 chars on your cout line and 4 glyphs in the output, so the issue is not one of UTF-8 encoding, as that would combine multiple source chars into fewer glyphs.

From your source string to the display on the console, all of these things play a part:

  1. What encoding your source file is in (i.e. how your C++ file will be seen by the compiler)
  2. What your compiler does with a string literal, and what source encoding it understands
  3. How your << interprets the encoded string you're passing in
  4. What encoding the console expects
  5. How the console translates that output to a font glyph

Now...

1 and 2 are fairly easy ones. It looks like the compiler guesses what format the source file is in and decodes it to its internal representation. It generates the data chunk corresponding to the string literal in the current codepage, no matter what the source encoding was. I have failed to find explicit details or control over this.

3 is even easier: except for control codes, << just passes the data through for char *.

4 is controlled by SetConsoleOutputCP. It should default to your default system codepage. You can also figure out which one you have with GetConsoleOutputCP (the input is controlled differently, through SetConsoleCP).
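
As a brief illustration of that step (my own sketch, not from the original answer), the console output codepage can be inspected and changed like this:

#include <windows.h>
#include <iostream>

int main()
{
    // Report the codepage the console currently uses for output.
    std::cout << "Current output codepage: " << GetConsoleOutputCP() << std::endl;

    // Switch output to UTF-8 (codepage 65001); SetConsoleOutputCP returns FALSE on failure.
    if (!SetConsoleOutputCP(CP_UTF8))
        std::cout << "SetConsoleOutputCP failed" << std::endl;

    std::cout << "New output codepage: " << GetConsoleOutputCP() << std::endl;
    return 0;
}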

5 is a funny one. I banged my head trying to figure out why I could not get the é to show up properly, using CP1252 (Western European, Windows). It turns out that my system font does not have the glyph for that character, and helpfully uses the glyph of my standard codepage (a capital Theta, the same I would get if I did not call SetConsoleOutputCP). To fix it, I had to change the font I use on consoles to Lucida Console (a TrueType font).

Some interesting things I learned looking at this:

  • the encoding of the source does not matter, as long as the compiler can figure it out (notably, changing it to UTF-8 did not change the generated code; my "é" string was still encoded with CP1252 as 233 0)
  • VC is picking a codepage for the string literals that I do not seem to control
  • controlling what the console shows is more painful than I was expecting

So... what does this mean for you? Here are bits of advice:

  • don't use non-ASCII in string literals. Use resources, where you control the encoding.
  • make sure you know what encoding is expected by your console, and that your font has the glyphs to represent the chars you send.
  • if you want to figure out what encoding is being used in your case, I'd advise printing the actual value of the character as an integer. const char * a = "é"; std::cout << (unsigned int)(unsigned char)a[0]; shows 233 for me, which happens to be the encoding in CP1252. A compilable version of this check appears after the list.
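
The following is a small self-contained sketch of that diagnostic (my wording, not the original answer's): it dumps every byte of a literal so you can see which encoding the compiler actually emitted.

#include <cstdio>

int main()
{
    // What ends up in the binary depends on the compiler and source encoding.
    const char *a = "é";
    for (const char *p = a; *p != '\0'; ++p)
        std::printf("%u ", (unsigned int)(unsigned char)*p);
    std::printf("\n");  // 233 suggests CP1252; 195 169 suggests UTF-8
    return 0;
}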

BTW, if what you got was "ÓÚÛ¨" rather than what you pasted, then it looks like your 4 bytes are interpreted somewhere as CP850.

Converting C-Strings from Local Encoding to UTF8

Converting from UTF-16 to UTF-8 is purely a mechanical process, but converting from a local encoding to UTF-16 or UTF-8 involves some large, specialized lookup tables. The C runtime just turns around and calls WideCharToMultiByte and MultiByteToWideChar for the non-trivial cases.

As for having to use UTF-16 as an intermediate stage, as far as I know, there isn't any way around that - sorry.

Since you are already linking to an external library to get file input, you might as well link to the same library to get WideCharToMultiByte and MultiByteToWideChar.

Using the C runtime will make your code re-compilable for other operating systems (in theory), but it also adds a layer of overhead between you and the library that does all of the real work in this case: kernel32.dll.
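
As a rough sketch of that two-hop conversion (assumed usage of the Win32 APIs, with minimal error handling; the helper name LocalToUtf8 is illustrative), a string in the local codepage can be converted to UTF-8 via UTF-16 like this:

#include <windows.h>
#include <string>

// Illustrative helper: local ANSI codepage -> UTF-16 -> UTF-8.
std::string LocalToUtf8(const char *local)
{
    // First hop: local codepage -> UTF-16. Passing -1 includes the terminating NUL.
    int wlen = MultiByteToWideChar(CP_ACP, 0, local, -1, NULL, 0);
    if (wlen <= 0) return std::string();
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(CP_ACP, 0, local, -1, &wide[0], wlen);

    // Second hop: UTF-16 -> UTF-8.
    int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1, NULL, 0, NULL, NULL);
    if (ulen <= 0) return std::string();
    std::string utf8(ulen, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1, &utf8[0], ulen, NULL, NULL);

    // Drop the embedded terminating NUL that both hops carried along.
    utf8.resize(ulen - 1);
    return utf8;
}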

How to define a string literal containing non-ASCII characters?

Well, I always thought that by default VS uses UTF-8 to save files. But ÷ is F7 in the ISO 8859-1 encoding. If this is not enough for you, go here: how to change source file encoding in csharp project (visual studio / msbuild machine)?


