Specification of Source Charset Encoding in MSVC++, Like GCC "-finput-charset=CharSet"

Specification of source charset encoding in MSVC++, like gcc -finput-charset=CharSet

For those who subscribe to the motto "better late than never", Visual Studio 2015 Update 2 (compiler version 19.0) now supports this.

The new /source-charset command line switch allows you to specify the character set encoding used to interpret source files. It takes a single parameter, which can be either the IANA or ISO character set name:

/source-charset:utf-8

or the decimal identifier of a particular code page (preceded by a dot):

/source-charset:.65001

The switch is covered in the official documentation, and there is also a detailed article describing these new options on the Visual C++ Team Blog.

There is also a complementary /execution-charset switch that works in exactly the same way but controls how narrow character- and string-literals are generated in the executable. Finally, there is a shortcut switch, /utf-8, that sets both /source-charset:utf-8 and /execution-charset:utf-8.

These command-line options are incompatible with the old #pragma setlocale and #pragma execution_character_set directives, and they apply globally to all source files.
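
As a minimal sketch of the new switches in action (the file name, the /EHsc flag, and the expected output are my own; the behavior follows the documentation described above):

    // main.cpp -- saved as UTF-8 with a BOM, so the source encoding is unambiguous
    #include <cstdio>

    int main() {
        // With /utf-8 (shorthand for /source-charset:utf-8 /execution-charset:utf-8),
        // the literal below is stored in the executable as the UTF-8 bytes C3 A9,
        // so its length is 2. Without the switch, on a Windows-1252 system, it
        // would be re-encoded to the single byte E9 and the length would be 1.
        std::printf("%zu\n", sizeof("é") - 1);
        return 0;
    }

    // Build (assumed command line): cl /utf-8 /EHsc main.cpp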

For users stuck on older versions of the compiler, the best option is still to save your source files as UTF-8 with a BOM (as other answers have suggested, the IDE can do this when saving). The compiler will automatically detect this and behave appropriately. So, too, will GCC, which also accepts a BOM at the start of source files without choking to death, making this approach functionally portable.

Does VC have a compile option like '-fexec-charset' in GCC to set the execution character set?

AFAIK, VC++ doesn't have a command-line flag to let you specify a UTF-8 execution character set.
However, it does (sporadically) support the undocumented

#pragma execution_character_set("utf-8")

referred to here.

To get the effect of a command-line flag with this pragma, you can write the pragma in a header file, say preinclude.h, and pre-include that header in every compilation by passing the flag /FI preinclude.h. See this documentation for how to set this flag from the IDE.
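
A sketch of that setup (the header name preinclude.h is just a placeholder; the pragma and the /FI flag are the ones described above):

    // preinclude.h -- force-included into every translation unit via /FI
    #pragma once
    // Undocumented pragma: emit narrow string literals as UTF-8.
    #pragma execution_character_set("utf-8")

    // Each source file is then compiled as if it began with #include "preinclude.h":
    //   cl /FI preinclude.h /EHsc main.cpp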

The pragma was supported in VC++ 2010, then forgotten in VC++ 2012, and is supported again in VC++ 2013.

C++ Visual Studio character encoding issues

Before I go any further, I should mention that what you are doing is not C/C++ compliant. The specification states in §2.2 which character sets are valid in source code. There isn't much in there, and all the characters it allows are ASCII. So... everything below is about a specific implementation (as it happens, VC2008 on a US-locale machine).

To start with, you have 4 chars on your cout line and 4 glyphs in the output. So the issue is not one of UTF-8 encoding, since that would combine multiple source bytes into fewer glyphs.

From your source string to the display on the console, all of these things play a part:

  1. What encoding your source file is in (i.e. how your C++ file will be seen by the compiler)
  2. What your compiler does with a string literal, and what source encoding it understands
  3. How your << interprets the encoded string you're passing in
  4. What encoding the console expects
  5. How the console translates that output to a font glyph

Now...

1 and 2 are fairly easy ones. It looks like the compiler guesses what format the source file is in and decodes it to its internal representation. It then generates the data chunk for the string literal in the current codepage, no matter what the source encoding was. I have not found explicit details about, or control over, this behavior.

3 is even easier. Except for control codes, << just passes the data through for a char *.

4 is controlled by SetConsoleOutputCP. It should default to your system's default codepage. You can figure out which one you have with GetConsoleOutputCP (input is controlled separately, through SetConsoleCP).
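
For example, if you have decided to emit UTF-8 narrow strings, a sketch of matching the console to that choice might look like this (Windows-only; the calls are the documented console APIs mentioned above, and the string uses explicit UTF-8 bytes so it does not depend on how this file is compiled):

    #include <windows.h>
    #include <iostream>

    int main() {
        UINT previous = GetConsoleOutputCP();  // remember the current output code page
        SetConsoleOutputCP(CP_UTF8);           // 65001: interpret output bytes as UTF-8
        std::cout << "caf\xC3\xA9\n";          // C3 A9 is the UTF-8 encoding of 'é'
        SetConsoleOutputCP(previous);          // restore the old code page
        return 0;
    }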

5 is a funny one. I banged my head trying to figure out why I could not get the é to show up properly, using CP1252 (Western European, Windows). It turns out that the font my console uses does not have the glyph for that character, and it helpfully falls back to the glyph from my standard codepage (a capital Theta, the same one I would get if I did not call SetConsoleOutputCP). To fix it, I had to change the font I use on consoles to Lucida Console (a TrueType font).

Some interesting things I learned looking at this:

  • the encoding of the source does not matter, as long as the compiler can figure it out (notably, changing it to UTF-8 did not change the generated code; my "é" string was still encoded with CP1252 as 233 0)
  • VC is picking a codepage for the string literals that I do not seem to control.
  • controlling what the console shows is more painful than I was expecting

So... what does this mean for you? Here are some bits of advice:

  • don't use non-ASCII characters in string literals. Use resources, where you control the encoding.
  • make sure you know what encoding is expected by your console, and that your font has the glyphs to represent the chars you send.
  • if you want to figure out what encoding is being used in your case, I'd advise printing the actual value of the character as an integer. char * a = "é"; std::cout << (unsigned int) (unsigned char) a[0] shows 233 for me, which happens to be its encoding in CP1252. (A fuller version of this check appears just below.)
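
A slightly fuller version of that check, looping over every byte of the literal (the expected values in the comment assume the CP1252 vs. UTF-8 cases discussed above):

    #include <iostream>

    int main() {
        const char *a = "é";
        // Print each byte of the literal as an unsigned integer: under CP1252
        // you should see a single 233, under UTF-8 the two bytes 195 169.
        for (const char *p = a; *p != '\0'; ++p)
            std::cout << (unsigned int)(unsigned char)*p << ' ';
        std::cout << '\n';
        return 0;
    }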

BTW, if what you got was "ÓÚÛ¨" rather than what you pasted, then it looks like your 4 bytes are interpreted somewhere as CP850.

MSVC14 treats the u8 prefix differently depending on whether the source is UTF-8 or UTF-8 BOM

The compiler doesn't know what the encoding of the file is. It attempts to guess by looking at a prefix of the input. If it sees a UTF-8 encoded BOM then it assumes it is dealing with UTF-8. In the absence of that, and of any obvious UTF-16 characters, it defaults to something else. (ISO Latin 1? Whatever the common local MBCS is?)

Without the BOM the compiler fails to determine your input is UTF-8 encoded and so assumes it isn't.

It then sees each byte of the UTF-8 encoding as a separate character; for the plain literal those bytes are copied across verbatim, and for the u8 string each of them is re-encoded as UTF-8, giving the double encoding you see.
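
A worked example of that double encoding (assuming the scenario above: the file saved as UTF-8 without a BOM and compiled by MSVC14 on a system whose default codepage is Windows-1252):

    #include <cstdio>

    // Dump the bytes of a string in hex to see exactly what the compiler
    // stored for each literal.
    static void dump(const char *label, const char *s) {
        std::printf("%s:", label);
        for (; *s != '\0'; ++s) std::printf(" %02X", (unsigned int)(unsigned char)*s);
        std::printf("\n");
    }

    int main() {
        // The UTF-8 bytes of "é" are C3 A9; without a BOM they are read as the
        // two CP1252 characters 'Ã' (C3) and '©' (A9).
        dump("plain", "é");  // expected here: C3 A9 (bytes copied through verbatim)
        // In the u8 literal each of those two characters is re-encoded to UTF-8:
        // 'Ã' -> C3 83 and '©' -> C2 A9, i.e. the double encoding. (The cast is
        // only needed if this is ever built as C++20, where u8 literals are char8_t.)
        dump("u8   ", reinterpret_cast<const char *>(u8"é"));  // expected here: C3 83 C2 A9
        return 0;
    }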

The only solution seems to be to force the BOM; alternatively, use UTF-16, which is really what the Windows platform prefers.

See also Specification of source charset encoding in MSVC++, like gcc "-finput-charset=CharSet".

C standard : Character set and string encoding specification

C is not prescriptive about character sets. There is no such thing as a "default character set"; it's implementation-defined, although on most modern systems it is ASCII or UTF-8.

identifier character set (clang)

It's not so much an extension as it is Clang's interpretation of the Multibyte characters part of the standard. Clang supports UTF-8 source code files.

As to why, I guess "why not?" is the only real answer; it seems useful and reasonable to me to support a larger character set.

Here are the relevant parts of the standard (C11 draft):

5.2.1 Character sets

1 Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.

2 In a character constant or string literal, members of the execution character set shall be represented by corresponding members of the source character set or by escape sequences consisting of the backslash \ followed by one or more characters. A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.

3 Both the basic source and basic execution character sets shall have the following members: the 26 uppercase letters of the Latin alphabet

A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z

the 26 lowercase letters of the Latin alphabet

a b c d e f g h i j k l m
n o p q r s t u v w x y z

the 10 decimal digits

0 1 2 3 4 5 6 7 8 9

the following 29 graphic characters

! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~

the space character, and control characters representing horizontal tab, vertical tab, and form feed. The representation of each member of the source and execution basic character sets shall fit in a byte. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. In source files, there shall be some way of indicating the end of each line of text; this International Standard treats such an end-of-line indicator as if it were a single new-line character. In the basic execution character set, there shall be control characters representing alert, backspace, carriage return, and new line. If any other characters are encountered in a source file (except in an identifier, a character constant, a string literal, a header name, a comment, or a preprocessing token that is never converted to a token), the behavior is undefined.

4 A letter is an uppercase letter or a lowercase letter as defined above; in this International Standard the term does not include other characters that are letters in other alphabets.

5 The universal character name construct provides a way to name other characters.

And also:

5.2.1.2 Multibyte characters

1 The source character set may contain multibyte characters, used to represent members of the extended character set. The execution character set may also contain multibyte characters, which need not have the same encoding as for the source character set. For both character sets, the following shall hold:

— The basic character set shall be present and each character shall be encoded as a single byte.

— The presence, meaning, and representation of any additional members is locale-specific.

— A multibyte character set may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other locale-specific shift states when specific multibyte characters are encountered in the sequence. While in the initial shift state, all single-byte characters retain their usual interpretation and do not alter the shift state. The interpretation for subsequent bytes in the sequence is a function of the current shift state.

— A byte with all bits zero shall be interpreted as a null character independent of shift state. Such a byte shall not occur as part of any other multibyte character.

2 For source files, the following shall hold:

— An identifier, comment, string literal, character constant, or header name shall begin and end in the initial shift state.

— An identifier, comment, string literal, character constant, or header name shall consist of a sequence of valid multibyte characters.

wstring_converter exception when parsing a c-string

My guess would be that the string literal "+3°C" is not UTF-8 encoded because your IDE is using a different source character set.

You can only embed the character ° directly into the source code if the source file itself is UTF-8 encoded. If it's saved in some Windows codepage that represents ° differently, then it probably embeds one or more bytes into the string that are not valid UTF-8, so the conversion from UTF-8 to UTF-16 fails.

It works fine in a live demo such as http://coliru.stacked-crooked.com/a/23923c288ed5f9f3 because that runs on a different OS where the compiler assumes source files use UTF-8 by default (which is standard for GNU/Linux and other platforms with saner handling of non-ASCII text).

Try replacing it with a UTF-8 literal u8"+3\u2103" (using the universal character name for the DEGREE CELSIUS character) or u8"+3\u00B0C" (using the universal character name for the DEGREE SIGN character followed by a capital C).

That tells the compiler that you want a string containing the UTF-8 representation of exactly those Unicode characters, independent of the encoding of the source file itself.
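
For instance, a sketch using the <codecvt> machinery the question appears to be about (std::wstring_convert and std::codecvt_utf8_utf16 are deprecated since C++17, but still available):

    #include <codecvt>
    #include <iostream>
    #include <locale>
    #include <string>

    int main() {
        std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
        // The u8 prefix plus the escape sequence guarantees well-formed UTF-8 input,
        // regardless of how this source file itself is encoded. (The cast is only
        // needed under C++20, where u8 literals have type char8_t.)
        const char *utf8 = reinterpret_cast<const char *>(u8"+3\u00B0C");  // "+3°C"
        std::u16string utf16 = conv.from_bytes(utf8);  // throws std::range_error on invalid UTF-8
        std::cout << utf16.size() << " UTF-16 code units\n";  // expected: 4
        return 0;
    }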

What does be representable in execution character set mean?

The default execution character set of GCC is UTF-8.

And therein lies the problem. Namely, this is not true. Or at least, not in the way that the C++ standard means it.

The standard defines the "basic character set" as a collection of 96 different characters. However, it does not define an encoding for them. That is, the character "A" is part of the "basic character set". But the value of that character is not specified.

When the standard defines the "basic execution character set", it adds some characters to the basic set, but it also defines that there is a mapping from each character to a value. Beyond requiring the null character to be 0 and the digits to be encoded contiguously, however, it lets implementations decide for themselves what that mapping is.

Here's the issue: UTF-8 is not a "character set" by any reasonable definition of that term.

Unicode is a character set; it defines a series of characters which exist and what their meanings are. It also assigns each character in the Unicode character set a unique numeric value (a Unicode codepoint).

UTF-8 is... not that. UTF-8 is a scheme for encoding characters, typically from the Unicode character set (though it's not picky; it can work for any 21-bit number, and it can be extended to 32 bits).

So when GCC's documentation says:

[The execution character set] is under control of the user; the default is UTF-8, matching the source character set.

This statement makes no sense, since as previously stated, UTF-8 is a text encoding, not a character set.

What seems to have happened to GCC's documentation (and likely GCC's command line options) is that they've conflated the concept of "execution character set" with "narrow character encoding scheme". UTF-8 is how GCC encodes narrow character strings by default. But that's different from saying what its "execution character set" is.

That is, you can use UTF-8 to encode just the basic execution character set defined by C++. Using UTF-8 as your narrow character encoding scheme has no bearing on what your execution character set is.

Note that Visual Studio has a similarly named option and makes a similar conflation of the two concepts. They call it the "execution character set", but they describe the behavior of the option as:

The execution character set is the encoding used for the text of your program that is input to the compilation phase after all preprocessing steps.

So... what is GCC's execution character set? Well, since their documentation has confused "execution character set" with "narrow string encoding", it's pretty much impossible to know.

So what does the standard require of GCC's behavior? Well, take the rule you quoted and turn it around. A single universal-character-name in a character literal will either be a char or an int, and it will only be the latter if the universal-character-name names a character not in the execution character set. So it's impossible for an implementation's execution character set to include more characters than a char can represent.

That is, GCC's execution character set cannot be Unicode in its entirety. It must be some subset of Unicode. It can choose for it to be the subset of Unicode whose UTF-8 encoding takes up 1 char, but that's about as big as it can be.
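
As a quick illustration of that constraint (the behavior shown in the comments is assumed for GCC with its default UTF-8 narrow encoding; MSVC with a Windows-1252 execution character set would encode '\u00e9' as a single char instead):

    #include <iostream>
    #include <type_traits>

    int main() {
        // 'A' fits in a single code unit of the narrow encoding, so the literal
        // has type char.
        std::cout << std::is_same<decltype('A'), char>::value << '\n';      // 1
        // U+00E9 needs two UTF-8 code units, so GCC treats '\u00e9' as a
        // non-encodable/multicharacter constant of type int (and warns about it).
        std::cout << std::is_same<decltype('\u00e9'), int>::value << '\n';  // 1 on GCC
        return 0;
    }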


While I've framed this as GCC's problem, it's also technically a problem in the C++ specification. The paragraph you quoted also conflates the encoding mechanism (i.e. what char means) with the execution character set (i.e. what characters are available to be stored).

This problem has been recognized and addressed by the addition of this wording:

A non-encodable character literal is a character-literal whose c-char-sequence consists of a single c-char that is not a numeric-escape-sequence and that specifies a character that either lacks representation in the literal's associated character encoding or that cannot be encoded as a single code unit. A multicharacter literal is a character-literal whose c-char-sequence consists of more than one c-char. The encoding-prefix of a non-encodable character literal or a multicharacter literal shall be absent or L. Such character-literals are conditionally-supported.

As these are proposed (and accepted) as resolutions for CWG issues, they also retroactively apply to previous versions of the standard.

c++ string literal still confusing

C++ doesn't have normal Unicode support. You just can't write a normally globalized application in C++ without using third-party libraries. Read this insightful SO answer. If you really need to write an application which uses Unicode, I'd look at the ICU library.


