C++ Visual Studio Character Encoding Issues

Before I go any further, I should mention that what you are doing is not C/C++ compliant. Section 2.2 of the specification states which character sets are valid in source code. There isn't much in there, and all the characters used are in ASCII. So everything below is about a specific implementation (as it happens, VC2008 on a US-locale machine).

To start with, you have 4 chars on your cout line, and 4 glyphs on the output. So the issue is not one of UTF-8 encoding, as that would combine multiple source chars into fewer glyphs.

From your source string to the display on the console, all of these things play a part:

  1. What encoding your source file is in (i.e. how your C++ file will be seen by the compiler)
  2. What your compiler does with a string literal, and what source encoding it understands
  3. How your << interprets the encoded string you're passing in
  4. What encoding the console expects
  5. How the console translates that output to a font glyph

Now...

1 and 2 are fairly easy ones. It looks like the compiler guesses what format the source file is in, and decodes it to its internal representation. It then generates the data chunk for the string literal in the current codepage, no matter what the source encoding was. I have failed to find explicit details of, or control over, this.

3 is even easier. Except for control codes, << just passes the data down for char *.

4 is controlled by SetConsoleOutputCP. It should default to your default system codepage. You can also find out which one you have with GetConsoleOutputCP (the input is controlled separately, through SetConsoleCP).

5 is a funny one. I banged my head trying to figure out why I could not get the é to show up properly using CP1252 (Western European, Windows). It turns out that my system font does not have the glyph for that character, and helpfully uses the glyph of my standard codepage (capital Theta, the same one I would get if I did not call SetConsoleOutputCP). To fix it, I had to change the font I use on consoles to Lucida Console (a TrueType font).

Some interesting things I learned looking at this:

  • The encoding of the source does not matter, as long as the compiler can figure it out (notably, changing it to UTF-8 did not change the generated code; my "é" string was still encoded with CP1252 as 233 0).
  • VC picks a codepage for the string literals that I do not seem to control.
  • Controlling what the console shows is more painful than I was expecting.

So... what does this mean for you? Here are some bits of advice:

  • Don't use non-ASCII characters in string literals. Use resources, where you control the encoding.
  • Make sure you know what encoding your console expects, and that your font has the glyphs to represent the chars you send.
  • If you want to figure out what encoding is being used in your case, I'd advise printing the actual value of the character as an integer. const char * a = "é"; std::cout << (unsigned int) (unsigned char) a[0] shows 233 for me, which happens to be the encoding in CP1252.

BTW, if what you got was "ÓÚÛ¨" rather than what you pasted, then it looks like your 4 bytes are interpreted somewhere as CP850.

Character Set Encoding of Visual Studio

Visual C++ supports:



Strings:

  • string: A type that describes a specialization of the template class basic_string with elements of type char.
  • u16string: A type that describes a specialization of the template class basic_string with elements of type char16_t.
  • u32string: A type that describes a specialization of the template class basic_string with elements of type char32_t.
  • wstring: A type that describes a specialization of the template class basic_string with elements of type wchar_t.

https://learn.microsoft.com/en-us/cpp/standard-library/string-typedefs?view=vs-2019



Character literals

  • Ordinary character literals of type char, for example 'a'
  • UTF-8 character literals of type char (char8_t in C++20), for example u8'a'
  • Wide-character literals of type wchar_t, for example L'a'
  • UTF-16 character literals of type char16_t, for example u'a'
  • UTF-32 character literals of type char32_t, for example U'a'

https://learn.microsoft.com/en-us/cpp/cpp/string-and-character-literals-cpp?view=vs-2019#character-literals



Encoding:

  • A character literal without a prefix is an ordinary character literal. The value of an ordinary character literal containing a single character, escape sequence, or universal character name that can be represented in the execution character set has a value equal to the numerical value of its encoding in the execution character set. An ordinary character literal that contains more than one character, escape sequence, or universal character name is a multicharacter literal. A multicharacter literal or an ordinary character literal that can't be represented in the execution character set is conditionally-supported, has type int, and its value is implementation-defined.

  • A character literal that begins with the L prefix is a wide-character literal. The value of a wide-character literal containing a single character, escape sequence, or universal character name has a value equal to the numerical value of its encoding in the execution wide-character set unless the character literal has no representation in the execution wide-character set, in which case the value is implementation-defined. The value of a wide-character literal containing multiple characters, escape sequences, or universal character names is implementation-defined.

  • A character literal that begins with the u8 prefix is a UTF-8 character literal. The value of a UTF-8 character literal containing a single character, escape sequence, or universal character name has a value equal to its ISO 10646 code point value if it can be represented by a single UTF-8 code unit (corresponding to the C0 Controls and Basic Latin Unicode block). If the value can't be represented by a single UTF-8 code unit, the program is ill-formed. A UTF-8 character literal containing more than one character, escape sequence, or universal character name is ill-formed.

  • A character literal that begins with the u prefix is a UTF-16 character literal. The value of a UTF-16 character literal containing a single character, escape sequence, or universal character name has a value equal to its ISO 10646 code point value if it can be represented by a single UTF-16 code unit (corresponding to the basic multi-lingual plane). If the value can't be represented by a single UTF-16 code unit, the program is ill-formed. A UTF-16 character literal containing more than one character, escape sequence, or universal character name is ill-formed.

  • A character literal that begins with the U prefix is a UTF-32 character literal. The value of a UTF-32 character literal containing a single character, escape sequence, or universal character name has a value equal to its ISO 10646 code point value. A UTF-32 character literal containing more than one character, escape sequence, or universal character name is ill-formed.

https://learn.microsoft.com/en-us/cpp/cpp/string-and-character-literals-cpp?view=vs-2019#encoding



    #include <string>
    using namespace std::string_literals; // enables s-suffix for std::string literals

    int main()
    {
        // Character literals
        auto c0 = 'A'; // char
        auto c1 = u8'A'; // char
        auto c2 = L'A'; // wchar_t
        auto c3 = u'A'; // char16_t
        auto c4 = U'A'; // char32_t

        // String literals
        auto s0 = "hello"; // const char*
        auto s1 = u8"hello"; // const char*, encoded as UTF-8
        auto s2 = L"hello"; // const wchar_t*
        auto s3 = u"hello"; // const char16_t*, encoded as UTF-16
        auto s4 = U"hello"; // const char32_t*, encoded as UTF-32

        // Raw string literals containing unescaped \ and "
        auto R0 = R"("Hello \ world")"; // const char*
        auto R1 = u8R"("Hello \ world")"; // const char*, encoded as UTF-8
        auto R2 = LR"("Hello \ world")"; // const wchar_t*
        auto R3 = uR"("Hello \ world")"; // const char16_t*, encoded as UTF-16
        auto R4 = UR"("Hello \ world")"; // const char32_t*, encoded as UTF-32

        // Combining string literals with standard s-suffix
        auto S0 = "hello"s; // std::string
        auto S1 = u8"hello"s; // std::string
        auto S2 = L"hello"s; // std::wstring
        auto S3 = u"hello"s; // std::u16string
        auto S4 = U"hello"s; // std::u32string

        // Combining raw string literals with standard s-suffix
        auto S5 = R"("Hello \ world")"s; // std::string from a raw const char*
        auto S6 = u8R"("Hello \ world")"s; // std::string from a raw const char*, encoded as UTF-8
        auto S7 = LR"("Hello \ world")"s; // std::wstring from a raw const wchar_t*
        auto S8 = uR"("Hello \ world")"s; // std::u16string from a raw const char16_t*, encoded as UTF-16
        auto S9 = UR"("Hello \ world")"s; // std::u32string from a raw const char32_t*, encoded as UTF-32
    }

VS code does not recognise letters Õ, Ä, Ö, Ü but encoding is UTF-8

There actually was a super easy fix for this...

    #include <windows.h>
    SetConsoleOutputCP(65001); // set the console output code page to UTF-8 (call this inside main)

With the following code you can find out which encoding your console uses; if it is wrong, it can be changed with SetConsoleOutputCP(...):

    #include <windows.h>
    unsigned cp = GetConsoleOutputCP(); // current console output code page

And it needs to match the encoding of the text you print - mine did not :)

What is the default encoding for source files in Visual Studio 2017?

Also there has been an option "Advanced Save Options\Encoding" which did allow to change the encoding of newly saved files which is missing in VS2017.

This feature already exists! You can save files with a specific character encoding to support bi-directional languages. You can also specify an encoding when opening a file, so that Visual Studio displays the file correctly.

To save a file with encoding

  1. From the File menu, choose Save File As, and then click the drop-down button next to the Save button. The Advanced Save Options dialog box is displayed.
  2. Under Encoding, select the encoding to use for the file.
  3. Optionally, under Line endings, select the format for end-of-line characters.

Are all file types saved with UTF8-BOM encoding in VS2017?

In my case, VS stores all the files with CodePage 1252 encoding.

Is it possible to configure the encoding for new files in VS2017?

However, my Visual Studio version is 15.6.1. Some people had the same problem as you in earlier versions of VS2017, but they were told: "We have fixed this issue and it's available in Visual Studio 2017 15.3"

If that does not work, for C++ projects take a look at /utf-8 (Set Source and Executable character sets to UTF-8).

Will VS2017 change the encoding of "old" files which don't have UTF8-BOM?

By default, Visual Studio detects a byte-order mark to determine if the source file is in an encoded Unicode format, for example, UTF-16 or UTF-8. If no byte-order mark is found, it assumes the source file is encoded using the current user code page, unless you have specified a code page by using /utf-8 or the /source-charset option.

Some people encountered a problem that came from the .editorconfig file, as below:

root = true

[*]
indent_style = tab
indent_size = 4
tab_width = 4
trim_trailing_whitespace = true
insert_final_newline = true
charset = utf-8

That final charset line is probably doing it... but I'm not asking for 'utf-8-with-bom'!


