Character Sets - Not Clear

You need to distinguish between the source character set, the execution character set, the wide execution character set, and their basic versions:

The basic source character set:

§2.1.1: The basic source character set consists of 96 characters […]

This character set has exactly 96 characters. They fit into 7 bits. Characters like @ are not included.

Let's get some example binary representations for a few basic source characters. They can be completely arbitrary; they need not correspond to ASCII values.

A -> 0000000
B -> 0100100
C -> 0011101

The basic execution character set …

§2.1.3: The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits.

As stated, the basic execution character set contains all members of the basic source character set. It still doesn't include any other characters such as @. The basic execution character set can use different binary representations.

As stated, the basic execution character set also contains representations for alert, backspace, carriage return, and a null character.

A          -> 10110101010
B -> 00001000101 <- basic source character set
C -> 10101011111
----------------------------------------------------------
null -> 00000000000
Backspace -> 11111100011

If the basic execution character set uses 11 bits per character (as in this example), the char data type must be large enough to store 11 bits, but it may be larger.
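
You can check the actual width of char on your implementation with CHAR_BIT from <climits>. A minimal sketch (nothing here depends on any particular character set):

#include <climits>
#include <iostream>

int main() {
    // CHAR_BIT is the number of bits in a char. It is at least 8 and must be
    // large enough to hold every member of the basic execution character set.
    std::cout << "bits per char: " << CHAR_BIT << '\n';
    std::cout << "bytes per wchar_t: " << sizeof(wchar_t) << '\n';
}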

… and The basic execution wide character set:

The basic execution wide-character set is used for wide characters (wchar_t). It is basically the same as the basic execution character set, but it can have different binary representations as well.

A          -> 1011010101010110101010
B -> 0000100010110101011111 <- basic source character set
C -> 1010100101101000011011
---------------------------------------------------------------------
null -> 0000000000000000000000
Backspace -> 1111110001100000000001

The only fixed member is the null character which needs to be a sequence of 0 bits.

Converting between basic character sets:

§2.1.1.5: Each source character set member, escape sequence, or universal-character-name in character literals and string literals is converted to a member of the execution character set (2.13.2, 2.13.4).

When a C++ source file is compiled, each character of the source character set is converted into the basic execution (wide) character set.

Example:

const char*    string0 =  "BA\bC";
const wchar_t* string1 = L"BA\bC";

Since string0 is a narrow character string, it will be converted to the basic execution character set, and string1 will be converted to the basic execution wide character set.

string0 -> 00001000101 10110101010 11111100011 10101011111
string1 -> 0000100010110101011111 1011010101010110101010 // continued
1111110001100000000001 1010100101101000011011
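
On a real implementation you can inspect what the compiler actually produced for such literals. The following sketch (assuming an ordinary hosted compiler) prints the numeric value of every code unit of string0 and string1:

#include <iostream>

int main() {
    const char*    string0 =  "BA\bC";
    const wchar_t* string1 = L"BA\bC";

    // Print each narrow code unit as an unsigned number.
    for (const char* p = string0; *p != '\0'; ++p)
        std::cout << static_cast<unsigned>(static_cast<unsigned char>(*p)) << ' ';
    std::cout << '\n';

    // Print each wide code unit as an unsigned number.
    for (const wchar_t* p = string1; *p != L'\0'; ++p)
        std::cout << static_cast<unsigned long>(*p) << ' ';
    std::cout << '\n';
}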

Something about file encodings:

There are several kinds of file encodings. For example ASCII, which uses 7 bits per character, or Windows-1252, which uses 8 bits per character (commonly known as ANSI).
ASCII doesn't contain non-English characters. ANSI contains some European characters like ä, Ö, ô, Õ, ø.

Newer file encodings like UTF-8 or UTF-32 can contain characters of any language. UTF-8 characters are variable in length; UTF-32 characters are always 32 bits long.

File encoding requirements:

Most compilers offer a command line switch to specify the file encoding of the source file.

A C++ source file needs to be encoded in a file encoding which has a representation of the basic source character set. For example: the file encoding of the source file needs to have a representation of the ; character.

If the encoding chosen for the source file cannot represent the character ;, that encoding is not suitable as a C++ source file encoding.

Non-basic character sets:

Characters that are not part of the basic source character set may still belong to the source character set. The source character set corresponds to the file encoding.

For example: the @ character is not included in the basic source character set, but it may be included in the source character set. The chosen file encoding of the input source file might contain a representation of @. If it doesn't contain a representation for @, you can't use the @ character within strings.

Characters not included in the basic execution (wide) character set may still belong to the execution (wide) character set.

Remember that the compiler converts characters from the source character set to the execution character set and the execution wide character set. Therefore there needs to be a way to convert these characters.

For example: if you specify Windows-1252 as the source character set and ASCII as the execution character set, there is no way to convert this string:

const char* string0 = "string with European characters ö, Ä, ô, Ð.";

These characters can not be represented in ASCII.

Specifying character sets:

Here are some examples of how to specify the character sets using GCC. The default values are shown.

-finput-charset=UTF-8         <- source character set
-fexec-charset=UTF-8 <- execution character set
-fwide-exec-charset=UTF-32 <- execution wide character set

With UTF-8 and UTF-32 as the default encodings, C++ source files can contain strings with characters of any language. UTF-8 characters can be converted both ways without problems.
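
As an illustration, assuming the source file is saved as UTF-8 and the compiler uses these defaults (UTF-8 for narrow literals, UTF-32 for wide literals), the same non-ASCII text occupies a different number of code units depending on the literal type:

#include <iostream>
#include <string>

int main() {
    const char*     narrow = "käse";   // UTF-8 under the assumed defaults: 5 code units
    const wchar_t*  wide   = L"käse";  // UTF-32 under the assumed defaults: 4 code units
    const char32_t* utf32  = U"käse";  // always UTF-32 code points: 4 code units

    std::cout << std::char_traits<char>::length(narrow)    << " narrow code units\n";
    std::cout << std::char_traits<wchar_t>::length(wide)   << " wide code units\n";
    std::cout << std::char_traits<char32_t>::length(utf32) << " UTF-32 code units\n";
}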

The extended character set:

§1.1.3: multibyte character, a sequence of one or more bytes representing a member of the extended character set of either the source or the execution environment. The extended character set is a superset of the basic character set (2.2).

Multibyte characters are longer than a single byte of the normal character set. They contain a lead byte or shift sequence that marks them as multibyte characters.

Multibyte characters are processed according to the locale set in the user's runtime environment. These multibyte characters are converted at runtime to the encoding set in the user's environment.
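
A minimal sketch of such a runtime conversion, assuming the user's locale uses a UTF-8 multibyte encoding: std::mbstowcs converts a multibyte string to a wide string according to the current locale.

#include <clocale>
#include <cstdlib>
#include <iostream>

int main() {
    // Pick up the multibyte encoding from the user's environment
    // (assumed to be a UTF-8 locale here).
    std::setlocale(LC_ALL, "");

    const char* multibyte = "k\xc3\xa4se";   // "käse" as UTF-8 bytes

    wchar_t wide[16];
    std::size_t n = std::mbstowcs(wide, multibyte, 16);   // locale-dependent conversion
    if (n == static_cast<std::size_t>(-1))
        std::cout << "conversion failed for this locale\n";
    else
        std::cout << "converted " << n << " wide characters\n";
}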

What does "representable in the execution character set" mean?

The default execution character set of GCC is UTF-8.

And therein lies the problem. Namely, this is not true. Or at least, not in the way that the C++ standard means it.

The standard defines the "basic character set" as a collection of 96 different characters. However, it does not define an encoding for them. That is, the character "A" is part of the "basic character set". But the value of that character is not specified.

When the standard defines the "basic execution character set", it adds some characters to the basic set, but it also defines that there is a mapping from a character to a value. Outside of the NUL character being 0 however (and that the digits have to be encoded in a contiguous sequence), it lets implementations decide for themselves what that mapping is.
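
For example, the contiguous-digit guarantee is what makes the classic digit-to-value idiom portable, while the numeric value of a letter such as 'A' is implementation-defined (65 on ASCII-based systems, something else on, say, EBCDIC). A small sketch:

#include <iostream>

int main() {
    char c = '7';

    // Guaranteed by the standard: '0'..'9' are contiguous, so this always
    // yields the numeric value 7, whatever the character set.
    int value = c - '0';
    std::cout << "digit value: " << value << '\n';

    // Not guaranteed: the value of 'A' is implementation-defined.
    // It happens to be 65 on ASCII-based platforms.
    std::cout << "value of 'A' here: " << static_cast<int>('A') << '\n';
}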

Here's the issue: UTF-8 is not a "character set" by any reasonable definition of that term.

Unicode is a character set; it defines a series of characters which exist and what their meanings are. It also assigns each character in the Unicode character set a unique numeric value (a Unicode codepoint).

UTF-8 is... not that. UTF-8 is a scheme for encoding characters, typically in the Unicode character set (though it's not picky; it can work for any 21-bit number, and it can be extended to 32-bits).

So when GCC's documentation says:

[The execution character set] is under control of the user; the default is UTF-8, matching the source character set.

This statement makes no sense, since as previously stated, UTF-8 is a text encoding, not a character set.

What seems to have happened to GCC's documentation (and likely GCC's command line options) is that they've conflated the concept of "execution character set" with "narrow character encoding scheme". UTF-8 is how GCC encodes narrow character strings by default. But that's different from saying what its "execution character set" is.

That is, you can use UTF-8 to encode just the basic execution character set defined by C++. Using UTF-8 as your narrow character encoding scheme has no bearing on what your execution character set is.

Note that Visual Studio has a similarly-named option and makes a similar conflation of the two concepts. They call it the "execution character set", but they explain the behavior of the option as:

The execution character set is the encoding used for the text of your program that is input to the compilation phase after all preprocessing steps.

So... what is GCC's execution character set? Well, since their documentation has confused "execution character set" with "narrow string encoding", it's pretty much impossible to know.

So what does the standard require out of GCC's behavior? Well, take the rule you quoted and turn it around. A single universal-character-name in a character literal will either be a char or an int, and it will only be the latter if the universal-character-name names a character not in the execution character set. So it's impossible for a system's execution character set to include more characters than char has bits to allow them.

That is, GCC's execution character set cannot be Unicode in its entirety. It must be some subset of Unicode. It can choose for it to be the subset of Unicode whose UTF-8 encoding takes up 1 char, but that's about as big as it can be.
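
In code terms (a sketch, assuming a GCC-like setup where é needs more than one narrow code unit): an ordinary character literal whose character fits in a single char has type char, while a multicharacter or non-encodable literal has type int.

#include <type_traits>

int main() {
    // 'A' is in the execution character set, so the literal has type char.
    static_assert(std::is_same<decltype('A'), char>::value, "single character literal");

    // A multicharacter literal is conditionally-supported and has type int.
    static_assert(std::is_same<decltype('ab'), int>::value, "multicharacter literal");

    // With a UTF-8 narrow encoding, a character such as \u00e9 needs two code
    // units, so compilers that accept 'é' at all typically treat it the same
    // way: type int, usually with a "multi-character character constant" warning.
}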


While I've framed this as GCC's problem, it's also technically a problem in the C++ specification. The paragraph you quoted also conflates the encoding mechanism (ie: what char means) with the execution character set (ie: what characters are available to be stored).

This problem has been recognized and addressed by the addition of this wording:

A non-encodable character literal is a character-literal whose c-char-sequence consists of a single c-char that is not a numeric-escape-sequence and that specifies a character that either lacks representation in the literal's associated character encoding or that cannot be encoded as a single code unit. A multicharacter literal is a character-literal whose c-char-sequence consists of more than one c-char. The encoding-prefix of a non-encodable character literal or a multicharacter literal shall be absent or L. Such character-literals are conditionally-supported.

As these are proposed (and accepted) as resolutions for CWG issues, they also retroactively apply to previous versions of the standard.

What's the value of characters in execution character set?

Of course it can be ASCII's 65, if the execution character set is ASCII or a superset (such as UTF-8).

It doesn't say "it can't be ASCII", it says that it is something called "the execution character set".

What assumption is safe for a C++ implementation's character set?

  • The word "byte" seems to be used sloppily in the first quote. As far as C++ is concerned, a byte is always a char, but the number of bits it holds is platform-dependent (and available in CHAR_BIT). Sometimes you want to say "a byte is eight bits", in which case you get a different meaning, and that may have been the intended meaning in the phrase "a char has four bytes".

  • The execution character set may very well be larger than or incompatible with the input character set provided by the environment. Trigraphs and alternate tokens exist to allow the representation of execution-set characters with fewer input characters on such restricted platforms (e.g. not is identical for all purposes to !, and the latter is not available in all character sets or keyboard layouts); see the sketch below.
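
A minimal sketch of the alternate tokens mentioned in the last point; these are part of standard C++ and need no special headers or flags:

#include <iostream>

int main(int argc, char**) {
    // "not", "and" and "or" are alternative tokens for !, && and ||.
    // They exist so code can be written on systems or keyboards where
    // those punctuation characters are hard or impossible to type.
    bool has_args = argc > 1;
    if (not has_args and argc == 1)
        std::cout << "no arguments given\n";
    return 0;
}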

Contradiction in C18 standard (regarding character sets)?

Meaning that the source file character set is decoded and mapped to the source character set.

No, it does not mean that. My take is that the source is already assumed to be written in the source character set - how exactly would it make sense to "map the source character set to the source character set"? Either they are part of the set or they aren't. If you pick the wrong encoding for your source code, it will simply be rejected before the preprocessing even starts.

Translation phase 1 does two things not quite related to this at all:

  • Resolves trigraphs, which are standardized multibyte sequences (a short example follows this list).

  • Map multibyte characters into the source character set (defined in 5.2.1).

    The source character set consists of the basic character set, which is essentially the Latin alphabet plus various common symbols (5.2.1/3), and an extended character set, which is locale- and implementation-specific.

    The definition of multibyte characters is found at 5.2.1.2:

    The source character set may contain multibyte characters, used to represent members of
    the extended character set. The execution character set may also contain multibyte
    characters, which need not have the same encoding as for the source character set.

    Meaning various locale-specific oddball special cases, such as locale-specific trigraphs.
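
For instance, each trigraph is a three-character sequence starting with ?? that phase 1 replaces with a single character, even inside string literals. A sketch (assuming trigraph replacement is enabled, e.g. with GCC's -trigraphs option; trigraphs still exist in C18 but were removed from C++ in C++17):

#include <cstdio>

int main() {
    // With trigraph replacement enabled, ??( becomes [ and ??) becomes ]
    // in translation phase 1, before tokenization, so this prints "[]".
    // Without it, the ?? sequences are left alone and printed literally.
    std::printf("??(??)\n");
    return 0;
}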

All of this multibyte madness goes back to the first standardization in 1990 - according to anecdotes from those who were part of that committee, this was because members from various European countries weren't able to use various symbols on their national keyboards.

(I'm not sure how widespread the AltGr key on such keyboards was at the time. It remains a key subject to some serious button mashing when writing C on non-English keyboards anyway, to get access to {}[] symbols etc.)

What do the security implications of the default character set in mysqli_real_escape_string() mean?

How SQL queries are parsed is dependent on the connection character set. If you did this query:

$value = chr(0xE0) . chr(0x5C);
mysql_query("SELECT '$value'");

then if the connection character set was Latin-1 MySQL would see the invalid:

SELECT 'à\'

whereas if the character set were Shift-JIS, the byte sequence 0xE0,0x5C would be interpreted as a double-byte character:

SELECT '濬'

Add string literal escaping for security:

$value = mysql_real_escape_string($value);
mysql_query("SELECT '$value'");

Now if you've correctly set the connection character set to Shift-JIS with mysql_set_charset, MySQL still sees:

SELECT '濬'

But if you haven't set the connection character set, and MySQL's default character set is Shift-JIS but PHP's default character set is ASCII, PHP doesn't know that the trailing 0x5C character is part of a double-byte sequence, and escapes it, thinking it is generating the valid output:

SELECT 'à\\'

whilst MySQL reads it using Shift-JIS as:

SELECT '濬\'

With the trailing ' escaped with a backslash, this has left the string literal open. The next ' character in the query will end the string, leaving whatever follows in raw SQL content. If you can inject there, the query is vulnerable.

This problem only applies to a few East Asian encodings like Shift-JIS where multibyte sequences can contain bytes which on their own are valid ASCII characters like the backslash. If the mismatched encodings both treat low bytes as always-ASCII (strict ASCII supersets like the more-common mismatch of Latin-1 vs UTF-8), no such confusion is possible.

Luckily servers which default to these encodings are uncommon, so in practice this is a rarely-exploitable issue. But if you have to use mysql_real_escape_string you should do it right. (Better to avoid it completely by using parameterised queries though.)

How do you troubleshoot character encoding problems?

Firstly, "ugly no-char boxes" might not be an encoding problem, they might just be a sign you don't have a font installed that can display the glyphs in the page.

Most character encoding problems happen when strings are being passed from one system to another. For webapps, this is usually between the browser and the application, between the application and the filesystem and between the application and the database.

So you need to check where the mis-encoded data is coming from, what character encoding it has at the source, and what encoding it is being received as. The best way is to send through characters you know the system is having problems with, and examine them at each level of the app. What do they look like inside the app? In the database? When you get them back from the database? When they're displayed in the browser?

Sorry to be so general, but the question doesn't give much more to work with.

Character set conversion problem - debug invalid characters - reverse engineer earlier conversions

I put an expanded version of my comment as an answer:

Your viewer uses CP1252 (English and Western Europe, also called ANSI in Windows), CP1250 (Eastern Europe), or another similar code page. Most characters are coded in the same manner, with just a few language-specific differences. Your example does not include characters that differ between the two encodings, so I cannot say precisely.

Those code pages are used on Microsoft Windows, and they are based on (but not 100% compatible with) Latin-1, so it is common to see text interpreted with such an encoding. macOS and Linux are now heavily UTF-8 based. Windows uses Unicode internally (as UTF-16).

The old encoding is probably CP437: the standard code page in DOS, so it was also frequently used for CSV files. Other frequent old encodings are CP850 (Western Europe) and CP852 (Central Europe).

For the other questions you put in the comments: if you are looking for tools, that belongs on Super User (some editors allow you to specify the encoding; a browser opening a local file also lets you choose the encoding, and you may be able to copy the text out as Unicode; other tools sometimes have hidden options for importing files). If you want to do it programmatically, ask a new question on this site and specify the language. Python is well suited for such conversions (most scripting languages were created to handle text): Python has many encodings built in, and you just specify them when reading and when writing the files. R can also be told the input encoding.

How can character encoding be handled correctly in both PHP and a MySQL database?

Character set issues are often really tricky to figure out. Basically, you need to make sure that all of the following are true:

  • The DB connection is using UTF-8
  • The DB tables are using UTF-8
  • The individual columns in the DB tables are using UTF-8
  • The data is actually stored properly in the UTF-8 encoding inside the database (often not the case if you've imported from bad sources, or changed table or column collations)
  • The web page is requesting UTF-8
  • Apache is serving UTF-8

Here's a good tutorial on dealing with that list, from start to finish: https://web.archive.org/web/20110303024445/http://www.bluebox.net/news/2009/07/mysql_encoding/

It sounds like your problem is specifically that you've got double-encoded (or triple-encoded) characters, probably from changing character sets or importing already-encoded data with the wrong charset. There's a whole section on fixing that in the above tutorial.

How can I change Calcite's default character set encoding?

You should set the property calcite.default.charset to whatever character set you want to use. That said, I'm not sure this will solve all your problems. Support for other character sets is really a work in progress. See this discussion on the project mailing list.


