What Are the Different Character Sets Used For

What are the different character sets used for?

Here is a breakdown of the different character sets used by the compiler itself (all references to the standard are for C++14):

  1. The physical source file characters are those used in the C++ source. Most likely these are now encoded using some Unicode encoding, e.g., UTF-8 or UTF-16. If you are from a European or an American background you may be using ASCII, whose characters are conveniently encoded identically in UTF-8 (every ASCII file is a UTF-8 file but not the other way around). The physical source file characters may also be something unusual like EBCDIC.
  2. The basic source character set is what the compiler, at least conceptually, consumes. It is produced from the physical source file characters by mapping each of them either to its respective basic character or to a sequence of basic characters representing the physical source character using a universal character name (see 2.2 [lex.phases] paragraph 1). The basic source character set is just a set of 96 characters (2.3 [lex.charset] paragraph 1):

    a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9 _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '

    and the 5 special characters space (' '), horizontal tab (\t), vertical tab (\v), form feed (\f), and newline (\n)

    The mapping between the physical source character set and the basic character set is implementation defined.

  3. The basic execution character set and the basic execution wide-character set are character sets capable of representing the basic source character set, expanded by a few special characters:

    alert ('\a'), backspace ('\b'), carriage return ('\r'), and a null character ('\0')

    The difference between the non-wide and the wide version is whether the characters are represented using char or wchar_t.

  4. The execution character set and the execution wide-character set are implementation defined extensions of the basic execution character set and the basic execution wide-character set. 2.3 [lex.charset] paragraph 3 states that the additional members, and the values of those additional members, of the execution character set are locale-specific. It isn't clear which locale is referred to, but I suspect the locale used during compilation is meant. In any case, the execution character sets are implementation defined (also according to 2.3 [lex.charset] paragraph 3).

  5. Character and string literals are originally represented using the basic source character set, with some characters possibly using universal character names. All of these are converted at compile time into the execution character set. According to 2.14.3 [lex.ccon], character literals representable as one char in the execution character set just work. If multiple chars are needed, the character literal may be conditionally supported (and it would have type int). For string literals the conversion is described in 2.14.5 [lex.string]. Paragraph 9 states that UTF-8 string literals (e.g. u8"hello") result in a sequence of values corresponding to the code units of the UTF-8 string. Otherwise the translation of characters and universal character names is the same as that for character literals (in particular, it is implementation defined), although characters resulting in multi-byte sequences for narrow strings just result in multiple characters (a case that isn't necessarily supported for character literals). A small example follows this list.
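Here is a tiny sketch of the literal kinds mentioned in item 5. The only portable claim is the last line, which relies on the C++14 guarantee that u8 literals consist of UTF-8 code units; the values stored for the first two literals are implementation defined.

// Narrow, wide, and UTF-8 literals; the first two end up in the
// (implementation-defined) execution character sets, the third is
// guaranteed to be a sequence of UTF-8 code units.
const char    narrow[] =  "hello";
const wchar_t wide[]   = L"hello";
const char    utf8[]   = u8"\u00E9";   // U+00E9 (é) -> code units 0xC3 0xA9
static_assert(sizeof(utf8) == 3, "two UTF-8 code units plus the terminator");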

So far, only the result of compilation is considered. Any character which isn't part of a character or a string literal is used to specify what the code does. The interesting question is what happens to the literals. The literals are all basically translated into an implementation defined representation. That is, implementation defined means that what is supposed to happen is documented somewhere, but it can differ between implementations.

How does that help when dealing with characters or strings coming from somewhere? Well, any character or string which is read is converted to the corresponding execution character set. In particular, when a file is read, all characters are transformed to this common representation. Of course, for this transformation to work, the locale used for reading the file needs to be set up according to the encoding of that file. If the locale isn't explicitly specified, the global locale is used, which is initially determined by the system. The initial global locale is probably set based on user preferences, e.g., based on environment variables. If a file is read which uses a different encoding than this global locale, a corresponding locale matching the encoding of the file needs to be used.

Correspondingly, when writing characters using one of the execution character sets, these are converted according to the encoding specified by the current locale. Again, it may be necessary to replace the locale if a specific encoding is needed.
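As a sketch of how such a locale can be attached to a stream, assuming a hypothetical UTF-8-encoded file named input.txt (std::codecvt_utf8 is part of C++14, though deprecated since C++17):

#include <codecvt>    // std::codecvt_utf8 (C++14, deprecated in C++17)
#include <fstream>
#include <locale>
#include <string>

int main() {
    std::wifstream in("input.txt");                     // hypothetical file name
    // Convert the external UTF-8 bytes to the execution wide-character set
    // while reading; the stream's locale owns the facet and deletes it.
    in.imbue(std::locale(in.getloc(), new std::codecvt_utf8<wchar_t>));

    std::wstring line;
    while (std::getline(in, line)) {
        // 'line' now holds wide characters, independent of the file's encoding.
    }
}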

All this effectively means that internally to a program all string and character processing happens using the implementation defined execution character set. All characters being read by a program need to be converted to this character set, and all characters written start as characters in this execution character set and need to be converted appropriately to the external encoding. Of course, in an ideal setup the conversion between the execution character set and the external representation is the identity, e.g., because the execution character set uses UTF-8 and the external representation also uses UTF-8. The same applies to the execution wide-character set, except that in this case UTF-16 would be used (in one of its two variations, as UTF-16 can use either a big-endian or a little-endian representation).

What does character set and collation mean exactly?

From MySQL docs:

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.

Suppose that we have an alphabet with four letters: 'A', 'B', 'a', 'b'. We give each letter a number: 'A' = 0, 'B' = 1, 'a' = 2, 'b' = 3. The letter 'A' is a symbol, the number 0 is the encoding for 'A', and the combination of all four letters and their encodings is a character set.

Now, suppose that we want to compare two string values, 'A' and 'B'. The simplest way to do this is to look at the encodings: 0 for 'A' and 1 for 'B'. Because 0 is less than 1, we say 'A' is less than 'B'. Now, what we've just done is apply a collation to our character set. The collation is a set of rules (only one rule in this case): "compare the encodings." We call this simplest of all possible collations a binary collation.

But what if we want to say that the lowercase and uppercase letters are equivalent? Then we would have at least two rules: (1) treat the lowercase letters 'a' and 'b' as equivalent to 'A' and 'B'; (2) then compare the encodings. We call this a case-insensitive collation. It's a little more complex than a binary collation.

In real life, most character sets have many characters: not just 'A' and 'B' but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation marks. Also in real life, most collations have many rules: not just case insensitivity but also accent insensitivity (an "accent" is a mark attached to a character as in German 'ö') and multiple-character mappings (such as the rule that 'ö' = 'OE' in one of the two German collations).
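To make the quoted example concrete, here is a minimal C++ sketch of the two collations it describes; it assumes a single-byte encoding such as ASCII, and the function names are made up for the illustration.

#include <algorithm>
#include <cctype>
#include <string>

// Binary collation: compare the encodings (the stored byte values) directly.
bool binary_less(const std::string& lhs, const std::string& rhs) {
    return lhs < rhs;   // std::string compares the stored character values
}

// Case-insensitive collation: first map lowercase letters onto uppercase,
// then compare the encodings (only meaningful for single-byte encodings).
bool case_insensitive_less(std::string lhs, std::string rhs) {
    auto upper = [](unsigned char c) { return static_cast<char>(std::toupper(c)); };
    std::transform(lhs.begin(), lhs.end(), lhs.begin(), upper);
    std::transform(rhs.begin(), rhs.end(), rhs.begin(), upper);
    return lhs < rhs;
}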

What is the difference between charsets and character encoding

Charset is a synonym for character encoding.

Default encoding depends on the operating system and locale.

EDIT
http://www.w3.org/TR/REC-xml/#sec-TextDecl

http://www.w3.org/TR/REC-xml/#NT-EncodingDecl
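As a small illustration of the point that the default encoding depends on the operating system and locale, the following sketch asks the C++ runtime for the name of the user's preferred locale; the printed name (and thus the default encoding) varies by system.

#include <iostream>
#include <locale>

int main() {
    // The user-preferred locale is chosen by the operating system /
    // environment (e.g. LANG, LC_ALL); its name usually includes the encoding.
    std::cout << std::locale("").name() << '\n';   // e.g. "en_US.UTF-8"
}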

What does be representable in execution character set mean?

The default execution character set of GCC is UTF-8.

And therein lies the problem. Namely, this is not true. Or at least, not in the way that the C++ standard means it.

The standard defines the "basic character set" as a collection of 96 different characters. However, it does not define an encoding for them. That is, the character "A" is part of the "basic character set". But the value of that character is not specified.

When the standard defines the "basic execution character set", it adds some characters to the basic set, but it also defines that there is a mapping from a character to a value. Outside of the NUL character being 0 however (and that the digits have to be encoded in a contiguous sequence), it lets implementations decide for themselves what that mapping is.
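Those two constraints are the only values pinned down, and they can be checked directly:

// The only encoding guarantees in the basic execution character set:
// the null character is 0, and the decimal digits are contiguous.
static_assert('\0' == 0, "the null character has the value 0");
static_assert('9' - '0' == 9, "the digits are encoded contiguously");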

Here's the issue: UTF-8 is not a "character set" by any reasonable definition of that term.

Unicode is a character set; it defines a series of characters which exist and what their meanings are. It also gives each character in the Unicode character set a unique numeric value (a Unicode codepoint).

UTF-8 is... not that. UTF-8 is a scheme for encoding characters, typically in the Unicode character set (though it's not picky; it can work for any 21-bit number, and it can be extended to 32-bits).

So when GCC's documentation says:

[The execution character set] is under control of the user; the default is UTF-8, matching the source character set.

This statement makes no sense, since as previously stated, UTF-8 is a text encoding, not a character set.

What seems to have happened to GCC's documentation (and likely GCC's command line options) is that they've conflated the concept of "execution character set" with "narrow character encoding scheme". UTF-8 is how GCC encodes narrow character strings by default. But that's different from saying what its "execution character set" is.

That is, you can use UTF-8 to encode just the basic execution character set defined by C++. Using UTF-8 as your narrow character encoding scheme has no bearing on what your execution character set is.

Note that Visual Studio has a similarly-named option and makes a similar conflation of the two concepts. They call it the "execution character set", but they explain the behavior of the option as:

The execution character set is the encoding used for the text of your program that is input to the compilation phase after all preprocessing steps.

So... what is GCC's execution character set? Well, since their documentation has confused "execution character set" with "narrow string encoding", it's pretty much impossible to know.

So what does the standard require of GCC's behavior? Well, take the rule you quoted and turn it around. A single universal-character-name in a character literal will either be a char or an int, and it will only be the latter if the universal-character-name names a character not in the execution character set. So it's impossible for a system's execution character set to include more characters than can be represented in a single char.

That is, GCC's execution character set cannot be Unicode in its entirety. It must be some subset of Unicode. It can choose for it to be the subset of Unicode whose UTF-8 encoding takes up 1 char, but that's about as big as it can be.


While I've framed this as GCC's problem, it's also technically a problem in the C++ specification. The paragraph you quoted also conflates the encoding mechanism (ie: what char means) with the execution character set (ie: what characters are available to be stored).

This problem has been recognized and addressed by the addition of this wording:

A non-encodable character literal is a character-literal whose c-char-sequence consists of a single c-char that is not a numeric-escape-sequence and that specifies a character that either lacks representation in the literal's associated character encoding or that cannot be encoded as a single code unit. A multicharacter literal is a character-literal whose c-char-sequence consists of more than one c-char. The encoding-prefix of a non-encodable character literal or a multicharacter literal shall be absent or L. Such character-literals are conditionally-supported.

As these are proposed (and accepted) as resolutions for CWG issues, they also retroactively apply to previous versions of the standard.
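As a purely illustrative sketch of these categories (whether the last two lines are accepted at all, and which values they produce, is implementation-defined or conditionally-supported, so treat this as an assumption about a typical setup rather than guaranteed behavior):

// Illustrative only: acceptance and values are implementation-defined.
char a = 'A';        // in the basic execution character set: type char
auto b = '\u00E9';   // é: type char only if the narrow encoding can hold it in
                     // one code unit; otherwise a non-encodable literal of type int
auto c = 'ab';       // multicharacter literal: conditionally-supported, type int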

What is character encoding and why should I bother with it

(Note that I'm using some of these terms loosely/colloquially for a simpler explanation that still hits the key points.)

A byte can only have 256 distinct values, being 8 bits.

Since there are character sets with more than 256 characters, one cannot in general simply say that each character is a byte.

Therefore, there must be mappings that describe how to turn each character in a character set into a sequence of bytes. Some characters might be mapped to a single byte but others will have to be mapped to multiple bytes.

Those mappings are encodings, because they are telling you how to encode characters into sequences of bytes.

As for Unicode, at a very high level, Unicode is an attempt to assign a single, unique number to every character. Obviously that number has to be something wider than a byte since there are more than 256 characters :) Java uses a version of Unicode where every character is assigned a 16-bit value (and this is why Java characters are 16 bits wide and have integer values from 0 to 65535). When you get the byte representation of a Java character, you have to tell the JVM the encoding you want to use so it will know how to choose the byte sequence for the character.
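To make the idea of such a mapping concrete, here is a minimal sketch (in C++, to match the rest of this page) that hand-encodes a single Unicode code point into its UTF-8 byte sequence; the function name is made up for the example.

#include <string>

// Hand-rolled encoder: one Unicode code point (0 .. 0x10FFFF) to UTF-8 bytes.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp <= 0x7F) {                        // 1 byte: the ASCII range
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {                // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {               // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                 // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}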

How to programmatically identify the character set of a file?

It is practically impossible to identify arbitrary character sets just by looking at a raw byte dump. Some character sets show typical patterns by which they can be identified, but even that doesn't give a clear match. The best you can typically do is guess by exclusion, starting with character sets that have strict rules: if a file is not valid UTF-8, try Shift-JIS, then Big5, and so on. The problem is that any file is valid in Latin-1 and other single-byte encodings; that is what makes the task fundamentally impossible, and it is also virtually impossible to distinguish one single-byte charset from another. In the end you'd have to employ text analysis to determine whether a decoded piece of text appears to make sense or looks like gibberish, in which case the encoding was likely wrong.
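As a sketch of that first exclusion step, the following hypothetical helper rejects any byte sequence that is not structurally valid UTF-8; a real detector would also reject overlong encodings, surrogates, and code points above U+10FFFF.

#include <cstddef>
#include <vector>

// Returns false if the bytes cannot possibly be UTF-8.
bool looks_like_utf8(const std::vector<unsigned char>& bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = bytes[i];
        std::size_t extra;
        if      (lead < 0x80)           extra = 0;  // ASCII byte
        else if ((lead & 0xE0) == 0xC0) extra = 1;  // 2-byte sequence
        else if ((lead & 0xF0) == 0xE0) extra = 2;  // 3-byte sequence
        else if ((lead & 0xF8) == 0xF0) extra = 3;  // 4-byte sequence
        else return false;                          // invalid lead byte
        if (i + extra >= bytes.size()) return false;        // truncated sequence
        for (std::size_t j = 1; j <= extra; ++j)
            if ((bytes[i + j] & 0xC0) != 0x80) return false; // bad continuation
        i += extra + 1;
    }
    return true;
}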

In short: there's no foolproof way to detect character sets, period. You should always have metadata which specifies the charset.

Character sets - Not clear

You need to distinguish between the source character set, the execution character set, the execution wide-character set, and their basic versions:

The basic source character set:

§2.1.1: The basic source character set consists of 96 characters […]

This character set has exactly 96 characters. They fit into 7 bits. Characters like @ are not included.

Let's get some example binary representations for a few basic source characters. They can be completely arbitrary; there is no need for them to correspond to ASCII values.

A -> 0000000
B -> 0100100
C -> 0011101

The basic execution character set …

§2.1.3: The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits.

As stated, the basic execution character set contains all members of the basic source character set. It still doesn't include any other character like @, but it may use different binary representations.

It also contains representations for alert, backspace, carriage return, and a null character.

A         -> 10110101010
B         -> 00001000101   <- basic source character set
C         -> 10101011111
---------------------------------------------------
null      -> 00000000000
Backspace -> 11111100011

If the basic execution character set is 11 bits long (as in this example), the char data type must be large enough to store 11 bits, but it may be longer.

… and The basic execution wide character set:

The basic execution wide-character set is used for wide characters (wchar_t). It is basically the same as the basic execution character set but may have different binary representations as well.

A         -> 1011010101010110101010
B         -> 0000100010110101011111   <- basic source character set
C         -> 1010100101101000011011
--------------------------------------------------------------
null      -> 0000000000000000000000
Backspace -> 1111110001100000000001

The only fixed member is the null character which needs to be a sequence of 0 bits.

Converting between basic character sets:

§2.1.1.5: Each source character set member, escape sequence, or universal-character-name in character literals and string literals is converted to a member of the execution character set (2.13.2, 2.13.4).

When a C++ source file is compiled, each character of the source character set is converted into the basic execution (wide) character set.

Example:

const char*    string0 =  "BA\bC";
const wchar_t* string1 = L"BA\bC";

Since string0 is a narrow character string, it will be converted to the basic execution character set, and string1 will be converted to the basic execution wide-character set.

string0 -> 00001000101 10110101010 11111100011 10101011111
string1 -> 0000100010110101011111 1011010101010110101010
           1111110001100000000001 1010100101101000011011

Something about file encodings:

There are several kinds of file encodings. For example, ASCII, which is 7 bits long, and Windows-1252, which is 8 bits long (commonly known as ANSI).
ASCII doesn't contain non-English characters. ANSI contains some European characters like ä, Ö, Õ, and ø.

Newer file encodings like UTF-8 or UTF-32 can contain characters of any language. UTF-8 characters are variable in length; UTF-32 characters are 32 bits long.
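A small illustration of that length difference, assuming a typical platform where char16_t is 2 bytes and char32_t is 4 bytes:

#include <iostream>

int main() {
    // The same character, U+00F6 (ö), in different Unicode encoding forms.
    std::cout << sizeof(u8"\u00F6") << '\n';  // 3: two UTF-8 code units + '\0'
    std::cout << sizeof(u"\u00F6")  << '\n';  // typically 4: one 16-bit unit + terminator
    std::cout << sizeof(U"\u00F6")  << '\n';  // typically 8: one 32-bit unit + terminator
}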

File encoding requirements:

Most compilers offer a command line switch to specify the file encoding of the source file.

A C++ source file needs to be encoded in a file encoding which has a representation of the basic source character set. For example: the file encoding of the source file needs to have a representation of the ; character.

If you cannot represent the character ; within the encoding chosen for the source file, that encoding is not suitable as a C++ source file encoding.

Non-basic character sets:

Characters not included in the basic source character set belong to the source character set. The source character set is equivalent to the file encoding.

For example: the @ character is not included in the basic source character set, but it may be included in the source character set. The chosen file encoding of the input source file might contain a representation of @. If it doesn't contain a representation for @, you can't use the character @ within strings.

Characters not included in the basic execution (wide) character set belong to the execution (wide) character set.

Remember that the compiler converts characters from the source character set to the execution character set and the execution wide-character set. Therefore there needs to be a way to convert these characters.

For example: if you specify Windows-1252 as the encoding of the source character set and ASCII as the execution character set, there is no way to convert this string:

const char* string0 = "string with European characters ö, Ä, ô, Ð.";

These characters cannot be represented in ASCII.

Specifying character sets:

Here are some examples how to specify the character sets using gcc. The default values are included.

-finput-charset=UTF-8        <- source character set
-fexec-charset=UTF-8         <- execution character set
-fwide-exec-charset=UTF-32   <- execution wide-character set

With UTF-8 and UTF-32 as the default encodings, C++ source files can contain strings with characters of any language. UTF-8 characters can be converted both ways without problems.

The extended character set:

§1.1.3: multibyte character, a sequence of one or more bytes representing a member of the extended character set of either the source or the execution environment. The extended character set is a superset of the basic character set (2.2).

Multibyte characters can be longer than a single normal character. They contain an escape sequence marking them as multibyte characters.

Multibyte characters are processed according to the locale set in the user's runtime environment. They are converted at runtime to the encoding set in the user's environment.
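A minimal sketch of that runtime conversion using the standard C library facilities; it assumes the environment locale uses UTF-8, so the two bytes below are the UTF-8 encoding of 'ö'.

#include <clocale>
#include <cstddef>
#include <cstdio>
#include <cwchar>

int main() {
    // Adopt the locale (and hence the multibyte encoding) configured in the
    // user's environment, e.g. via LANG or LC_ALL.
    std::setlocale(LC_ALL, "");

    const char multibyte[] = "\xC3\xB6";   // UTF-8 bytes of 'ö' (assumed encoding)
    wchar_t wide = 0;
    std::mbstate_t state{};
    // Decode one multibyte character into a wide character.
    std::size_t len = std::mbrtowc(&wide, multibyte, sizeof multibyte, &state);
    if (len != static_cast<std::size_t>(-1) && len != static_cast<std::size_t>(-2))
        std::printf("decoded one wide character from %zu byte(s)\n", len);
}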


