Multicharacter Literal in C and C++


I don't know how extensively this is used, but "implementation-defined" is a big red flag to me. As far as I know, this could mean that the implementation is free to ignore your character designations and just assign normal incrementing values if it wanted to. It may do something "nicer", but you can't rely on that behavior across compilers (or even compiler versions). At least "goto" has predictable (if undesirable) behavior...

That's my 2c, anyway.

Edit: on "implementation-defined":

From Bjarne Stroustrup's C++ Glossary:

implementation defined - an aspect of C++'s semantics that is defined for each implementation rather than specified in the standard for every implementation. An example is the size of an int (which must be at least 16 bits but can be longer). Avoid implementation defined behavior whenever possible. See also: undefined. TC++PL C.2.

also...

undefined - an aspect of C++'s semantics for which no reasonable behavior is required. An example is dereferencing a pointer with the value zero. Avoid undefined behavior. See also: implementation defined. TC++PL C.2.

I believe this means the comment is correct: it should at least compile, although anything beyond that is not specified. Note the advice in the definition, also.

Does C++ allow 8 byte long multi-character literals?

Yes, as long as your compiler has 8-byte ints and supports it.

The C++ standard is fairly terse regarding multi-character literals. This is all it has to say on the matter (C++14, §2.14.3/1):

An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.

(Emphasis mine)

As you see, pretty much all the standard says is that if multicharacter literals are supported (they don't have to be), they are of type int. The value is up to the compiler.
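
As a quick illustration (a minimal sketch, not something the standard guarantees: the value you get is whatever your compiler chooses, and the 0x61626364 in the comment is merely what GCC and Clang happen to produce for ASCII "abcd"):

#include <cstdio>
#include <type_traits>

int main() {
    auto v = 'abcd';   // multicharacter literal: conditionally-supported, type int
    static_assert(std::is_same<decltype(v), int>::value,
                  "a multicharacter literal has type int");
    std::printf("%#x\n", static_cast<unsigned>(v));   // implementation-defined value; GCC/Clang print 0x61626364
}

Most compilers also emit a -Wmultichar-style warning for the literal, which is a good hint that the construct is best avoided in portable code.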

How to change a string to a multicharacter literal in C++?

You could convert the string data to an integer at run time.

Since a char can be treated as a small integer, its value can be packed into a wider integer type.

If we assume a platform with 8 bits per character, where the plain char type is unsigned and unsigned int is 32 bits wide, we can pack the characters into an integer:

char text[] = "abcd";
unsigned int value;
// Parenthesize each shift: '+' binds more tightly than '<<', so the
// unparenthesized version would not compute what it appears to.
value = ((unsigned int)text[0] << 24)
      + ((unsigned int)text[1] << 16)
      + ((unsigned int)text[2] << 8)
      +  (unsigned int)text[3];

There may be more problems involved if you want to convert a string literal into a multicharacter literal at compile time.
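
If compile-time conversion is the goal, a constexpr function can do the packing with fully defined behavior. This is only a sketch under the same assumptions as above (8-bit characters, ASCII execution character set), and pack4 is a hypothetical helper name, not a standard facility:

#include <cstdint>

// Packs four 8-bit characters into a 32-bit value at compile time,
// most significant byte first (the same order as the runtime code above).
constexpr std::uint32_t pack4(const char (&s)[5]) {
    return (static_cast<std::uint32_t>(static_cast<unsigned char>(s[0])) << 24)
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(s[1])) << 16)
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(s[2])) << 8)
         |  static_cast<std::uint32_t>(static_cast<unsigned char>(s[3]));
}

static_assert(pack4("abcd") == 0x61626364, "well-defined, unlike the literal 'abcd'");

Unlike 'abcd', the value of pack4("abcd") is the same on every implementation that uses ASCII, because the packing order is spelled out rather than left to the compiler.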

Is a 64-bit character literal possible in C?

Are 64-bit character literals not a feature of C?

Indeed they are not. As per C99 §6.4.4.4, paragraph 10:

An integer character constant has type int. The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer. The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined.

So, character constants have type int, which on most modern platforms means int32_t. On the other hand, the actual value of the int resulting from a multi-byte character constant is implementation defined, so you can't really expect much from int x = 'abc';, unless you are targeting a specific compiler and compiler version. You should avoid using such statements in sane C code.

As for GCC-specific behavior, the GCC documentation says:

The numeric value of character constants in preprocessor expressions.
The preprocessor and compiler interpret character constants in the same way; i.e. escape sequences such as ‘\a’ are given the values they would have on the target machine.

The compiler evaluates a multi-character character constant a character at a time, shifting the previous value left by the number of bits per target character, and then or-ing in the bit-pattern of the new character truncated to the width of a target character. The final bit-pattern is given type int, and is therefore signed, regardless of whether single characters are signed or not. If there are more characters in the constant than would fit in the target int the compiler issues a warning, and the excess leading characters are ignored.

For example, 'ab' for a target with an 8-bit char would be interpreted as ‘(int) ((unsigned char) 'a' * 256 + (unsigned char) 'b')’, and '\234a' as ‘(int) ((unsigned char) '\234' * 256 + (unsigned char) 'a')’.
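
A small test program makes that rule concrete (assuming GCC, or another compiler that follows the same scheme; nothing below is guaranteed by the C standard itself, which only says the values are implementation-defined):

#include <assert.h>

int main(void) {
    // GCC's rule: shift the previous value left by 8 bits, then OR in the next byte.
    assert('ab' == (int)((unsigned char)'a' * 256 + (unsigned char)'b'));        // 0x6162
    assert('\234a' == (int)((unsigned char)'\234' * 256 + (unsigned char)'a'));  // 0x9c61
    return 0;
}

Both assertions mirror the examples from the GCC documentation quoted above; on a compiler with a different scheme they could fail.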

multicharacter literal misunderstanding

This appears to be a known MSVC compiler 'peculiarity'.

The C++ standard (draft N3797, §2.14.3/1) says:

A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.

So MSVC can certainly do this and claim it is 'implementation-defined' and not a bug.

If this were my call, I would probably say 'do not fix'. The risk of breaking existing code is far higher than the benefit of doing anything useful, and the issue is easily dealt with by an interesting question and answer on Stack Overflow.

Ref: see http://www.tech-archive.net/Archive/VC/microsoft.public.vc.language/2004-09/0079.html.


If you wish to reliably assemble equivalent values you have two choices, which produce matching or differing results depending on the machine's endianness.

You can use arithmetic operations (shift and mask) to produce an integer value:

 '\'' | ('/' << 8) | ('>' << 16) | ('\x20' << 24)

Or you can use string and cast operations to produce a string-like integer value:

*(int*)"\"/>\x20"

As noted in a comment, depending on how it is written, this last technique can lead to the generation of poor code: the string has to exist somewhere at run time and it will be null-terminated. The main justification for it is that it avoids the need for endian-sensitive #defines and pre-processing.
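
A middle ground (again only a sketch: it assumes int is four bytes, that the string supplies at least that many characters, and the resulting value is still byte-order dependent, exactly like the cast) is to memcpy the bytes into the integer. This avoids the alignment and strict-aliasing concerns of dereferencing the casted pointer; value_from_bytes is just an illustrative name:

#include <cstring>

int value_from_bytes(const char *s) {
    int v;
    std::memcpy(&v, s, sizeof v);   // copy the first sizeof(int) bytes; the value depends on byte order
    return v;
}

// usage: int v = value_from_bytes("\"/>\x20");

Compilers typically optimize such a memcpy away entirely, so this keeps the code well-defined at no run-time cost.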

See also this question: How to write a compile-time initialisation of a 4 byte character constant that is fully portable


