Why Do C++ Streams Use Char Instead of Unsigned Char

Why do C++ streams use char instead of unsigned char?

Possibly I've misunderstood the question, but conversion from unsigned char to char isn't unspecified: when the value does not fit, the result is implementation-defined (4.7/3 in the C++ standard).

The type of a 1-byte character in C++ is "char", not "unsigned char". This gives implementations a bit more freedom to do the best thing on the platform (for example, the standards body may have believed that there exist CPUs where signed byte arithmetic is faster than unsigned byte arithmetic, although that's speculation on my part). It is also there for compatibility with C. The result of removing this kind of existential uncertainty from C++ is C# ;-)

Given that the "char" type exists, I think it makes sense for the usual streams to use it even though its signedness is left to the implementation. So maybe your question is answered by the answer to, "why didn't C++ just define char to be unsigned?"

Is it better to use char or unsigned char array for storing raw data?

UPDATE: C++17 introduced std::byte, which is more suited to "raw" data buffers than using any manner of char.
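
As a rough illustration of the std::byte approach, here is a minimal sketch that assumes a C++17 compiler. The helper name read_u16_le is hypothetical and not part of any library; it only shows that a std::byte buffer has to be converted explicitly before it can be treated as a number.

```cpp
#include <cassert>
#include <cstddef>   // std::byte, std::to_integer, std::size_t
#include <cstdint>
#include <vector>

// Hypothetical helper: decode a little-endian 16-bit value from a raw buffer.
std::uint16_t read_u16_le(const std::vector<std::byte>& buf, std::size_t offset) {
    assert(offset + 2 <= buf.size());  // caller must supply two readable bytes
    // std::byte has no arithmetic operators, so the conversion to an
    // integer type has to be spelled out with std::to_integer.
    const auto lo = std::to_integer<std::uint16_t>(buf[offset]);
    const auto hi = std::to_integer<std::uint16_t>(buf[offset + 1]);
    return static_cast<std::uint16_t>(lo | (hi << 8));
}
```

Because std::byte is neither a character type nor an arithmetic type, such a buffer cannot be streamed as text or added to by accident.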

For earlier C++ versions:

  • unsigned char emphasises that the data is not "just" text

  • if you've got what's effectively "byte" data from e.g. a compressed stream, a database table backup file, an executable image, a jpeg... then unsigned is appropriate for the binary-data connotation mentioned above

    • unsigned works better for some of the operations you might want to do on binary data, e.g. there are undefined and implementation defined behaviours for some bit operations on signed types, and unsigned values can be used directly as indices in arrays

    • you can't accidentally pass an unsigned char* to a function expecting char* and have it operated on as presumed text

    • in these situations it's usually more natural to think of the values as being in the range 0..255, after all - why should the "sign" bit have a different kind of significance to the other bits in the data?

  • if you're storing "raw data" that, at an application logic/design level, happens to be 8-bit numeric data, then by all means choose either unsigned or explicitly signed char as appropriate to your needs
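
The point about using unsigned values directly as array indices can be sketched as follows. This is only an illustration under the assumption of 8-bit bytes; byte_histogram is a hypothetical helper name.

```cpp
#include <array>
#include <cassert>
#include <climits>  // UCHAR_MAX
#include <cstddef>

static_assert(UCHAR_MAX == 255, "this sketch assumes 8-bit bytes");

// Hypothetical helper: count how often each byte value occurs in a buffer.
std::array<std::size_t, 256> byte_histogram(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    std::array<std::size_t, 256> counts{};  // value-initialised to zero
    for (std::size_t i = 0; i < size; ++i) {
        // data[i] is always in 0..255, so it is a valid index as it stands.
        // With a plain char buffer, a byte such as 0xE9 could be negative
        // where char is signed and would index before the array.
        ++counts[data[i]];
    }
    return counts;
}
```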

Does a byte oriented FILE stream contain `char`s or `unsigned char`s?

It doesn't really matter. The standard uses unsigned char in a few chosen places because it allows a precise formulation there:

  • fgetc is specified to return an unsigned char converted to an int, so the result is known to be non-negative except when it is EOF. There is thus no possible confusion between EOF and a valid character, a confusion that causes bugs when the result of fgetc is stored directly in a char without checking for EOF first.

  • fputc is specified to take an int and convert it to an unsigned char because this conversion is well defined. A formulation that did not use unsigned char could, if written carelessly, make a sequence like

    int c = fgetc(stdin);
    if (c != EOF)
        fputc(c, stdout);

undefined behaviour for negative character values when char is signed.
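
To make the EOF confusion concrete, here is a small sketch. Both helper names are hypothetical, and the usage note assumes EOF is -1, which is common but not guaranteed.

```cpp
#include <cassert>
#include <cstdio>   // EOF

// Hypothetical helper: the int that fgetc is specified to report for a
// byte, i.e. the byte as unsigned char, widened to int.
int as_fgetc_value(char byte) {
    const int value = static_cast<unsigned char>(byte);
    assert(value >= 0);  // never negative, so it cannot collide with EOF
    return value;
}

// Hypothetical helper showing the buggy pattern: the fgetc result is
// narrowed to char before the EOF test.
bool narrowed_equals_eof(int fgetc_result) {
    const char narrowed = static_cast<char>(fgetc_result);
    return narrowed == EOF;
}
```

Where char is signed and EOF is -1, narrowed_equals_eof(0xFF) is true, so a valid 0xFF byte would end a read loop early. Where char is unsigned, narrowed_equals_eof(EOF) is false and the loop would never see the end of the file. Keeping the result in an int, as in the loop above, avoids both problems.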

C/C++ Why to use unsigned char for binary data?

In C the unsigned char data type is the only data type that has all the following three properties simultaneously

  • it has no padding bits, that is, all storage bits contribute to the value of the data
  • no bitwise operation starting from a value of that type, when converted back into that type, can produce overflow, trap representations or undefined behavior
  • it may alias other data types without violating the "aliasing rules", that is, access to the same data through a differently typed pointer is guaranteed to see all modifications

If these are the properties of a "binary" data type you are looking for, you should definitely use unsigned char.

For the second property we need a type that is unsigned. For such types all conversions are defined with modulo arithmetic, here modulo UCHAR_MAX+1, which is 256 on the vast majority of architectures. Any conversion of a wider value to unsigned char thereby just corresponds to truncation to the least significant byte.

The two other character types generally don't work the same way. signed char is signed, so conversion of values that don't fit into it is not well defined. char is not fixed to be signed or unsigned: on a particular platform to which your code is ported it might be signed even if it is unsigned on yours.
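
The modulo rule for conversions to unsigned char can be checked with a short sketch. The helper names low_byte and to_byte are hypothetical, and the concrete values in the usage note assume UCHAR_MAX is 255.

```cpp
#include <cassert>
#include <climits>  // UCHAR_MAX

// Hypothetical helper: keep only the least significant byte of a value.
unsigned char low_byte(unsigned long value) {
    const unsigned char result = static_cast<unsigned char>(value);
    // The conversion is defined as reduction modulo UCHAR_MAX + 1.
    assert(result == value % (UCHAR_MAX + 1UL));
    return result;
}

// Hypothetical helper: the same conversion starting from a signed value.
unsigned char to_byte(int value) {
    return static_cast<unsigned char>(value);  // -1 becomes UCHAR_MAX
}
```

With 8-bit bytes, low_byte(0x1234UL) yields 0x34 and to_byte(-1) yields 255, on every conforming implementation.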

Implement `memcpy()`: Is `unsigned char *` needed, or just `char *`?

In theory, your code might run on a machine which forbids one bit pattern in a signed char. It might use ones' complement or sign-magnitude representations of negative integers, in which one bit pattern would be interpreted as a 0 with a negative sign. Even on two's-complement architectures, the standard allows the implementation to restrict the range of negative integers so that INT_MIN == -INT_MAX, although I don't know of any actual machine which does that.

So, according to §6.2.6.2p2, there may be one signed character value which an implementation might treat as a trap representation:

Which of these [representations of negative integers] applies is implementation-defined, as is whether the value with sign bit 1 and all value bits zero (for the first two [sign-magnitude and two's complement]), or with sign bit and all value bits 1 (for ones' complement), is a trap representation or a normal value. In the case of sign and magnitude and ones’ complement, if this representation is a normal value it is called a negative zero.

(There cannot be any other trap values for character types, because §6.2.6.2 requires that signed char not have any padding bits, which is the only other way that a trap representation can be formed. For the same reason, no bit pattern is a trap representation for unsigned char.)

So, if this hypothetical machine has a C implementation in which char is signed, then it is possible that copying an arbitrary byte through a char will involve copying a trap representation.

For signed integer types other than char (if it happens to be signed) and signed char, reading a value which is a trap representation is undefined behaviour. But §6.2.6.1/5 allows reading and writing these values for character types only:

Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression that does not have character type, the behavior is undefined. If such a representation is produced by a side effect that modifies all or any part of the object by an lvalue expression that does not have character type, the behavior is undefined. Such a representation is called a trap representation. (Emphasis added)

(The third sentence is a bit clunky, but to simplify: storing a value into memory is a "side effect that modifies all of the object", so it's permitted as well.)

In short, thanks to that exception, you can use char in an implementation of memcpy without worrying about undefined behaviour.
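
For comparison, a byte-wise copy written with unsigned char sidesteps the question entirely. This is a minimal sketch, not the library's implementation; copy_bytes is a hypothetical name chosen to avoid clashing with memcpy, and like memcpy it assumes the two regions do not overlap.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for memcpy: copy count bytes from src to dest.
void* copy_bytes(void* dest, const void* src, std::size_t count) {
    assert((dest != nullptr && src != nullptr) || count == 0);
    auto* d = static_cast<unsigned char*>(dest);
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t i = 0; i < count; ++i) {
        d[i] = s[i];  // every bit pattern is a valid unsigned char value
    }
    return dest;
}
```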

However, the same is not true of strcpy. strcpy must check for the trailing NUL byte which terminates a string, which means it needs to compare the value it reads from memory with 0. And the comparison operators (indeed, all arithmetic operators) first perform integer promotion on their operands, which will convert the char to an int. Integer promotion of a trap representation is undefined behaviour, as far as I know, so on the hypothetical C implementation running on the hypothetical machine, you would need to use unsigned char in order to implement strcpy.
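
A sketch of how a string copy could read its bytes as unsigned char, so that the comparison with 0 never promotes a signed char value. The name copy_string is hypothetical, and the caller is assumed to provide a large enough destination.

```cpp
#include <cassert>

// Hypothetical stand-in for strcpy that compares bytes as unsigned char.
char* copy_string(char* dest, const char* src) {
    assert(dest != nullptr && src != nullptr);
    auto* d = reinterpret_cast<unsigned char*>(dest);
    const auto* s = reinterpret_cast<const unsigned char*>(src);
    // Copy each byte, including the terminating NUL, then stop.
    while ((*d++ = *s++) != 0) {
    }
    return dest;
}
```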

Why use unsigned chars for writing to binary files? And why shouldn't stream operators be used to write to binary files?

chars are the smallest type in C/C++ (by definition, sizeof(char) == 1). It's the usual way to view objects as a sequence of bytes. unsigned is used to keep signed arithmetic from getting in the way, and because it best represents binary contents (a value between 0 and 255).

To operate on binary files, streams provide the read and write functions. The insertion and extraction operators are formatted. It's working for you just by chance: for instance, if you output an integer with << it will actually output the textual representation of the integer value and not its binary representation. In your provided example, you cast a float to an unsigned char before outputting it, which actually converts the real value to a small integer. What do you get when you try to read the float back from the file?
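
The difference between unformatted and formatted output can be sketched with an in-memory stream. All helper names here are hypothetical, and a std::stringstream stands in for the file; the bytes written are in the machine's native float representation, so such data is not portable between platforms.

```cpp
#include <cassert>
#include <ios>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: write the object representation of a float.
void write_float(std::ostream& out, float value) {
    out.write(reinterpret_cast<const char*>(&value), sizeof value);
}

// Hypothetical helper: read the bytes back into a float.
float read_float(std::istream& in) {
    float value = 0.0f;
    in.read(reinterpret_cast<char*>(&value), sizeof value);
    assert(in.gcount() == static_cast<std::streamsize>(sizeof value));
    return value;
}

// Round-trip through a binary in-memory stream.
float round_trip(float value) {
    std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);
    write_float(buffer, value);
    return read_float(buffer);
}

// Formatted insertion, for contrast: this produces text.
std::string formatted_text(float value) {
    std::ostringstream out;
    out << value;
    return out.str();
}
```

Here round_trip(1.5f) gives back 1.5f exactly, while formatted_text(1.5f) is the three-character string "1.5".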

Why is 'char' signed by default in C++?

It isn't.

The signedness of a char that isn't either a signed char or unsigned char is implementation-defined. Many systems make it signed to match other types that are signed by default (like int), but it may be unsigned on some systems. (Say, if you pass -funsigned-char to GCC.)
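
If code needs to know which choice the implementation made, it can ask at compile time. A minimal sketch, with plain_char_is_signed as a hypothetical helper name:

```cpp
#include <climits>      // CHAR_MIN
#include <limits>
#include <type_traits>

// char is a distinct type from both signed char and unsigned char,
// whichever of the two it happens to behave like.
static_assert(!std::is_same<char, signed char>::value, "char is its own type");
static_assert(!std::is_same<char, unsigned char>::value, "char is its own type");

// Hypothetical helper: report the choice made by the current implementation.
constexpr bool plain_char_is_signed() {
    return std::numeric_limits<char>::is_signed;
}
```

The same information is available from CHAR_MIN in <climits>, which is negative exactly when char is signed.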


