What Do the C and C++ Standards Say About Bit-Level Integer Representation and Manipulation


(1) If all the bits in an integer type are zero, does the integer as a whole represent zero?

Yes, the bit pattern consisting of all zeroes always represents 0:

The representations of integral types shall define values by use of a pure binary numeration system.⁴⁹ [§3.9.1/7]

[Footnote 49]: A positional representation for integers that uses the binary digits 0 and 1, in which the values represented by successive bits are additive, begin with 1, and are multiplied by successive integral power of 2, except perhaps for the bit with the highest position.


(2) If any bit in an integer type is one, does the integer as a whole represent non-zero? (if this is a "yes" then some representations like sign-and-magnitude would be additionally restricted)

No. In fact, signed magnitude is specifically allowed:

[ Example: this International Standard permits 2's complement, 1's complement and signed magnitude representations for integral types. —end example ] [§3.9.1/7]


(3) Is there a guaranteed way to check if any bit is not set?

I believe the answer to this is "no," if you consider signed types. It is equivalent to equality testing with a bit pattern of all ones, which is only possible if you have a way to produce a signed number with a bit pattern of all ones. For an unsigned number this representation is guaranteed, but converting from unsigned to signed is implementation-defined if the number is unrepresentable:

If the destination type is signed, the value is unchanged if it can be represented in the destination type (and bit-field width); otherwise, the value is implementation-defined. [§4.7/3]
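
For an unsigned type, though, the all-ones value is pinned down by the standard, so a check like the following sketch works portably (the function name is mine, purely for illustration):

#include <limits.h>
#include <stdbool.h>

/* Sketch for unsigned int only: the all-ones bit pattern is guaranteed
 * to represent UINT_MAX, so "some bit is not set" is a plain comparison. */
bool any_bit_clear(unsigned x) {
    return x != UINT_MAX; /* equivalently: x != ~0u */
}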


(4) Is there a guaranteed way to check if any bit is set?

I don't think so, because signed magnitude is allowed—0 would compare equal to −0. But it should be possible with unsigned numbers.
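
For unsigned types this reduces to a comparison with zero, since all-bits-zero is guaranteed to represent 0 (see question 1). A minimal sketch (the function name is mine):

#include <stdbool.h>

/* Sketch for unsigned int: all bits zero always represents 0,
 * so any set bit makes the value compare unequal to 0. */
bool any_bit_set(unsigned x) {
    return x != 0;
}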


(5) Is there a guaranteed way to set the left-most and/or right-most bits?

Again, I believe the answer is "yes" for unsigned numbers, but "no" for signed numbers. Shifts are undefined for negative signed numbers:

Otherwise, if E1 has a signed type and non-negative value, and E1 × 2^E2 is representable in the result type, then that is the resulting value; otherwise, the behavior is undefined. [§5.8/2]
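
For an unsigned type, a sketch like this sets both edge bits, assuming unsigned int has no padding bits, so that sizeof times CHAR_BIT equals the width (the function name is mine):

#include <limits.h>

/* Sketch: set the right-most bit with 1u and the left-most bit with a
 * shift, assuming no padding bits in unsigned int. */
unsigned set_edge_bits(unsigned x) {
    unsigned msb = 1u << (sizeof(unsigned) * CHAR_BIT - 1);
    return x | msb | 1u;
}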

What part of integer bit-representation is part of The Standard?

The C standard discusses the representations of integer types in section 6.2.6.2.

It specifies a binary representation for integer types. For unsigned types, the bits are divided into value bits and padding bits. Padding bits do not contribute to the value; there needn't be any padding bits. For signed types, there is a single sign bit. Signed types may be represented using either sign and magnitude, two's complement, or one's complement (two's complement is nearly universal for modern systems).

The ordering of the bits, and the presence and number of padding bits, are implementation-defined. (Most modern implementations do not have padding bits).

The concept of padding bits, and the restriction to the three canonical representations, were introduced in C99.

The bitwise operators (<<, >>, &, et al.) are defined in terms of the bits making up the representation of the values of the operands, but the requirements on the representations are specific enough that this is unambiguous in most cases. The description of the << and >> shift operators specifically says that, for example, the result of E1 << E2 is E1 × 2^E2; see section 6.5.7 of the cited N1570 draft.

The C++ standard has a non-normative note that says it permits 2's complement, 1's complement and signed magnitude representations for integral types, but there doesn't seem to be an explicit statement that no other representations are permitted. It does require that "the representations of integral types shall define values by use of a pure binary numeration system". You can see the gory details in the N4296 working draft of the C++ standard (or in any other draft, or in the standard itself if you have a copy).

How to do a bit representation in a C-standard way?

In general, it's not that hard to accommodate unusual platforms in most cases (if you don't want to simply assume 8-bit char, two's complement, no padding, no trap representations, and truncating unsigned-to-signed conversion): the standard mostly gives enough guarantees (though a few macros to inspect certain implementation details would be helpful).

As far as a strictly conforming program can observe (outside bit-fields), 5 is always encoded as 00...0101. This is not necessarily the physical representation (whatever this should mean), but what is observable by portable code. A machine using Gray code internally, for example, would have to emulate a "pure binary notation" for bitwise operators and shifts.

For negative values of signed types, different encodings are allowed, which leads to different (but well-defined for every case) results when re-interpreting as the corresponding unsigned type. For example, strictly conforming code must distinguish between (unsigned)n and *(unsigned *)&n for a signed integer n: They are equal for two's complement without padding bits, but different for the other encodings if n is negative.
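
A small sketch of that distinction (assuming int and unsigned have no padding bits, so the type-punned read is harmless; aliasing between a signed type and its unsigned counterpart is explicitly allowed):

#include <stdio.h>

/* Sketch: (unsigned)n is defined by value (modulo 2^N), while
 * *(unsigned *)&n re-reads the raw encoding. For n == -1 both give
 * UINT_MAX on two's complement, but they differ on the other encodings. */
void show_encodings(void) {
    int n = -1;
    printf("conversion: %u\n", (unsigned)n);
    printf("type-pun:   %u\n", *(unsigned *)&n);
}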

Further, padding bits may exist, and signed integer types may have more padding bits than their corresponding unsigned counterparts (but not the other way round; type-punning from signed to unsigned is always valid). sizeof cannot be used to get the number of non-padding bits, so e.g. to get an unsigned value where only the sign bit (of the corresponding signed type) is set, something like this must be used:

#define TYPE_PUN(to, from, x) ( *(to *)&(from){(x)} )

unsigned sign_bit = TYPE_PUN(unsigned, int, INT_MIN) &
                    TYPE_PUN(unsigned, int, -1) & ~1u;

(there are probably nicer ways) instead of

unsigned sign_bit = 1u << sizeof sign_bit * CHAR_BIT - 1;

as this may shift by more than the width. (I don't know of a constant expression giving the width, but sign_bit from above can be right-shifted until it's 0 to determine it; GCC can constant-fold that.) Padding bits can be inspected by memcpying into an unsigned char array, though they may appear to "wobble": reading the same padding bit twice may give different results.
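
That right-shift loop could look like this sketch, building on sign_bit from the snippet above (the function name is mine):

/* Sketch: count how many right-shifts empty sign_bit to recover the
 * width of int (value bits plus sign bit), independent of sizeof. */
int int_width(unsigned sign_bit) {
    int width = 0;
    for (; sign_bit; sign_bit >>= 1)
        ++width;
    return width; /* e.g. 32 on a typical two's complement machine */
}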

If you want the bit pattern (without padding bits) of a signed integer (little endian):

#include <stdio.h>  /* putchar */
#include <limits.h> /* INT_MAX */

int print_bits_u(unsigned n) {
    for(; n; n >>= 1) {
        putchar(n & 1 ? '1' : '0'); /* n&1 never traps */
    }
    return 0;
}

int print_bits(int n) {
    return print_bits_u(*(unsigned *)&n & INT_MAX);
    /* This masks padding bits if int has more of them than unsigned int.
     * Note that INT_MAX is promoted to unsigned int here. */
}

int print_bits_2scomp(int n) {
    return print_bits_u(n);
}

print_bits gives different results for negative numbers depending on the representation used (it gives the raw bit pattern), while print_bits_2scomp gives the two's complement representation (possibly with a greater width than signed int has, if unsigned int has fewer padding bits).

Care must be taken not to generate trap representations when using bitwise operators and when type-punning from unsigned to signed, see below how these can potentially be generated (as an example, *(int *)&sign_bit can trap with two's complement, and -1 | 1 can trap with ones' complement).

Unsigned-to-signed integer conversion (if the converted value isn't representable in the target type) is always implementation-defined. I would expect non-two's-complement machines to be more likely to differ from the common definition, though technically it could also become an issue on two's complement implementations.

From C11 (n1570) 6.2.6.2:

(1) For unsigned integer types other than unsigned char, the bits of the object representation shall be divided into two groups: value bits and padding bits (there need not be any of the latter). If there are N value bits, each bit shall represent a different power of 2 between 1 and 2^(N−1), so that objects of that type shall be capable of representing values from 0 to 2^N − 1 using a pure binary representation; this shall be known as the value representation. The values of any padding bits are unspecified.

(2) For signed integer types, the bits of the object representation shall be divided into three groups: value bits, padding bits, and the sign bit. There need not be any padding bits; signed char shall not have any padding bits. There shall be exactly one sign bit. Each bit that is a value bit shall have the same value as the same bit in the object representation of the corresponding unsigned type (if there are M value bits in the signed type and N in the unsigned type, then M ≤ N). If the sign bit is zero, it shall not affect the resulting value. If the sign bit is one, the value shall be modified in one of the following ways:

  • the corresponding value with sign bit 0 is negated (sign and magnitude);
  • the sign bit has the value −(2^M) (two's complement);
  • the sign bit has the value −(2^M − 1) (ones' complement).

Which of these applies is implementation-defined, as is whether the value with sign bit 1 and all value bits zero (for the first two), or with sign bit and all value bits 1 (for ones' complement), is a trap representation or a normal value. In the case of sign and magnitude and ones' complement, if this representation is a normal value it is called a negative zero.

Does the C++ standard require the maximum of unsigned integer numbers to be of the form 2^N-1?

C++ specifies the ranges of the integral types by reference to the C standard. The C standard says:

For unsigned integer types other than unsigned char, the bits of the object representation shall be divided into two groups: value bits and padding bits (there need not be any of the latter). If there are N value bits, each bit shall represent a different power of 2 between 1 and 2^(N−1), so that objects of that type shall be capable of representing values from 0 to 2^N − 1 using a pure binary representation; this shall be known as the value representation. The values of any padding bits are unspecified.

Moreover, C++ requires:

Unsigned integers shall obey the laws of arithmetic modulo 2^n where n is the number of bits in the value representation of that particular size of integer.

Putting all this together, we find that an unsigned integral type has n value bits, represents the values in the range [0, 2^n), and obeys the laws of arithmetic modulo 2^n.
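
A tiny illustration of that modulo guarantee (a sketch, nothing implementation-specific):

#include <limits.h>
#include <stdio.h>

int main(void) {
    unsigned u = UINT_MAX;
    u = u + 1;         /* wraps to 0: arithmetic is modulo 2^n by definition */
    printf("%u\n", u); /* prints 0 */
    return 0;
}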

Are the results of bitwise operations on signed integers defined?

For negative operands, << has undefined behavior and the result of >> is implementation-defined (usually as "arithmetic" right shift). << and >> are conceptually not bitwise operators. They're arithmetic operators equivalent to multiplication or division by the appropriate power of two for the operands on which they're well-defined.

As for the genuine bitwise operators ^, ~, |, and &, they operate on the bit representation of the value in the (possibly promoted) type of the operand. Their results are well-defined for each possible choice of signed representation (two's complement, ones' complement, or sign and magnitude), but in the latter two cases it's possible that the result will be a trap representation if the implementation treats the "negative zero" representation as a trap. Personally, I almost always use unsigned expressions with bitwise operators so that the result is 100% well-defined in terms of values rather than representations.
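
For instance, doing the bitwise work on the unsigned counterpart keeps everything in terms of values (a sketch; the function name is mine):

/* Sketch: convert to unsigned first, where every bitwise operation is
 * defined purely in terms of values and no result can be a trap
 * representation. */
unsigned flip_low_byte(int n) {
    return (unsigned)n ^ 0xFFu; /* well-defined for any n */
}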

Finally, note that this answer as written may only apply to C. C and C++ are very different languages and while I don't know C++ well, I understand it may differ in some of these areas from C...

Bitfield manipulation in C

Bitfields are not quite as portable as you think, as "C gives no guarantee of the ordering of fields within machine words" (The C book)

Ignoring that, used correctly, either method (bit-fields or explicit masks and shifts) is safe. Both methods also allow symbolic access to integral variables. You can argue that the bit-field method is easier to write, but it also means more code to review. A sketch of the two methods appears below.
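
(Field and macro names here are made up for illustration.)

#include <stdio.h>

/* Method 1: a bit-field; the compiler does the masking, but the
 * ordering of fields within the word is implementation-defined. */
struct flags_bf {
    unsigned ready : 1;
    unsigned error : 1;
};

/* Method 2: explicit masks on an unsigned integer. */
#define READY_MASK 0x1u
#define ERROR_MASK 0x2u

int main(void) {
    struct flags_bf f = {0};
    f.error = 1;          /* symbolic access via the field */

    unsigned g = 0;
    g |= ERROR_MASK;      /* same effect, explicit mask */

    printf("%u %u\n", (unsigned)f.error, (g & ERROR_MASK) ? 1u : 0u);
    return 0;
}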

What is the value of ~0 in C?

Use:

~0U >> 1

The 'U' suffix gives unsigned shift behavior.

So why doesn't ~0 turn out to be 0xffffffff?

First, look at what 0 is in a four-byte representation:

BIT NUMBER   31                                    0
             ▼                                     ▼
number bits  0000 0000 0000 0000 0000 0000 0000 0000
             ▲                                     ▲
            MSB                                   LSB

LSB - Least Significant Bit (numbered 0)
MSB - Most Significant Bit (numbered 31)

Now, ~ is the bitwise NOT operator; it flips all the bits in 0:

BIT NUMBER   31                                    0
             ▼                                     ▼
number bits  1111 1111 1111 1111 1111 1111 1111 1111
             ▲                                     ▲
            MSB                                   LSB

Because the MSB is 1, this representation is treated as a negative number, and its value is found using two's complement math: it is -1.

How?

What is 1? It is:

number bits  0000 0000 0000 0000 0000 0000 0000 0001
             ▲                                     ▲
            MSB                                   LSB

The one's complement of 1 is:

number bits  1111 1111 1111 1111 1111 1111 1111 1110
             ▲                                     ▲
            MSB                                   LSB

Two's complement? Add 1 to the one's complement, that is:

number bits  1111 1111 1111 1111 1111 1111 1111 1111
             ▲                                     ▲
            MSB                                   LSB

This is the same bit pattern you get for ~0, and that is why you are getting -1 as output.

Now, what about the >> shift operator?

In most C implementations the >> operator on a signed type is defined as an arithmetic right shift, which preserves the sign bit (MSB). So ~0 >> 1 is nothing but -1; the value stays the same.

6.5.7 [Bitwise shift operators]

5 The result of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 / 2^E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.

What you want is what's called an unsigned (logical) right shift, and the behavior you need can be obtained using an unsigned number; that is why I added the U suffix, as in 0U.

How to print INT_MIN and INT_MAX?

Printing INT_MIN and INT_MAX is a bit tricky in C (because of undefined and implementation-defined behavior around setting the MSB and overflow), so I have written the following code:

#include <stdio.h>
#include <limits.h> /* for CHAR_BIT */

int main(void) {
    /* Converting an out-of-range unsigned value to int is
     * implementation-defined, which is part of the trickiness. */
    int my_int_min = 1U << ((sizeof(int) * CHAR_BIT) - 1);
    int my_int_max = ~0U >> 1;
    printf("INT_MIN = %d\n", my_int_min);
    printf("INT_MAX = %d\n", my_int_max);
    return 0;
}

Running it (for example on codepad), the output is:

INT_MIN = -2147483648
INT_MAX = 2147483647

How does this code work?

Note that for a 32-bit int the range is [-2147483648, 2147483647], which equals [-2^31, 2^31 - 1].

INT_MIN: -2^31 == -2147483648 is:

    1000 0000 0000 0000 0000 0000 0000 0000
    ▲                                     ▲
   MSB                                   LSB

In the expression 1U << ((sizeof(int) * CHAR_BIT) - 1), I shift the first bit, the LSB (that is, 1), all the way left to the MSB position. Because in C shifting a 1 into the sign bit is undefined behavior when the operand is a signed type, I used the unsigned literal 1U.

6.5.7 [Bitwise shift operators]

The result of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are filled with zeros. If E1 has an unsigned type, the value of the result is E1 × 2^E2, reduced modulo one more than the maximum value representable in the result type. If E1 has a signed type and nonnegative value, and E1 × 2^E2 is representable in the result type, then that is the resulting value; otherwise, the behavior is undefined.

Another point to note is that I used CHAR_BIT, a standard macro defined in limits.h that gives the number of bits in one char in a C implementation (remember: a char is always one byte in size, but the number of bits in one byte can differ between systems; it is not guaranteed to always be 8).

INT_MAX: 2^31 - 1 == 2147483647

    0111 1111 1111 1111 1111 1111 1111 1111
    ▲                                     ▲
   MSB                                   LSB

