Is Using Unsigned Integer Overflow Good Practice

Is using unsigned integer overflow good practice?

Unsigned integer overflow (in the shape of wrap-around) is routinely taken advantage of in hashing functions, and has been since the year dot.

Why is unsigned integer overflow defined behavior but signed integer overflow isn't?

The historical reason is that most C implementations (compilers) just used whatever overflow behaviour was easiest to implement with the integer representation it used. C implementations usually used the same representation used by the CPU - so the overflow behavior followed from the integer representation used by the CPU.

In practice, it is only the representations for signed values that may differ according to the implementation: one's complement, two's complement, sign-magnitude. For an unsigned type there is no reason for the standard to allow variation because there is only one obvious binary representation (the standard only allows binary representation).

Relevant quotes:

C99 6.2.6.1:3:

Values stored in unsigned bit-fields and objects of type unsigned char shall be represented using a pure binary notation.

C99 6.2.6.2:2:

If the sign bit is one, the value shall be modified in one of the following ways:

— the corresponding value with sign bit 0 is negated (sign and magnitude);

— the sign bit has the value −(2N) (two’s complement);

— the sign bit has the value −(2N − 1) (one’s complement).


Nowadays, all processors use two's complement representation, but signed arithmetic overflow remains undefined and compiler makers want it to remain undefined because they use this undefinedness to help with optimization. See for instance this blog post by Ian Lance Taylor or this complaint by Agner Fog, and the answers to his bug report.

When to use unsigned values over signed ones?

I was glad to find a good conversation on this subject, as I hadn't really given it much thought before.

In summary, signed is a good general choice - even when you're dead sure all the numbers are positive - if you're going to do arithmetic on the variable (like in a typical for loop case).

unsigned starts to make more sense when:

  • You're going to do bitwise things like masks, or
  • You're desperate to to take advantage of the sign bit for that extra positive range .

Personally, I like signed because I don't trust myself to stay consistent and avoid mixing the two types (like the article warns against).

Overflowing of Unsigned Int

unsigned numbers can't overflow, but instead wrap around using the properties of modulo.

For instance, when unsigned int is 32 bits, the result would be: (a * b) mod 2^32.


As CharlesBailey pointed out, 253473829*13482018273 may use signed multiplication before being converted, and so you should be explicit about unsigned before the multiplication:

unsigned int someint = 253473829U * 13482018273U;

Should unsigned ints be used if not necessary?

The reason to use uints is that it gives the compiler a wider variety of optimizations. For example, it may replace an instance of 'abs(x)' with 'x' if it knows that x is positive. It also opens up a variety of bitwise 'strength reductions' that only work for positive numbers. If you always mult/divide an int by a power of two, then the compiler may replace the operation with a bit shift (ie x*8 == x<<3) which tends to perform much faster. Unfortunately, this relation only holds if 'x' is positive because negative numbers are encoded in a way that precludes this. With ints, the compiler may apply this trick if it can prove that the value is always positive (or can be modified earlier in the code to be so). In the case of uints, this attribute is trivial to prove, which greatly increases the odds of it being applied.

Another example might be the equation y = 16 * x + 12. If x can be negative, then a multiply and add would be required. Yet if x is always positive, then not only can the x*16 term be replaced with x<<4, but since the term would always end with four zeros this opens up replacing the '+ 12' with a binary OR (as long as the '12' term is less than 16). The result would be y = (x<<4) | 12.

In general, the 'unsigned' qualifier gives the compiler more information about the variable, which in turn allows it to squeeze in more optimizations.

Is using an unsigned rather than signed int more likely to cause bugs? Why?

Some of the answers here mention the surprising promotion rules between signed and unsigned values, but that seems more like a problem relating to mixing signed and unsigned values, and doesn't necessarily explain why signed variables would be preferred over unsigned outside of mixing scenarios.

In my experience, outside of mixed comparisons and promotion rules, there are two primary reasons why unsigned values are bug magnets as follows.

Unsigned values have a discontinuity at zero, the most common value in programming

Both unsigned and signed integers have a discontinuities at their minimum and maximum values, where they wrap around (unsigned) or cause undefined behavior (signed). For unsigned these points are at zero and UINT_MAX. For int they are at INT_MIN and INT_MAX. Typical values of INT_MIN and INT_MAX on system with 4-byte int values are -2^31 and 2^31-1, and on such a system UINT_MAX is typically 2^32-1.

The primary bug-inducing problem with unsigned that doesn't apply to int is that it has a discontinuity at zero. Zero, of course, is a very common value in programs, along with other small values like 1,2,3. It is common to add and subtract small values, especially 1, in various constructs, and if you subtract anything from an unsigned value and it happens to be zero, you just got a massive positive value and an almost certain bug.

Consider code iterates over all values in a vector by index except the last0.5:

for (size_t i = 0; i < v.size() - 1; i++) { // do something }

This works fine until one day you pass in an empty vector. Instead of doing zero iterations, you get v.size() - 1 == a giant number1 and you'll do 4 billion iterations and almost have a buffer overflow vulnerability.

You need to write it like this:

for (size_t i = 0; i + 1 < v.size(); i++) { // do something }

So it can be "fixed" in this case, but only by carefully thinking about the unsigned nature of size_t. Sometimes you can't apply the fix above because instead of a constant one you have some variable offset you want to apply, which may be positive or negative: so which "side" of the comparison you need to put it on depends on the signedness - now the code gets really messy.

There is a similar issue with code that tries to iterate down to and including zero. Something like while (index-- > 0) works fine, but the apparently equivalent while (--index >= 0) will never terminate for an unsigned value. Your compiler might warn you when the right hand side is literal zero, but certainly not if it is a value determined at runtime.

Counterpoint

Some might argue that signed values also have two discontinuities, so why pick on unsigned? The difference is that both discontinuities are very (maximally) far away from zero. I really consider this a separate problem of "overflow", both signed and unsigned values may overflow at very large values. In many cases overflow is impossible due to constraints on the possible range of the values, and overflow of many 64-bit values may be physically impossible). Even if possible, the chance of an overflow related bug is often minuscule compared to an "at zero" bug, and overflow occurs for unsigned values too. So unsigned combines the worst of both worlds: potentially overflow with very large magnitude values, and a discontinuity at zero. Signed only has the former.

Many will argue "you lose a bit" with unsigned. This is often true - but not always (if you need to represent differences between unsigned values you'll lose that bit anyways: so many 32-bit things are limited to 2 GiB anyways, or you'll have a weird grey area where say a file can be 4 GiB, but you can't use certain APIs on the second 2 GiB half).

Even in the cases where unsigned buys you a bit: it doesn't buy you much: if you had to support more than 2 billion "things", you'll probably soon have to support more than 4 billion.

Logically, unsigned values are a subset of signed values

Mathematically, unsigned values (non-negative integers) are a subset of signed integers (just called _integers).2. Yet signed values naturally pop out of operations solely on unsigned values, such as subtraction. We might say that unsigned values aren't closed under subtraction. The same isn't true of signed values.

Want to find the "delta" between two unsigned indexes into a file? Well you better do the subtraction in the right order, or else you'll get the wrong answer. Of course, you often need a runtime check to determine the right order! When dealing with unsigned values as numbers, you'll often find that (logically) signed values keep appearing anyways, so you might as well start of with signed.

Counterpoint

As mentioned in footnote (2) above, signed values in C++ aren't actually a subset of unsigned values of the same size, so unsigned values can represent the same number of results that signed values can.

True, but the range is less useful. Consider subtraction, and unsigned numbers with a range of 0 to 2N, and signed numbers with a range of -N to N. Arbitrary subtractions result in results in the range -2N to 2N in _both cases, and either type of integer can only represent half of it. Well it turns out that the region centered around zero of -N to N is usually way more useful (contains more actual results in real world code) than the range 0 to 2N. Consider any of typical distribution other than uniform (log, zipfian, normal, whatever) and consider subtracting randomly selected values from that distribution: way more values end up in [-N, N] than [0, 2N] (indeed, resulting distribution is always centered at zero).

64-bit closes the door on many of the reasons to use unsigned values as numbers

I think the arguments above were already compelling for 32-bit values, but the overflow cases, which affect both signed and unsigned at different thresholds, do occur for 32-bit values, since "2 billion" is a number that can exceeded by many abstract and physical quantities (billions of dollars, billions of nanoseconds, arrays with billions of elements). So if someone is convinced enough by the doubling of the positive range for unsigned values, they can make the case that overflow does matter and it slightly favors unsigned.

Outside of specialized domains 64-bit values largely remove this concern. Signed 64-bit values have an upper range of 9,223,372,036,854,775,807 - more than nine quintillion. That's a lot of nanoseconds (about 292 years worth), and a lot of money. It's also a larger array than any computer is likely to have RAM in a coherent address space for a long time. So maybe 9 quintillion is enough for everybody (for now)?

When to use unsigned values

Note that the style guide doesn't forbid or even necessarily discourage use of unsigned numbers. It concludes with:

Do not use an unsigned type merely to assert that a variable is non-negative.

Indeed, there are good uses for unsigned variables:

  • When you want to treat an N-bit quantity not as an integer, but simply a "bag of bits". For example, as a bitmask or bitmap, or N boolean values or whatever. This use often goes hand-in-hand with the fixed width types like uint32_t and uint64_t since you often want to know the exact size of the variable. A hint that a particular variable deserves this treatment is that you only operate on it with with the bitwise operators such as ~, |, &, ^, >> and so on, and not with the arithmetic operations such as +, -, *, / etc.

    Unsigned is ideal here because the behavior of the bitwise operators is well-defined and standardized. Signed values have several problems, such as undefined and unspecified behavior when shifting, and an unspecified representation.

  • When you actually want modular arithmetic. Sometimes you actually want 2^N modular arithmetic. In these cases "overflow" is a feature, not a bug. Unsigned values give you what you want here since they are defined to use modular arithmetic. Signed values cannot be (easily, efficiently) used at all since they have an unspecified representation and overflow is undefined.


0.5 After I wrote this I realized this is nearly identical to Jarod's example, which I hadn't seen - and for good reason, it's a good example!

1 We're talking about size_t here so usually 2^32-1 on a 32-bit system or 2^64-1 on a 64-bit one.

2 In C++ this isn't exactly the case because unsigned values contain more values at the upper end than the corresponding signed type, but the basic problem exists that manipulating unsigned values can result in (logically) signed values, but there is no corresponding issue with signed values (since signed values already include unsigned values).

Is it a good practice to use long int to avoid overflow?

There's no simple, general, portable way to avoid integer overflow.

Using a wider integer type is not a general solution, because there's no guarantee that there is a wider integer type. It's very common for the types int and long to be the same size on some systems. 32-bit systems typically make both int and long 32 bits; even some 64-bit systems do the same thing. And some 64-bit systems make both int and long 64 bits.

You wrote:

And I have been provided with the information that the variable res cannot cross integer limit.

It's best to be careful with terminology. Although the type name int is obviously an abbreviation of the word "integer", the words are not synonymous. In C and C++, int is just one of several integer types; others include unsigned char and long long.

Presumably what you mean is that the value of res must not be outside the range of type int, which is INT_MIN to INT_MAX.

If you use long long for the result, it's likely that you can avoid overflow; you can then compare the result to the values INT_MIN and INT_MAX, and take whatever corrective action is appropriate if it's out of range. But it's still possible for both int and long long to be 64 bits. What you do with that depends on how portable your code needs to be.

You cannot safely check whether a signed integer addition or subtraction overflowed after the fact. An overflow in signed arithmetic causes undefined behavior. Typically the result wraps around, but in principle your program could crash before you have a chance to examine the result.

Some compilers might provide functions that perform safe signed arithmetic and tell you whether there was an overflow. Consult your compiler's documentation.

There are ways to test the operands before performing the operation. They're complicated, and I'm too lazy to work out the details, but briefly:

  • If the operands of + are of opposite signs, or if one operand is 0, the addition is safe.
  • If both operands are positive, the addition is safe if x does not exceed INT_MAX - y.
  • If both operands are negative, the addition is safe if y is no less than INT_MIN - y.

I do not guarantee that the above is completely correct *I just wrote it off the top of my head), but in most cases it's probably more effort than it's worth.

C/C++ use of int or unsigned int

Using unsigned can introduce programming errors that are hard to spot, and it's usually better to use signed int just to avoid them. One example would be when you decide to iterate backwards rather than forwards and write this:

for (unsigned i = 5; i >= 0; i--) {
printf("%d\n", i);
}

Another would be if you do some math inside the loop:

for (unsigned i = 0; i < 10; i++) {
for (unsigned j = 0; j < 10; j++) {
if (i - j >= 4) printf("%d %d\n", i, j);
}
}

Using unsigned introduces the potential for these sorts of bugs, and there's not really any upside.



Related Topics



Leave a reply



Submit