Performance of Unsigned VS Signed Integers

Division by powers of 2 is faster with unsigned int, because it can be optimized into a single shift instruction. With signed int, it usually takes more machine instructions, because division rounds towards zero, but shifting right rounds down (towards negative infinity). Example:

int foo(int x, unsigned y)
{
    x /= 8;
    y /= 8;
    return x + y;
}

Here is the relevant x part (signed division):

movl 8(%ebp), %eax
leal 7(%eax), %edx
testl %eax, %eax
cmovs %edx, %eax
sarl $3, %eax

And here is the relevant y part (unsigned division):

movl 12(%ebp), %edx
shrl $3, %edx
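
In C terms, the signed sequence computes the following (a sketch with an illustrative function name; it assumes arithmetic right shift on negative values, which is what the sarl above does):

int div8_signed(int x)
{
    /* add divisor - 1 to negative inputs so that the right shift,
       which rounds down, ends up rounding towards zero */
    return (x < 0 ? x + 7 : x) >> 3;   /* same result as x / 8 */
}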

Performance difference of signed and unsigned integers of non-native length

There are two parts to this question: Chandler's point about optimizations based on overflow being undefined, and the differences you found in the assembly output.

Chandler's point is that if overflow is undefined behavior, then the compiler can assume that it cannot happen. Consider the following code:

typedef int T;
void CopyInts(int *dest, const int *src) {
    T x = 0;
    for (; src[x]; ++x) {
        dest[x] = src[x];
    }
}

Here, the compiler can safely change the for loop to the following:

while (*src) {
    *dest++ = *src++;
}

That's because the compiler does not have to worry about the case where x overflows. If the compiler had to worry about x wrapping around, the effective source and destination addresses would suddenly jump back by 16 GB (2^32 elements of 4 bytes each), so the simple transformation above would not work.

At the assembly level, the above is (with GCC 7.3.0 for x86-64, -O2):

_Z8CopyIntsPiPKii:
    movl (%rsi), %edx
    testl %edx, %edx
    je .L1
    xorl %eax, %eax
.L3:
    movl %edx, (%rdi,%rax)
    addq $4, %rax
    movl (%rsi,%rax), %edx
    testl %edx, %edx
    jne .L3
.L1:
    rep ret

If we change T to be unsigned int, we get this slower code instead:

_Z8CopyIntsPiPKij:
    movl (%rsi), %eax
    testl %eax, %eax
    je .L1
    xorl %edx, %edx
    xorl %ecx, %ecx
.L3:
    movl %eax, (%rdi,%rcx)
    leal 1(%rdx), %eax
    movq %rax, %rdx
    leaq 0(,%rax,4), %rcx
    movl (%rsi,%rax,4), %eax
    testl %eax, %eax
    jne .L3
.L1:
    rep ret

Here, the compiler is keeping x as a separate variable so that overflow is handled properly.

Instead of relying on signed overflow being undefined for performance, you can use a size type with the same width as a pointer. Such a variable can only overflow at the point where the pointer arithmetic itself would overflow, which is also undefined. Hence, at least for x86-64, size_t also works as T and gives the better code.
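
The only change needed is the typedef; a sketch of the same function with a pointer-sized index:

#include <stddef.h>

typedef size_t T;
void CopyInts(int *dest, const int *src) {
    T x = 0;
    for (; src[x]; ++x) {
        dest[x] = src[x];
    }
}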

Now for the second part of your question: the add instruction. The suffixes on the add instruction come from the so-called "AT&T" style of x86 assembly language. In AT&T syntax, the operands are reversed relative to the way Intel writes instructions, and instruction sizes are disambiguated by adding a suffix to the mnemonic instead of a qualifier like dword ptr on the operand.

Example:

Intel: add dword ptr [eax], 1

AT&T: addl $1, (%eax)

These are the same instruction, just written differently. The l takes the place of dword ptr.

In the case where the suffix is missing from AT&T instructions, it's because it's not required: the size is implicit from the operands.

add $1, %eax

The l suffix is unnecessary here because the size is unambiguous: eax is a 32-bit register, so the instruction must be 32-bit.

In short, it has nothing to do with overflow. Overflow is always defined at the processor level. On some architectures, such as MIPS with the non-u instructions, overflow raises an exception, but it is still defined. C and C++ are unusual among major languages in making signed overflow undefined behavior.
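
A classic illustration of what the compiler may do with that assumption (illustrative function names; GCC and Clang at -O2 typically fold the signed version to return 1):

int always_true(int x) {
    return x + 1 > x;    // signed overflow is UB, so this "must" be true
}

int not_always_true(unsigned x) {
    return x + 1 > x;    // defined wraparound: false when x == UINT_MAX
}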

Why A / constant-int is faster when A is unsigned vs signed?

After some toying around, I believe I've tracked the source of the difference down to the standard's guarantee, since C++11, that negative integer divisions round towards zero. For the simplest case, division by two, check out the following code and the corresponding assembly (godbolt link).

constexpr int c = 2;

int signed_div(int in) {
    return in / c;
}

int unsigned_div(unsigned in) {
    return in / c;
}

Assembly:

signed_div(int):
    mov eax, edi
    shr eax, 31
    add eax, edi
    sar eax
    ret

unsigned_div(unsigned int):
    mov eax, edi
    shr eax
    ret

What do these extra instructions accomplish? shr eax, 31 (right shift by 31) just isolates the sign bit, meaning that if the input is non-negative, eax == 0, otherwise eax == 1. Then the input is added to eax. In other words, these two instructions translate to "if the input is negative, add 1 to it". The implications of the addition are the following (only for negative input):

  • If the input is even, the addition sets its least significant bit to 1, but the shift discards it. The output is not affected by this operation.

  • If the input is odd, its least significant bit was already 1, so the addition causes a carry to propagate into the rest of the bits. When the right shift occurs, the least significant bit is discarded, and the output is greater by one than it would have been without the addition. Because arithmetic right shift in two's complement rounds towards negative infinity, the result is now the same division rounded towards zero.

In short: even negative numbers aren't affected, and odd negative numbers are now rounded towards zero instead of towards negative infinity.
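
The whole trick can be written out in C++ and checked directly (a small sketch; the cast to unsigned makes the sign-bit extraction well defined, and the final shift assumes arithmetic right shift on negative values, as on x86):

#include <cassert>

int div2_signed(int x)
{
    // add the sign bit (0 or 1), then shift right by one
    return (x + (int)((unsigned)x >> 31)) >> 1;
}

int main()
{
    for (int x = -10; x <= 10; ++x)
        assert(div2_signed(x) == x / 2);
}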

For non-power-of-2 constants it gets a bit more complicated. Not all constants give the same output, but for a lot of them it looks similar to the following (godbolt link).

constexpr int c = 3;

int signed_div(int in) {
    return in / c;
}

int unsigned_div(unsigned in) {
    return in / c;
}

Assembly:

signed_div(int):
    mov eax, edi
    mov edx, 1431655766
    sar edi, 31
    imul edx
    mov eax, edx
    sub eax, edi
    ret

unsigned_div(unsigned int):
    mov eax, edi
    mov edx, -1431655765
    mul edx
    mov eax, edx
    shr eax
    ret

We don't care about the different constants in the assembly output, because they don't affect execution time. Assuming that mul and imul take the same amount of time (which I don't know for sure, but hopefully someone more knowledgeable can find a source on it), the signed version once again takes longer because it has extra instructions to handle the sign bit for negative operands.
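
The identity behind the magic constants can be checked in plain C++ (a sketch mirroring GCC's output above; it assumes arithmetic right shift for in >> 31, as GCC provides):

#include <cassert>

int div3_signed(int in)
{
    // high 32 bits of in * 1431655766, minus the sign (in >> 31 is 0 or -1)
    int hi = (int)(((long long)in * 1431655766) >> 32);
    return hi - (in >> 31);
}

unsigned div3_unsigned(unsigned in)
{
    // high 32 bits of in * 2863311531 (the constant -1431655765 above,
    // reinterpreted as unsigned), then one more shift right
    return (unsigned)(((unsigned long long)in * 2863311531u) >> 33);
}

int main()
{
    for (int x = -1000; x <= 1000; ++x)
        assert(div3_signed(x) == x / 3);
    for (unsigned x = 0; x <= 2000u; ++x)
        assert(div3_unsigned(x) == x / 3u);
}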


Notes

  • Compilation was done on godbolt using x86-64 GCC 7.3 with the -O2 flag.

  • Round-towards-zero behavior has been mandated by the standard since C++11; before that it was implementation-defined, according to this cppreference page.

Is there a difference in terms of performance between unsigned int and int on the iPhone?

It always depends:

For loop counters and limits, signed integers are a tad faster, because in C the compiler is free to assume that signed overflow never happens.

Consider this: You have a loop with an unsigned loop counter like this:

void function (unsigned int first, unsigned int last)
{
    unsigned int i;
    for (i=first; i!=last; i++)
    {
        // do something here...
    }
}

In this loop the compiler must make sure that the loop terminates even if first is greater than last, because i will wrap from UINT_MAX to 0 on overflow (just to name one example - there are other cases as well). This removes the opportunity for some loop optimizations. With signed loop counters the compiler can assume that wrap-around does not occur and may generate better code.
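
For comparison, here is the signed counterpart; the compiler may now assume that i never wraps around:

void function (int first, int last)
{
    int i;
    for (i=first; i!=last; i++)
    {
        // do something here...
    }
}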

For integer division, on the other hand, unsigned integers are a tad faster on ARM. The ARM does not have a hardware divide unit, so division is done in software, and is always done on unsigned values. You'll pay some extra cycles for the code required to turn a signed division into an unsigned one.
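
Conceptually, that extra code wraps the unsigned routine like this (a sketch; udiv is a stand-in name for the software unsigned-divide helper):

unsigned udiv(unsigned num, unsigned den);  /* software divide (stand-in) */

int sdiv(int a, int b)
{
    /* divide the magnitudes unsigned, then fix up the sign */
    unsigned ua = a < 0 ? 0u - (unsigned)a : (unsigned)a;
    unsigned ub = b < 0 ? 0u - (unsigned)b : (unsigned)b;
    unsigned q  = udiv(ua, ub);
    return ((a < 0) != (b < 0)) ? -(int)q : (int)q;
}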

For all other things, like arithmetic, logic, and loads and stores, the choice of signedness will not make any difference.


Regarding the data size: as Rune pointed out, the types are more or less of equal speed, with 32-bit types being the fastest. Bytes and halfwords sometimes need to be adjusted after processing, because they reside in a 32-bit register and the upper (unused) bits have to be sign- or zero-extended.

However, ARM CPUs have relatively small data caches and are often connected to relatively slow memory. If you can utilize the cache more efficiently by choosing smaller data types, the code may execute faster even if the theoretical cycle count goes up.

You have to experiment here.
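
A minimal sketch of such an experiment (the table sizes and the trivial summation are placeholders; the interesting part is timing both on real hardware):

#include <stdint.h>
#include <stddef.h>

#define N (1 << 20)

static uint8_t  small_tab[N];  /* 1 MB of data: friendlier to the cache */
static uint32_t large_tab[N];  /* 4 MB of data: native width, more misses */

uint32_t sum_small(void) {
    uint32_t s = 0;
    for (size_t i = 0; i < N; i++)
        s += small_tab[i];     /* each load needs a zero-extension */
    return s;
}

uint32_t sum_large(void) {
    uint32_t s = 0;
    for (size_t i = 0; i < N; i++)
        s += large_tab[i];     /* full-width load, no fix-up */
    return s;
}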

Faster comparing signed than unsigned ints

You'd better cite a source for such a claim, which is ridiculous on the face of it.

We're talking about x86_64, which means modern processors. These ALUs complete integer addition, subtraction, and comparison in a single clock cycle (cache misses will of course take longer, but they depend only on the size and memory layout of the data, not on signedness) - or effectively even less per element with SIMD.

I'd be more inclined to believe there's a slight power difference between the two types of comparison than a speed difference.

Now, it is possible that for a particular compiler, code generation is worse for one data type vs the other when targeting the x86_64 platform. But that would be a very specialized case, and not likely to apply to all x86_64 compilers. And still, I'd suspect cache effects or background processes were affecting the performance measurement (even performance counters that measure time spent per-process will be affected by a context switch invalidating caches).

CUDA C best practices: unsigned vs signed optimization

The text shows this example:

for (i = 0; i < n; i++) {
    out[i] = in[offset + stride*i];
}

It also mentions "strength reduction". The compiler is allowed to replace this with the following "pseudo-optimised-C" code:

tmp = offset;
for (i = 0; i < n; i++) {
    out[i] = in[tmp];
    tmp += stride;
}

Now, imagine a processor that only supports floating point numbers (and integers as a subset). tmp would be of type "very large number".

Now, the C standard says that computations involving unsigned operands can never overflow, but instead are reduced modulo the largest value + 1. That means that in the case of unsigned i the compiler has to do this:

tmp = offset;
for (i = 0; i < n; i++) {
    out[i] = in[tmp];
    tmp += stride;
    if (tmp > UINT_MAX)
    {
        tmp -= UINT_MAX + 1;
    }
}

But in the case of a signed integer, the compiler can do whatever it wants. It doesn't need to check for overflow - if the value does overflow, that's the developer's problem (it could cause an exception, or produce erroneous values). So the code can be faster.

Speed difference between using int and unsigned int when mixed with doubles

Here's why: many common architectures (including x86) have a hardware instruction to convert a signed int to double, but no hardware conversion from unsigned to double, so the compiler has to synthesize the conversion in software. Furthermore, the only unsigned multiply on Intel is a full-width multiply, whereas signed multiplies can use the signed multiply-low instruction.

GCC's software conversion from unsigned int to double may very well be suboptimal (it almost certainly is, given the magnitude of the slowdown that you observed), but it is expected behavior for the code to be faster when using signed integers.

Assuming a smart compiler, the difference should be much smaller on a 64-bit system, because a 64-bit signed integer -> double conversion can be used to efficiently do a 32-bit unsigned conversion.
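
In code, that trick is just a widening cast (a sketch; on x86-64 this can compile to a single 64-bit cvtsi2sd):

double u32_to_double(unsigned x)
{
    /* zero-extend to 64 bits first: every 32-bit unsigned value fits in a
       signed 64-bit integer, so the signed conversion is exact */
    return (double)(long long)x;
}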

Edit: to illustrate, this:

sum += *data * x;

if the integer variables are signed, should compile into something along these lines:

mov      (data), %eax
imul     %ecx, %eax
cvtsi2sd %eax, %xmm1
addsd    %xmm1, %xmm0

On the other hand, if the integer variables are unsigned, cvtsi2sd can't be used to do the conversion, so a software workaround is required. I would expect to see something like this:

mov      (data), %eax
mul      %ecx              // might be slower than imul
cvtsi2sd %eax, %xmm1       // convert as though signed integer
test     %eax, %eax        // check if the high bit was set
jge      1f                // if it was, we need to adjust the
addsd    (2^32), %xmm1     // converted value by adding 2^32
1:  addsd %xmm1, %xmm0

That would be "acceptable" codegen for the unsigned -> double conversion; it could easily be worse.

All of this is assuming floating-point code generation to SSE (I believe this is the default on the Ubuntu tools, but I could be wrong).


