Efficient 128-Bit Addition Using Carry Flag

Actually, GCC will use the carry automatically if you write your code carefully...

Current GCC can optimize hiWord += (loWord < loAdd); into add/adc (x86's add-with-carry); this optimization has been available since GCC 5.3. A sketch of the pattern follows the links below.

  • With separate uint64_t chunks in 64-bit mode: https://godbolt.org/z/S2kGRz.
  • And the same thing in 32-bit mode with uint32_t chunks: https://godbolt.org/z/9FC9vc
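
The pattern those links demonstrate looks roughly like this (a sketch of my own; the Godbolt links have the exact code):

#include <stdint.h>

/* GCC 5.3+ compiles this carry-propagation idiom to add/adc on x86-64
   (illustrative names; assumes hiWord and loWord don't alias). */
void add128(uint64_t *hiWord, uint64_t *loWord, uint64_t hiAdd, uint64_t loAdd)
{
    *loWord += loAdd;
    *hiWord += hiAdd + (*loWord < loAdd);
}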

(editor's note: Of course the hard part is writing a correct full-adder with carry in and carry out; that's hard in C and GCC doesn't know how to optimize any that I've seen.)

Also related: https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html can give you carry-out from unsigned, or signed-overflow detection.
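
For example, a full adder with both carry-in and carry-out can be sketched with __builtin_add_overflow (my own sketch; as the note above says, GCC won't necessarily turn it into a single adc):

#include <stdbool.h>
#include <stdint.h>

/* sum = a + b + carry_in; *carry_out gets 0 or 1. */
static inline uint64_t add64_full(uint64_t a, uint64_t b,
                                  unsigned carry_in, unsigned *carry_out)
{
    uint64_t sum;
    bool c1 = __builtin_add_overflow(a, b, &sum);
    bool c2 = __builtin_add_overflow(sum, (uint64_t)carry_in, &sum);
    *carry_out = c1 | c2;   /* the two adds can never both carry */
    return sum;
}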


Older GCC, like GCC 4.5, will branch or setc on the carry-out from an add instead of using adc, and only uses adc (add-with-carry) on the flag result from an add if you use __int128 (or uint64_t on a 32-bit target). See Is there a 128 bit integer in gcc? __int128 is only available on 64-bit targets, and has been supported since GCC 4.1.

I compiled this code with gcc -O2 -Wall -Werror -S:

void increment128_1(unsigned long &hiWord, unsigned long &loWord)
{
    const unsigned long hiAdd = 0x0000062DE49B5241;
    const unsigned long loAdd = 0x85DC198BCDD714BA;

    loWord += loAdd;
    if (loWord < loAdd) ++hiWord; // test_and_add_carry
    hiWord += hiAdd;
}

void increment128_2(unsigned long &hiWord, unsigned long &loWord)
{
    const unsigned long hiAdd = 0x0000062DE49B5241;
    const unsigned long loAdd = 0x85DC198BCDD714BA;

    loWord += loAdd;
    hiWord += hiAdd;
    hiWord += (loWord < loAdd); // test_and_add_carry
}

This is the assembly for increment128_1:

        .cfi_startproc
        movabsq $-8801131483544218438, %rax
        addq    (%rsi), %rax
        movabsq $-8801131483544218439, %rdx
        cmpq    %rdx, %rax
        movq    %rax, (%rsi)
        ja      .L5
        movq    (%rdi), %rax
        addq    $1, %rax
.L3:
        movabsq $6794178679361, %rdx
        addq    %rdx, %rax
        movq    %rax, (%rdi)
        ret
.L5:
        movq    (%rdi), %rax
        jmp     .L3

...and this is the assembly for increment128_2:

        movabsq $-8801131483544218438, %rax
        addq    %rax, (%rsi)
        movabsq $6794178679361, %rax
        addq    (%rdi), %rax
        movabsq $-8801131483544218439, %rdx
        movq    %rax, (%rdi)
        cmpq    %rdx, (%rsi)
        setbe   %dl
        movzbl  %dl, %edx
        leaq    (%rdx,%rax), %rax
        movq    %rax, (%rdi)
        ret

Note the lack of conditional branches in the second version.

[edit]

Also, references are often bad for performance, because GCC has to worry about aliasing... It is often better to just pass things by value. Consider:

struct my_uint128_t {
    unsigned long hi;
    unsigned long lo;
};

my_uint128_t increment128_3(my_uint128_t x)
{
    const unsigned long hiAdd = 0x0000062DE49B5241;
    const unsigned long loAdd = 0x85DC198BCDD714BA;

    x.lo += loAdd;
    x.hi += hiAdd + (x.lo < loAdd);
    return x;
}

Assembly:

        .cfi_startproc
        movabsq $-8801131483544218438, %rdx
        movabsq $-8801131483544218439, %rax
        movabsq $6794178679362, %rcx
        addq    %rsi, %rdx
        cmpq    %rdx, %rax
        sbbq    %rax, %rax
        addq    %rcx, %rax
        addq    %rdi, %rax
        ret

This is actually the tightest code of the three.

...OK so none of them actually used the carry automatically :-). But they do avoid the conditional branch, which I bet is the slow part (since the branch prediction logic will get it wrong half the time).

[edit 2]

And one more, which I stumbled across doing a little searching. Did you know GCC has built-in support for 128-bit integers?

typedef unsigned long my_uint128_t __attribute__ ((mode(TI)));

my_uint128_t increment128_4(my_uint128_t x)
{
    const my_uint128_t hiAdd = 0x0000062DE49B5241;
    const unsigned long loAdd = 0x85DC198BCDD714BA;

    return x + (hiAdd << 64) + loAdd;
}

The assembly for this one is about as good as it gets:

        .cfi_startproc
        movabsq $-8801131483544218438, %rax
        movabsq $6794178679361, %rdx
        pushq   %rbx
        .cfi_def_cfa_offset 16
        addq    %rdi, %rax
        adcq    %rsi, %rdx
        popq    %rbx
        .cfi_offset 3, -16
        .cfi_def_cfa_offset 8
        ret

(Not sure where the push/pop of rbx came from, but this is still not bad.)

All of these are with GCC 4.5.2, by the way.
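
For reference, newer GCC also accepts the documented spelling unsigned __int128 directly (64-bit targets only), with no mode(TI) attribute needed. A minimal equivalent of increment128_4, with a function name of my choosing:

unsigned __int128 increment128_5(unsigned __int128 x)
{
    const unsigned __int128 hiAdd = 0x0000062DE49B5241;
    const unsigned long     loAdd = 0x85DC198BCDD714BA;

    return x + (hiAdd << 64) + loAdd;
}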

carry flag on MUL overflow bigger than 2^128?

Refer to Intel's manual: https://www.felixcloutier.com/x86/mul#flags-affected

MUL clears the CF if the upper half of the result (RDX, EDX, DX, or AH) is zero, else it's set.

mul has two inputs of the same width. Only the output is double-width.

CF is not used to measure overflow of the whole result: two 64-bit multiplicands cannot produce a 129-bit result ((2^64 - 1)^2 < 2^128), so an overflow in that sense is not possible.

CF actually indicates overflow of the low half. The original 8086 didn't have a non-widening multiply (like imul ecx, esi), so you had to use mul or imul even if you didn't care about the high-half result. Knowing that the result still fit in one register was potentially useful, especially in a 16-bit world where values wider than the widest available register were more common than on x86-64.
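
To illustrate (my own sketch, in the AT&T syntax used by the listings above): a branch on CF after a widening multiply tells you whether the product fit in the low register.

        # rdx:rax = rdi * rsi
        movq    %rdi, %rax
        mulq    %rsi
        jc      .Lneeds_high_half   # CF set iff rdx (the high half) != 0; label name is illustrative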

Is it possible to use SSE and SSE2 to make a 128-bit wide integer?

SIMD is meant to work on multiple small values at the same time, hence there is no carry over to the higher unit; you must handle that manually. In SSE2 there's no carry flag, but you can easily calculate the carry as carry = sum < a (or carry = sum < b). Worse yet, SSE2 doesn't have 64-bit comparisons either, so you need a workaround like the one below.

Here is untested, unoptimized C code based on the idea above:

#include <emmintrin.h>  // SSE2
#include <stdbool.h>

// Unsigned 64-bit "less than" for the low element, emulated with signed
// 32-bit compares (SSE2 has neither 64-bit nor unsigned compares).
inline bool lessthan(__m128i a, __m128i b)
{
    // Flip the sign bits so the signed compares act as unsigned ones.
    a = _mm_xor_si128(a, _mm_set1_epi32(0x80000000));
    b = _mm_xor_si128(b, _mm_set1_epi32(0x80000000));
    __m128i t = _mm_cmplt_epi32(a, b);
    __m128i u = _mm_cmpgt_epi32(a, b);
    // Combine the per-word results into a 64-bit comparison: the low
    // word only decides when the high words are equal.
    __m128i z = _mm_or_si128(t, _mm_shuffle_epi32(t, 177));
    z = _mm_andnot_si128(_mm_shuffle_epi32(u, 245), z);
    return _mm_cvtsi128_si32(z) & 1;
}

inline __m128i addi128(__m128i a, __m128i b)
{
    __m128i sum = _mm_add_epi64(a, b);  // adds both halves; no carry between them
    __m128i mask = _mm_set1_epi64x(0x8000000000000000);
    // The low half overflowed iff sum < a; if so, bump the high half.
    if (lessthan(_mm_xor_si128(mask, sum), _mm_xor_si128(mask, a)))
    {
        __m128i ONE = _mm_set_epi64x(1, 0);  // 1 in the high 64-bit lane
        sum = _mm_add_epi64(sum, ONE);
    }

    return sum;
}

As you can see, the code requires many more instructions, and even after optimizing it may still be much longer than a simple add/adc pair on x86-64 (or four instructions on x86).
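
For comparison, the scalar sequence referred to above (a sketch in AT&T syntax; the register assignment is mine):

        # 128-bit add on x86-64: rdx:rax += rcx:rbx
        addq    %rbx, %rax      # low halves; the carry-out lands in CF
        adcq    %rcx, %rdx      # high halves plus CF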


SSE2 will help, though, if you have multiple 128-bit integers to add in parallel. However, you need to arrange the high and low parts of the values so that all the low parts can be added at once and all the high parts at once, as in the sketch below.
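
A hedged sketch of that layout (function and variable names are mine, and it is untested in the same spirit as the code above): the low halves of two 128-bit numbers sit in one vector and the high halves in another, so each _mm_add_epi64 processes two numbers at once. The per-lane carry out of the low addition is ((a & b) | ((a ^ b) & ~sum)) at bit 63, which needs only logic ops and a shift:

#include <emmintrin.h>  /* SSE2 */

/* alo/ahi and blo/bhi each hold the low/high 64-bit halves of two numbers. */
static inline void add128x2(__m128i alo, __m128i ahi,
                            __m128i blo, __m128i bhi,
                            __m128i *slo, __m128i *shi)
{
    __m128i lo    = _mm_add_epi64(alo, blo);
    __m128i axb   = _mm_xor_si128(alo, blo);
    /* carry = ((a & b) | ((a ^ b) & ~sum)) >> 63, per 64-bit lane */
    __m128i carry = _mm_or_si128(_mm_and_si128(alo, blo),
                                 _mm_andnot_si128(lo, axb));
    carry = _mm_srli_epi64(carry, 63);
    *slo  = lo;
    *shi  = _mm_add_epi64(_mm_add_epi64(ahi, bhi), carry);
}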

See also

  • practical BigNum AVX/SSE possible?
  • Can long integer routines benefit from SSE?

An efficient way to do basic 128 bit integer calculations in C++?

Update: Since the OP hasn't accepted an answer yet <hint><hint>, I've attached a bit more code.

Using the libraries discussed above is probably a good idea. While you might only need a few functions today, eventually you may find that you need one more. Then one more after that. Until eventually you end up writing, debugging and maintaining your own 128-bit math library, which is a waste of your time and effort.

That said. If you are determined to roll your own:

1) The CUDA question you asked previously already has C code for multiplication. Was there some problem with it?

2) The shift probably won't benefit from using asm, so a C solution makes sense to me here as well. Although if performance is really an issue here, I'd see if the Edison supports SHLD/SHRD, which might make this a bit faster. Otherwise, maybe an approach like this?

my_uint128_t lshift_uint128 (const my_uint128_t a, int b)
{
    my_uint128_t res;
    if (b < 32) {
        /* note: assumes 0 < b, since a shift by (32 - 0) is undefined */
        res.x = a.x << b;
        res.y = (a.y << b) | (a.x >> (32 - b));
        res.z = (a.z << b) | (a.y >> (32 - b));
        res.w = (a.w << b) | (a.z >> (32 - b));
    } else if (b < 64) {
        ...
    }

    return res;
}

Update: Since it appears that the Edison may support SHLD/SHRD, here's an alternative which might be more performant than the C code above. As with all code purporting to be faster, you should test it.

inline
unsigned int __shld(unsigned int into, unsigned int from, unsigned int c)
{
    unsigned int res;

    if (__builtin_constant_p(into) &&
        __builtin_constant_p(from) &&
        __builtin_constant_p(c))
    {
        // All inputs known at compile time: let the compiler fold it.
        res = (into << c) | (from >> (32 - c));
    }
    else
    {
        // shld: shift 'into' left by c, filling the vacated bits from 'from'.
        asm("shld %b3, %2, %0"
            : "=rm" (res)
            : "0" (into), "r" (from), "ic" (c)
            : "cc");
    }

    return res;
}

inline
unsigned int __shrd(unsigned int into, unsigned int from, unsigned int c)
{
    unsigned int res;

    if (__builtin_constant_p(into) &&
        __builtin_constant_p(from) &&
        __builtin_constant_p(c))
    {
        // All inputs known at compile time: let the compiler fold it.
        res = (into >> c) | (from << (32 - c));
    }
    else
    {
        // shrd: shift 'into' right by c, filling the vacated bits from 'from'.
        asm("shrd %b3, %2, %0"
            : "=rm" (res)
            : "0" (into), "r" (from), "ic" (c)
            : "cc");
    }

    return res;
}

#include <string.h>  /* memset */

my_uint128_t lshift_uint128 (const my_uint128_t a, unsigned int b)
{
    my_uint128_t res;

    if (b < 32) {
        res.x = a.x << b;
        res.y = __shld(a.y, a.x, b);
        res.z = __shld(a.z, a.y, b);
        res.w = __shld(a.w, a.z, b);
    } else if (b < 64) {
        res.x = 0;
        res.y = a.x << (b - 32);
        res.z = __shld(a.y, a.x, b - 32);
        res.w = __shld(a.z, a.y, b - 32);
    } else if (b < 96) {
        res.x = 0;
        res.y = 0;
        res.z = a.x << (b - 64);
        res.w = __shld(a.y, a.x, b - 64);
    } else if (b < 128) {
        res.x = 0;
        res.y = 0;
        res.z = 0;
        res.w = a.x << (b - 96);
    } else {
        memset(&res, 0, sizeof(res));
    }

    return res;
}

my_uint128_t rshift_uint128 (const my_uint128_t a, unsigned int b)
{
    my_uint128_t res;

    if (b < 32) {
        res.x = __shrd(a.x, a.y, b);
        res.y = __shrd(a.y, a.z, b);
        res.z = __shrd(a.z, a.w, b);
        res.w = a.w >> b;
    } else if (b < 64) {
        res.x = __shrd(a.y, a.z, b - 32);
        res.y = __shrd(a.z, a.w, b - 32);
        res.z = a.w >> (b - 32);
        res.w = 0;
    } else if (b < 96) {
        res.x = __shrd(a.z, a.w, b - 64);
        res.y = a.w >> (b - 64);
        res.z = 0;
        res.w = 0;
    } else if (b < 128) {
        res.x = a.w >> (b - 96);
        res.y = 0;
        res.z = 0;
        res.w = 0;
    } else {
        memset(&res, 0, sizeof(res));
    }

    return res;
}
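
A quick usage sketch for the shift helpers (the values and expected results are mine; x is the least significant word):

my_uint128_t one = { 1, 0, 0, 0 };          /* the value 1 */
my_uint128_t s = lshift_uint128(one, 100);  /* 96 <= 100 < 128: lands in w */
/* expect: s.x == s.y == s.z == 0 and s.w == 1u << 4 */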

3) The addition may benefit from asm. You could try this:

struct my_uint128_t
{
    unsigned int x;   // least significant 32 bits
    unsigned int y;
    unsigned int z;
    unsigned int w;   // most significant 32 bits
};

my_uint128_t add_uint128 (const my_uint128_t a, const my_uint128_t b)
{
    my_uint128_t res;

    // Chain the carry through four 32-bit adds: addl sets CF and each
    // adcl consumes and regenerates it. The outputs start as copies of
    // a's limbs (matching constraints "0".."3").
    asm ("addl %5, %[resx]\n\t"
         "adcl %7, %[resy]\n\t"
         "adcl %9, %[resz]\n\t"
         "adcl %11, %[resw]\n\t"
         : [resx] "=&r" (res.x), [resy] "=&r" (res.y),
           [resz] "=&r" (res.z), [resw] "=&r" (res.w)
         : "%0" (a.x), "irm" (b.x),
           "%1" (a.y), "irm" (b.y),
           "%2" (a.z), "irm" (b.z),
           "%3" (a.w), "irm" (b.w)
         : "cc");

    return res;
}

I just dashed this off, so use at your own risk. I don't have an Edison, but this works with x86.

Update: If you are just doing accumulation (think to += from instead of the c = a + b form above), this code might serve you better:

inline
void addto_uint128 (my_uint128_t *to, const my_uint128_t from)
{
    // In-place accumulate: "+&r" reads and writes each limb of *to.
    asm ("addl %[fromx], %[tox]\n\t"
         "adcl %[fromy], %[toy]\n\t"
         "adcl %[fromz], %[toz]\n\t"
         "adcl %[fromw], %[tow]\n\t"
         : [tox] "+&r" (to->x), [toy] "+&r" (to->y),
           [toz] "+&r" (to->z), [tow] "+&r" (to->w)
         : [fromx] "irm" (from.x), [fromy] "irm" (from.y),
           [fromz] "irm" (from.z), [fromw] "irm" (from.w)
         : "cc");
}

big integer addition without carry flag

You can use "nails" (a term from GMP): rather than using all 64 bits of a uint64_t when representing a number, use only 63 of them, with the top bit zero. That way you can detect overflow with a simple bit-shift. You may even want less than 63.

Or, you can do half-word arithmetic. If you can do 64-bit arithmetic, represent your number as an array of uint32_ts (or equivalently, split 64-bit words into upper and lower 32-bit chunks). Then, when doing arithmetic operations on these 32-bit integers, you can first promote to 64 bits, do the arithmetic there, and then convert back. This lets you detect carry, and it's also good for multiplication if you don't have a "multiply hi" instruction.
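
And a sketch of the half-word approach (again my own illustration): 32-bit limbs, widened to 64 bits for the addition, so the carry is simply the upper half of each intermediate sum.

#include <stdint.h>

/* r = a + b over n 32-bit limbs; returns the final carry (0 or 1). */
static uint32_t add_halfwords(uint32_t *r, const uint32_t *a,
                              const uint32_t *b, int n)
{
    uint64_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint64_t s = (uint64_t)a[i] + b[i] + carry;
        r[i]  = (uint32_t)s;   /* low 32 bits */
        carry = s >> 32;       /* 0 or 1 */
    }
    return (uint32_t)carry;
}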

As the other answer indicates, you can detect overflow in an unsigned addition by:

uint64_t sum = a + b;
uint64_t carry = sum < a;

As an aside, while in practice this will also work in signed arithmetic, you have two issues:

  • It's more complex
  • Technically, overflowing a signed integer is undefined behavior

so you're usually better off sticking to unsigned numbers.

Carry flag set on signed arithmetic

The addition is done in two's complement, and the result is larger than 8 bits: 0x80 + 0x80 = 0x100, or in binary:

     0b10000000
  +  0b10000000
  -------------
  = 0b100000000

and thus the result is 0 and the Carry Flag is set.

Don't think of -128 as a negative number; rather, think of it as the positive number 128, which has the same bit pattern as the two's complement of your negative number, then carry out the unsigned addition. (And therefore the carry flag is set.)

Setting and Clearing the Carry Flag with Addition and Subtraction

The best way to clear the carry flag is to use the CLC instruction; and the best way to set the carry flag is to use the STC instruction.

If you must do it with addition or subtraction for some bizarre reason, the least worst method (for code size) is likely sub eax,eax to clear the carry flag, and xor eax,eax followed by sub eax,1 to set the carry flag.

Note: to set the carry flag, stc; sbb eax,eax would be even less bad, but that solution probably makes the pointlessness too obvious.
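
Spelled out (the same sequences as above, in Intel syntax; comments mine):

; clear CF without clc (also zeroes eax and clobbers the other flags):
sub  eax, eax

; set CF without stc:
xor  eax, eax    ; eax = 0, CF = 0
sub  eax, 1      ; 0 - 1 borrows: CF = 1, eax = -1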


