Visual C++ X64 Add with Carry

There is now an intrinsic for ADC in MSVC: _addcarry_u64. The following code

#include <inttypes.h>
#include <intrin.h>
#include <stdio.h>

typedef struct {
    uint64_t x1;
    uint64_t x2;
    uint64_t x3;
    uint64_t x4;
} uint256;

void add256(uint256 *x, uint256 *y) {
    unsigned char c = 0;
    c = _addcarry_u64(c, x->x1, y->x1, &x->x1);
    c = _addcarry_u64(c, x->x2, y->x2, &x->x2);
    c = _addcarry_u64(c, x->x3, y->x3, &x->x3);
    _addcarry_u64(c, x->x4, y->x4, &x->x4);
}

int main() {
    uint256 x, y;
    // x = 2^192 - 1 (the three low limbs are all ones), y = 2
    x.x1 = x.x2 = x.x3 = -1; x.x4 = 0;
    y.x1 = 2; y.x2 = y.x3 = y.x4 = 0;

    printf(" %016" PRIx64 "%016" PRIx64 "%016" PRIx64 "%016" PRIx64 "\n", x.x4, x.x3, x.x2, x.x1);
    printf("+");
    printf("%016" PRIx64 "%016" PRIx64 "%016" PRIx64 "%016" PRIx64 "\n", y.x4, y.x3, y.x2, y.x1);
    add256(&x, &y);
    printf("=");
    printf("%016" PRIx64 "%016" PRIx64 "%016" PRIx64 "%016" PRIx64 "\n", x.x4, x.x3, x.x2, x.x1);
}

produces the following assembly output with Visual Studio Express 2013:

mov rdx, QWORD PTR x$[rsp]
mov r8, QWORD PTR x$[rsp+8] 
mov r9, QWORD PTR x$[rsp+16]
mov rax, QWORD PTR x$[rsp+24]
add rdx, QWORD PTR y$[rsp]
adc r8, QWORD PTR y$[rsp+8]
adc r9, QWORD PTR y$[rsp+16]
adc rax, QWORD PTR y$[rsp+24]

which has one add and three adc as expected.

Edit:

There seems to be some confusion about what _addcarry_u64 does. Microsoft's documentation for it, which I linked to at the start of this answer, shows that it does not require any special hardware. It produces adc and works on all x86-64 processors (and _addcarry_u32 works on even older processors). It works fine on the Ivy Bridge system I tested it on.

However, _addcarryx_u64 does require ADX (as shown in Microsoft's documentation), and indeed it fails to run on my Ivy Bridge system.
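
Because _addcarryx_u64 needs ADX, a program that wants to use it could check for the feature at run time first. The sketch below is one possible way to do that, not something taken from the answer. It assumes the ADX flag is reported in CPUID leaf 7, sub-leaf 0, EBX bit 19; the helper name cpu_has_adx is hypothetical. It uses MSVC's __cpuid/__cpuidex intrinsics or, on GCC and Clang, __get_cpuid_count from <cpuid.h> (assumed to be available in the toolchain). On non-x86 targets it simply reports 0.

```c
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#define ADX_X86_MSVC 1
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define ADX_X86_GNU 1
#endif

/* Hypothetical helper: returns 1 if the CPU reports ADX support, else 0. */
static int cpu_has_adx(void)
{
#if defined(ADX_X86_MSVC)
    int regs[4];
    __cpuid(regs, 0);                 /* regs[0] = highest basic CPUID leaf */
    if (regs[0] < 7) return 0;        /* leaf 7 not available */
    __cpuidex(regs, 7, 0);            /* leaf 7, sub-leaf 0 */
    return (regs[1] >> 19) & 1;       /* EBX bit 19 = ADX (assumption stated above) */
#elif defined(ADX_X86_GNU)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return 0;
    return (int)((ebx >> 19) & 1u);   /* EBX bit 19 = ADX (assumption stated above) */
#else
    return 0;                         /* not an x86 target */
#endif
}
```

A caller would then pick the _addcarry_u64 path whenever cpu_has_adx() returns 0.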

assembly and Visual C++ Express 2010 64 Bit

The x64 C++ compiler doesn't support inline assembly; you need to put your assembly code in a separate file.

There is no built-in intrinsic for adc, but you can easily emulate it.
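
One way to emulate it in plain C is to detect unsigned wrap-around after each addition. The sketch below is a minimal illustration, not the code of any particular library. The names addcarry64 and add256_portable are hypothetical, and the limbs are assumed to be stored least significant first.

```c
#include <stdint.h>

/* Hypothetical stand-in for adc: computes a + b + carry_in,
   stores the low 64 bits in *out and returns the carry-out (0 or 1). */
static unsigned char addcarry64(unsigned char carry_in, uint64_t a, uint64_t b,
                                uint64_t *out)
{
    uint64_t t = a + b;                        /* wraps modulo 2^64 */
    unsigned char c1 = (unsigned char)(t < a); /* carry out of a + b */
    uint64_t s = t + (carry_in ? 1u : 0u);
    unsigned char c2 = (unsigned char)(s < t); /* carry out of adding carry_in */
    *out = s;
    return (unsigned char)(c1 | c2);           /* both cannot be set at once */
}

/* 256-bit x += y, with x[0] and y[0] the least significant limbs.
   Returns the carry out of the top limb. */
static unsigned char add256_portable(uint64_t x[4], const uint64_t y[4])
{
    unsigned char c = 0;
    for (int i = 0; i < 4; ++i)
        c = addcarry64(c, x[i], y[i], &x[i]);
    return c;
}
```

With x = 2^192 - 1 and y = 2, the limbs come out as 1, 0, 0, 1 from least to most significant. Whether a compiler turns this pattern into adc instructions depends on the compiler and the optimization level.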

Embed assembler to manipulate 64-bit registers in portable C++

Just to give you a taste of the obstacles that lie in your path, here is a simple inline assembler function, in two dialects. First, the Borland C++ Builder version (I think this compiles under MSVC++ too):

int BNASM_AddScalar (DWORD* result, DWORD x)
  {
  int carry = 0 ;
  __asm
    {
    mov     ebx,result
    xor     eax,eax
    mov     ecx,x
    add     [ebx],ecx
    adc     carry,eax    // Return the carry flag
    }
  return carry ;
  }

Now, the g++ version:

int BNASM_AddScalar (DWORD* result, DWORD x)
  {
  int carry = 0 ;
  asm volatile (
"    addl    %%ecx,(%%edx)\n"
"    adcl    $0,%%eax\n"    // Return the carry flag
: "+a"(carry)         // Output (and input): carry in eax
: "d"(result), "c"(x) // Input: result in edx and x in ecx
) ;
  return carry ;
  }

As you can see, the differences are major. And there is no way around them. These are from a large integer arithmetic library that I wrote for a 32-bit environment.
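
For comparison, the same operation can be written without any assembler by widening to 64 bits and reading the carry from bit 32. This is only a sketch: the name add_scalar_portable is hypothetical, and DWORD is assumed to be a 32-bit unsigned integer (uint32_t here).

```c
#include <stdint.h>

/* Hypothetical portable counterpart of BNASM_AddScalar:
   adds x to *result and returns the carry (0 or 1). */
static int add_scalar_portable(uint32_t *result, uint32_t x)
{
    uint64_t sum = (uint64_t)*result + x;  /* cannot overflow 64 bits */
    *result = (uint32_t)sum;               /* keep the low 32 bits */
    return (int)(sum >> 32);               /* bit 32 is the carry */
}
```

This trades control over the exact instructions for code that builds unchanged with any C compiler.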

As for embedding 64-bit instructions in a 32-bit executable, I think this is forbidden. As I understand it, a 32-bit executable runs in 32-bit mode, where any 64-bit instruction just generates a trap.

how to use SSE instruction in the x64 architecture in c++?

The modern method to use assembly instructions in C/C++ is to use intrinsics. Intrinsics have several advantages over inline assembly such as:

You don't have to worry about 32-bit versus 64-bit mode.
You don't need to worry about registers and register spilling.
You don't need to worry about AT&T versus Intel syntax.
You don't need to worry about calling conventions.
The compiler can optimize intrinsics further, which it won't do with inline assembly.
Intrinsics are compatible (for most intrinsics) across GCC, MSVC, ICC, and Clang.

I also like intrinsics because it's easy to emulate hardware with them, for example to prepare for AVX512.

You can find the list of intrinsics MSVC supports here. Intel also has better information on intrinsics, which mostly agrees with MSVC's.

But sometimes you still need or want inline assembly. In my opinion it's really stupid that Microsoft does not allow inline assembly in 64-bit mode. This means they have to define intrinsics for several things that other compilers can still do with inline assembly. One example is CPUID: Visual Studio has an intrinsic for CPUID, but GCC still uses inline assembly. Another example is adc: for a long time MSVC had no intrinsic for adc, but now it appears it does.

Additionally, because they have to create intrinsics for everything, it causes confusion. They have to create an intrinsic for mulx, but Intel's documentation for this is wrong. They also have to create intrinsics for adcx and adox, but their documentation disagrees with Intel's, and the generated assembly shows that no intrinsic produces adox. So once again the programmer is left waiting for an intrinsic for adox. If they had just allowed inline assembly, there would be no problem.

But back to SSE. With few exceptions, e.g. _mm_set_epi64x in 32-bit mode on MSVC (I don't know if that's been fixed), the SSE/AVX/AVX2 intrinsics work as expected with MSVC, GCC, ICC, and Clang.
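
As a small illustration of the intrinsic style, the sketch below adds four 32-bit integers with SSE2. The helper name add4_i32 is hypothetical. The guard assumes that either __SSE2__ (GCC, Clang, ICC) or _M_X64 (MSVC) indicates that SSE2 is available, and a scalar loop is used otherwise.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for four 32-bit integers. */
static void add4_i32(const int32_t a[4], const int32_t b[4], int32_t dst[4])
{
#ifdef HAVE_SSE2
    /* Unaligned loads and stores, so the arrays need no special alignment. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_add_epi32(va, vb));
#else
    /* Scalar fallback with the same wrap-around behavior. */
    for (int i = 0; i < 4; ++i)
        dst[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
#endif
}
```

The same source builds in 32-bit and 64-bit mode, and no registers are named anywhere; the compiler chooses them.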

_umul128 on Windows 32 bits

I found the following code (from xmrig), which seems to do the job just fine:

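#include <stdint.h>

// Portable 64x64 -> 128-bit multiply: returns the low 64 bits of the product
// and stores the high 64 bits in *product_hi.
static inline uint64_t __umul128(uint64_t multiplier, uint64_t multiplicand,
    uint64_t *product_hi)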
{
    // multiplier   = ab = a * 2^32 + b
    // multiplicand = cd = c * 2^32 + d
    // ab * cd = a * c * 2^64 + (a * d + b * c) * 2^32 + b * d
    uint64_t a = multiplier >> 32;
    uint64_t b = multiplier & 0xFFFFFFFF;
    uint64_t c = multiplicand >> 32;
    uint64_t d = multiplicand & 0xFFFFFFFF;

    //uint64_t ac = a * c;
    uint64_t ad = a * d;
    //uint64_t bc = b * c;
    uint64_t bd = b * d;

    uint64_t adbc = ad + (b * c);
    uint64_t adbc_carry = adbc < ad ? 1 : 0;

    // multiplier * multiplicand = product_hi * 2^64 + product_lo
    uint64_t product_lo = bd + (adbc << 32);
    uint64_t product_lo_carry = product_lo < bd ? 1 : 0;
    *product_hi = (a * c) + (adbc >> 32) + (adbc_carry << 32) + product_lo_carry;

    return product_lo;
}
