Reasonably Portable Way to Get Top 64-Bits from 64X64 Bit Multiply

Reasonably portable way to get top 64-bits from 64x64 bit multiply?

This answer shows how to get the (exact) top 64-bits from a 64x64 bit multiply on a system that doesn't support 128-bit integers. The answer by @amdn will give better performance on systems that do support 128-bit integers.

The diagram below shows one method for computing a 128-bit product from two 64-bit numbers. Each black rectangle represents a 64-bit number. The 64-bit inputs to the method, X and Y, are divided into 32-bits chunks labeled a, b, c, and d. Then four 32x32 bit multiplications are performed, giving four 64-bit products labeled a*c, b*c, a*d, and b*d. The four products must be shifted and added to compute the final answer.

Sample Image

Note that the lower 32-bits of the 128-bit product are solely determined by the lower 32-bits of partial product b*d. The next 32-bits are determined by the lower 32-bits of the following

mid34 = ((b*c) & 0xffffffff) + ((a*d) & 0xffffffff) + ((b*d) >> 32);

Note that mid34 is the sum of three 32-bit numbers and therefore is in fact a 34-bit sum. The upper two bits of mid34 act as a carry into the top 64-bits of the 64x64 bit multiply.

Which brings us to the demo code. The top64 function computes the upper 64-bits of a 64x64 multiply. It's a little verbose to allow for the calculation of the lower 64-bits to be shown in a comment. The main function takes advantage of 128-bit integers to verify the results with a simple test case. Further testing is left as an exercise for the reader.

#include <stdio.h>
#include <stdint.h>

typedef unsigned __int128 uint128_t;

uint64_t top64( uint64_t x, uint64_t y )
{
    uint64_t a = x >> 32, b = x & 0xffffffff;
    uint64_t c = y >> 32, d = y & 0xffffffff;

    uint64_t ac = a * c;
    uint64_t bc = b * c;
    uint64_t ad = a * d;
    uint64_t bd = b * d;

    uint64_t mid34 = (bd >> 32) + (bc & 0xffffffff) + (ad & 0xffffffff);

    uint64_t upper64 = ac + (bc >> 32) + (ad >> 32) + (mid34 >> 32);
//  uint64_t lower64 = (mid34 << 32) | (bd & 0xffffffff);

    return upper64;
}

int main( void )
{
    uint64_t x = 0x0000000100000003;
    uint64_t y = 0x55555555ffffffff;

    uint128_t m = x, n = y;
    uint128_t p = m * n;

    uint64_t top = p >> 64;
    printf( "%016llx %016llx\n", top, top64( x, y ) );
}

Getting the high part of 64 bit integer multiplication

If you're using gcc and the version you have supports 128 bit numbers (try using __uint128_t) then performing the 128 multiply and extracting the upper 64 bits is likely to be the most efficient way of getting the result.

If your compiler doesn't support 128 bit numbers, then Yakk's answer is correct. However, it may be too brief for general consumption. In particular, an actual implementation has to be careful of overflowing 64 bit integers.

The simple and portable solution he proposes is to break each of a and b into 2 32-bit numbers and then multiply those 32 bit numbers using the 64 bit multiply operation. If we write:

uint64_t a_lo = (uint32_t)a;
uint64_t a_hi = a >> 32;
uint64_t b_lo = (uint32_t)b;
uint64_t b_hi = b >> 32;

then it is obvious that:

a = (a_hi << 32) + a_lo;
b = (b_hi << 32) + b_lo;

and:

a * b = ((a_hi << 32) + a_lo) * ((b_hi << 32) + b_lo)
      = ((a_hi * b_hi) << 64) +
        ((a_hi * b_lo) << 32) +
        ((b_hi * a_lo) << 32) +
          a_lo * b_lo

provided the calculation is performed using 128 bit (or greater) arithmetic.

But this problem requires that we perform all the calculcations using 64 bit arithmetic, so we have to worry about overflow.

Since a_hi, a_lo, b_hi, and b_lo are all unsigned 32 bit numbers, their product will fit in an unsigned 64 bit number without overflow. However, the intermediate results of the above calculation will not.

The following code will implement mulhi(a, b) when the mathemetics must be performed modulo 2^64:

uint64_t    a_lo = (uint32_t)a;
uint64_t    a_hi = a >> 32;
uint64_t    b_lo = (uint32_t)b;
uint64_t    b_hi = b >> 32;

uint64_t    a_x_b_hi =  a_hi * b_hi;
uint64_t    a_x_b_mid = a_hi * b_lo;
uint64_t    b_x_a_mid = b_hi * a_lo;
uint64_t    a_x_b_lo =  a_lo * b_lo;

uint64_t    carry_bit = ((uint64_t)(uint32_t)a_x_b_mid +
                         (uint64_t)(uint32_t)b_x_a_mid +
                         (a_x_b_lo >> 32) ) >> 32;

uint64_t    multhi = a_x_b_hi +
                     (a_x_b_mid >> 32) + (b_x_a_mid >> 32) +
                     carry_bit;

return multhi;

As Yakk points out, if you don't mind being off by +1 in the upper 64 bits, you can omit the calculation of the carry bit.

How can I multiply 64 bit operands and get 128 bit result portably?

As I understand the question, you want a portable pure C implementation of 64 bit multiplication, with output to a 128 bit value, stored in two 64 bit values. In which case this article purports to have what you need. That code is written for C++. It doesn't take much to turn it into C code:

void mult64to128(uint64_t op1, uint64_t op2, uint64_t *hi, uint64_t *lo)
{
    uint64_t u1 = (op1 & 0xffffffff);
    uint64_t v1 = (op2 & 0xffffffff);
    uint64_t t = (u1 * v1);
    uint64_t w3 = (t & 0xffffffff);
    uint64_t k = (t >> 32);

    op1 >>= 32;
    t = (op1 * v1) + k;
    k = (t & 0xffffffff);
    uint64_t w1 = (t >> 32);

    op2 >>= 32;
    t = (u1 * op2) + k;
    k = (t >> 32);

    *hi = (op1 * op2) + w1 + k;
    *lo = (t << 32) + w3;
}

Fastest way to multiply two 64-bit ints to 128-bit then to 64-bit?

As a commenter on the question pointed out, this is most easily accomplished efficiently by machine-dependent code, rather than by portable code. The asker states that the main platform is x86_64, and that has a built-in instruction for performing 64 ✕ 64 → 128 bit multiplication. This is easily accessed using a small piece of inline assembly. Note that details of inline assembly may differ somewhat with compiler, the code below was built with the Intel C/C++ compiler.

#include <stdint.h>

/* compute mul_wide (a, b) >> s, for s in [0,63] */
int64_t mulshift (int64_t a, int64_t b, int s)
{
    int64_t res;
    __asm__ (
        "movq  %1, %%rax;\n\t"          // rax = a
        "movl  %3, %%ecx;\n\t"          // ecx = s
        "imulq %2;\n\t"                 // rdx:rax = a * b
        "shrdq %%cl, %%rdx, %%rax;\n\t" // rax = int64_t (rdx:rax >> s)
        "movq  %%rax, %0;\n\t"          // res = rax
        : "=rm" (res)
        : "rm"(a), "rm"(b), "rm"(s)
        : "%rax", "%rdx", "%ecx");
    return res;
}

A portable C99 equivalent to the above code is shown below. I have tested this extensively against the inline assembly version and no mismatches were found.

void umul64wide (uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = (uint64_t)(uint32_t)a;
    uint64_t a_hi = a >> 32;
    uint64_t b_lo = (uint64_t)(uint32_t)b;
    uint64_t b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;

    uint32_t cy = (uint32_t)(((p0 >> 32) + (uint32_t)p1 + (uint32_t)p2) >> 32);

    *lo = p0 + (p1 << 32) + (p2 << 32);
    *hi = p3 + (p1 >> 32) + (p2 >> 32) + cy;
}

void mul64wide (int64_t a, int64_t b, int64_t *hi, int64_t *lo)
{
    umul64wide ((uint64_t)a, (uint64_t)b, (uint64_t *)hi, (uint64_t *)lo);
    if (a < 0LL) *hi -= b;
    if (b < 0LL) *hi -= a;
}

/* compute mul_wide (a, b) >> s, for s in [0,63] */
int64_t mulshift (int64_t a, int64_t b, int s)
{
    int64_t res;
    int64_t hi, lo;
    mul64wide (a, b, &hi, &lo);
    if (s) {
        res = ((uint64_t)hi << (64 - s)) | ((uint64_t)lo >> s);
    } else {
        res = lo;
    }
    return res;
}

Catch and compute overflow during multiplication of two large integers

1. Detecting the overflow:

x = a * b;
if (a != 0 && x / a != b) {
    // overflow handling
}

Edit: Fixed division by 0 (thanks Mark!)

2. Computing the carry is quite involved. One approach is to split both operands into half-words, then apply long multiplication to the half-words:

uint64_t hi(uint64_t x) {
    return x >> 32;
}

uint64_t lo(uint64_t x) {
    return ((1ULL << 32) - 1) & x;
}

void multiply(uint64_t a, uint64_t b) {
    // actually uint32_t would do, but the casting is annoying
    uint64_t s0, s1, s2, s3; 
    
    uint64_t x = lo(a) * lo(b);
    s0 = lo(x);
    
    x = hi(a) * lo(b) + hi(x);
    s1 = lo(x);
    s2 = hi(x);
    
    x = s1 + lo(a) * hi(b);
    s1 = lo(x);
    
    x = s2 + hi(a) * hi(b) + hi(x);
    s2 = lo(x);
    s3 = hi(x);
    
    uint64_t result = s1 << 32 | s0;
    uint64_t carry = s3 << 32 | s2;
}

To see that none of the partial sums themselves can overflow, we consider the worst case:

        x = s2 + hi(a) * hi(b) + hi(x)

Let B = 1 << 32. We then have

            x <= (B - 1) + (B - 1)(B - 1) + (B - 1)
              <= B*B - 1
               < B*B

I believe this will work - at least it handles Sjlver's test case. Aside from that, it is untested (and might not even compile, as I don't have a C++ compiler at hand anymore).

Reasonably Portable Way to Get Top 64-Bits from 64X64 Bit Multiply