What Is the Performance Impact of Using int64_t Instead of int32_t on 32-Bit Systems

what is the performance impact of using int64_t instead of int32_t on 32-bit systems?

which factors influence performance of these operations? Probably the
compiler and compiler version; but does the operating system or the
CPU make/model influence this as well?

Mostly the processor architecture (and model; wherever I mention processor architecture in this section, read it as including the model). The compiler may have some influence, but most compilers do pretty well on this, so the processor architecture will have a bigger influence than the compiler.

The operating system will have no influence whatsoever (other than "if you change OS, you need to use a different type of compiler which changes what the compiler does" in some cases - but that's probably a small effect).

Will a normal 32-bit system use the 64-bit registers of modern CPUs?

This is not possible. If the system is in 32-bit mode, it will act as a 32-bit system: the upper 32 bits of the registers are completely invisible, just as they would be if the system was actually a "true 32-bit system".

which operations will be especially slow when emulated on 32-bit? Or which will have nearly no slowdown?

Addition and subtraction get worse, because each has to be done as a sequence of two operations, and the second operation requires the first to have completed (it needs the carry or borrow). This is not the case if the compiler is just producing two add operations on independent data.

Multiplication will get a lot worse if the input parameters are actually 64 bits wide, so 2^35 * 83 is worse than 2^31 * 2^31, for example. This is because the processor can do a 32 x 32 bit multiply into a 64-bit result pretty well, in some 5-10 clock cycles. A 64 x 64 bit multiply requires a fair bit of extra code, so it will take longer.
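To make the "extra code" concrete, here is a minimal sketch of how the low 64 bits of a 64 x 64 multiply can be built from 32-bit pieces. The function name mul64_via_32 is made up for illustration, and this is not necessarily the exact sequence any particular compiler emits.

```cpp
#include <cstdint>

// Hypothetical helper: low 64 bits of a 64 x 64 multiply, using only
// 32-bit operands for each individual multiplication.
std::uint64_t mul64_via_32(std::uint64_t a, std::uint64_t b)
{
    const std::uint32_t a_lo = static_cast<std::uint32_t>(a);
    const std::uint32_t a_hi = static_cast<std::uint32_t>(a >> 32);
    const std::uint32_t b_lo = static_cast<std::uint32_t>(b);
    const std::uint32_t b_hi = static_cast<std::uint32_t>(b >> 32);

    // One widening 32 x 32 -> 64 multiply for the low halves.
    const std::uint64_t low = static_cast<std::uint64_t>(a_lo) * b_lo;

    // Two cross products. Only their low 32 bits reach the 64-bit result,
    // so wrapping 32-bit arithmetic is enough. a_hi * b_hi is shifted out
    // of the result entirely and is never computed.
    const std::uint32_t cross = a_lo * b_hi + a_hi * b_lo;

    return low + (static_cast<std::uint64_t>(cross) << 32);
}
```

When both high halves are zero, the cross products vanish and only the single widening multiply is left, which is why small inputs are so much cheaper.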

Division is a similar problem to multiplication, but here it's OK to take a 64-bit input on the one side, divide it by a 32-bit value and get a 32-bit value out. Since it's hard to predict when this will work (the quotient has to fit in 32 bits), the 64-bit divide is probably nearly always slow.

The data will also take twice as much cache space, which may impact the results. As a similar consequence, general assignment and passing data around will take at least twice as long, since there is twice as much data to operate on.
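As a rough sketch of the easier case, a 64-bit value can be divided by a 32-bit divisor in two steps, schoolbook style, where each step produces a quotient that fits in 32 bits. The names DivResult and div64_by_32 are invented for this example, and the second step is written with uint64_t only to keep the code portable; on a 32-bit x86 target it corresponds to the 64-by-32 division the hardware offers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for the sketch below.
struct DivResult {
    std::uint64_t quotient;
    std::uint32_t remainder;
};

// Divide a 64-bit value by a 32-bit divisor, one 32-bit "digit" at a time.
DivResult div64_by_32(std::uint64_t n, std::uint32_t d)
{
    assert(d != 0);
    const std::uint32_t n_hi = static_cast<std::uint32_t>(n >> 32);
    const std::uint32_t n_lo = static_cast<std::uint32_t>(n);

    // Step 1: plain 32-bit division of the high half.
    const std::uint32_t q_hi = n_hi / d;
    const std::uint32_t r_hi = n_hi % d;

    // Step 2: the remainder becomes the high half of the next dividend.
    // Because r_hi < d, this quotient is guaranteed to fit in 32 bits.
    const std::uint64_t rest = (static_cast<std::uint64_t>(r_hi) << 32) | n_lo;
    const std::uint32_t q_lo = static_cast<std::uint32_t>(rest / d);
    const std::uint32_t r_lo = static_cast<std::uint32_t>(rest % d);

    return { (static_cast<std::uint64_t>(q_hi) << 32) | q_lo, r_lo };
}
```

A full 64-by-64 division has no such short path when the divisor itself needs more than 32 bits, which fits with the large slowdown measured further down.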

The compiler will also need to use more registers.

are there any existing benchmark results for using int64_t/uint64_t on 32-bit systems?

Probably, but I'm not aware of any. And even if there are, it would only be somewhat meaningful to you, since the mix of operations is HIGHLY critical to the speed of operations.

If performance is an important part of your application, then benchmark YOUR code (or some representative part of it). It doesn't really matter if Benchmark X gives 5%, 25% or 103% slower results, if your code is some completely different amount slower or faster under the same circumstances.
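If you want something more portable than reading the time-stamp counter, a minimal timing harness could look like the sketch below. It uses std::chrono::steady_clock; the names Timing and time_sum are made up for illustration, and summing a vector is only a stand-in for your own code.

```cpp
#include <chrono>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical result type: the computed value plus the average time per run.
template <typename T>
struct Timing {
    T result;
    double nanoseconds;
};

// Time `repeats` summations of `data` (repeats is assumed to be > 0).
// Returning the accumulated sum keeps the work observable, although an
// aggressive optimizer may still simplify a loop this regular.
template <typename T>
Timing<T> time_sum(const std::vector<T>& data, int repeats)
{
    using clock = std::chrono::steady_clock;

    T sink = 0;
    const auto start = clock::now();
    for (int r = 0; r < repeats; ++r)
        sink += std::accumulate(data.begin(), data.end(), T{0});
    const auto stop = clock::now();

    const double total =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return { sink, total / repeats };
}
```

Calling time_sum on a std::vector<std::uint32_t> and on a std::vector<std::uint64_t> of the same length, built with the same compiler flags you ship with, gives a first comparison for your own data sizes.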

does anyone have own experience about this performance impact?

I've recompiled some code that uses 64-bit integers for 64-bit architecture, and found the performance improve by some substantial amount - as much as 25% on some bits of code.

Changing your OS to a 64-bit version of the same OS would help, perhaps?

Edit:

Because I like to find out what the difference is in these sorts of things, I have written a bit of code, with some primitive templates (still learning that bit: templates aren't exactly my hottest topic, I must say. Give me bit fiddling and pointer arithmetic, and I'll (usually) get it right...)

Here's the code I wrote, trying to replicate a few common functions:

#include <iostream>
#include <cstdint>
#include <ctime>

using namespace std;

// Read the x86 time-stamp counter (GCC/Clang inline assembly, x86 only).
static __inline__ uint64_t rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (uint64_t)lo)|( ((uint64_t)hi)<<32 );
}

// Sum of all elements.
template<typename T>
static T add_numbers(const T *v, const int size)
{
    T sum = 0;
    for(int i = 0; i < size; i++)
        sum += v[i];
    return sum;
}

// Sum each row of a square matrix, then sum the row totals.
template<typename T, const int size>
static T add_matrix(const T v[size][size])
{
    T sum[size] = {};
    for(int i = 0; i < size; i++)
    {
        for(int j = 0; j < size; j++)
            sum[i] += v[i][j];
    }
    T tsum=0;
    for(int i = 0; i < size; i++)
        tsum += sum[i];
    return tsum;
}

// Sum of every element multiplied by mul.
template<typename T>
static T add_mul_numbers(const T *v, const T mul, const int size)
{
    T sum = 0;
    for(int i = 0; i < size; i++)
        sum += v[i] * mul;
    return sum;
}

// Sum of every element divided by mul (the divisor, despite its name).
template<typename T>
static T add_div_numbers(const T *v, const T mul, const int size)
{
    T sum = 0;
    for(int i = 0; i < size; i++)
        sum += v[i] / mul;
    return sum;
}

template<typename T>
void fill_array(T *v, const int size)
{
    for(int i = 0; i < size; i++)
        v[i] = i;
}

template<typename T, const int size>
void fill_array(T v[size][size])
{
    for(int i = 0; i < size; i++)
        for(int j = 0; j < size; j++)
            v[i][j] = i + size * j;
}

// Non-template wrappers, one per type, so runbench can take a plain function pointer.
uint32_t bench_add_numbers(const uint32_t v[], const int size)
{
    uint32_t res = add_numbers(v, size);
    return res;
}

uint64_t bench_add_numbers(const uint64_t v[], const int size)
{
    uint64_t res = add_numbers(v, size);
    return res;
}

uint32_t bench_add_mul_numbers(const uint32_t v[], const int size)
{
    const uint32_t c = 7;
    uint32_t res = add_mul_numbers(v, c, size);
    return res;
}

uint64_t bench_add_mul_numbers(const uint64_t v[], const int size)
{
    const uint64_t c = 7;
    uint64_t res = add_mul_numbers(v, c, size);
    return res;
}

uint32_t bench_add_div_numbers(const uint32_t v[], const int size)
{
    const uint32_t c = 7;
    uint32_t res = add_div_numbers(v, c, size);
    return res;
}

uint64_t bench_add_div_numbers(const uint64_t v[], const int size)
{
    const uint64_t c = 7;
    uint64_t res = add_div_numbers(v, c, size);
    return res;
}

template<const int size>
uint32_t bench_matrix(const uint32_t v[size][size])
{
    uint32_t res = add_matrix(v);
    return res;
}

template<const int size>
uint64_t bench_matrix(const uint64_t v[size][size])
{
    uint64_t res = add_matrix(v);
    return res;
}

// Fill the array, time one call of func in clocks, and print the result.
template<typename T>
void runbench(T (*func)(const T *v, const int size), const char *name, T *v, const int size)
{
    fill_array(v, size);

    uint64_t t = rdtsc();
    T res = func(v, size);
    t = rdtsc() - t;
    cout << "result = " << res << endl;
    cout << name << " time in clocks " << dec << t << endl;
}

// Same as runbench, for the square-matrix variants.
template<typename T, const int size>
void runbench2(T (*func)(const T v[size][size]), const char *name, T v[size][size])
{
    fill_array(v);

    uint64_t t = rdtsc();
    T res = func(v);
    t = rdtsc() - t;
    cout << "result = " << res << endl;
    cout << name << " time in clocks " << dec << t << endl;
}

int main()
{
    // spin up CPU to full speed...
    time_t t = time(NULL);
    while(t == time(NULL)) ;

    const int vsize=10000;

    uint32_t v32[vsize];
    uint64_t v64[vsize];

    uint32_t m32[100][100];
    uint64_t m64[100][100];

    runbench(bench_add_numbers, "Add 32", v32, vsize);
    runbench(bench_add_numbers, "Add 64", v64, vsize);

    runbench(bench_add_mul_numbers, "Add Mul 32", v32, vsize);
    runbench(bench_add_mul_numbers, "Add Mul 64", v64, vsize);

    runbench(bench_add_div_numbers, "Add Div 32", v32, vsize);
    runbench(bench_add_div_numbers, "Add Div 64", v64, vsize);

    runbench2(bench_matrix, "Matrix 32", m32);
    runbench2(bench_matrix, "Matrix 64", m64);
}

Compiled with:

g++ -Wall -m32 -O3 -o 32vs64 32vs64.cpp -std=c++0x

And the results are:

Note: see the 2016 results below. These results are slightly optimistic, because SSE instructions are used in 64-bit mode but not in 32-bit mode.

result = 49995000
Add 32 time in clocks 20784
result = 49995000
Add 64 time in clocks 30358
result = 349965000
Add Mul 32 time in clocks 30182
result = 349965000
Add Mul 64 time in clocks 79081
result = 7137858
Add Div 32 time in clocks 60167
result = 7137858
Add Div 64 time in clocks 457116
result = 49995000
Matrix 32 time in clocks 22831
result = 49995000
Matrix 64 time in clocks 23823

As you can see, addition and multiplication aren't that much worse (roughly 1.5x and 2.6x slower). Division gets really bad, at about 7.6x slower. Interestingly, the matrix addition shows hardly any difference at all.

And is it faster on 64-bit, I hear some of you ask?
Using the same compiler options, just -m64 instead of -m32: yes, a lot faster:

result = 49995000
Add 32 time in clocks 8366
result = 49995000
Add 64 time in clocks 16188
result = 349965000
Add Mul 32 time in clocks 15943
result = 349965000
Add Mul 64 time in clocks 35828
result = 7137858
Add Div 32 time in clocks 50176
result = 7137858
Add Div 64 time in clocks 50472
result = 49995000
Matrix 32 time in clocks 12294
result = 49995000
Matrix 64 time in clocks 14733

Edit, update for 2016:
Four variants: with and without SSE, in 32- and 64-bit mode of the compiler.

I use clang++ as my usual compiler these days. I tried compiling with g++ (but it would still be a different version than above, as I've updated my machine, and I have a different CPU too). Since g++ failed to compile the no-sse version in 64-bit, I didn't see the point in that. (g++ gives similar results anyway.)

As a short table (times in clocks):

Test name    | no-sse 32 | no-sse 64 | sse 32 | sse 64 |
---------------------------------------------------------
Add uint32_t |     20837 |     10221 |   3701 |   3017 |
Add uint64_t |     18633 |     11270 |   9328 |   9180 |
Add Mul 32   |     26785 |     18342 |  11510 |  11562 |
Add Mul 64   |     44701 |     17693 |  29213 |  16159 |
Add Div 32   |     44570 |     47695 |  17713 |  17523 |
Add Div 64   |    405258 |     52875 | 405150 |  47043 |
Matrix 32    |     41470 |     15811 |  21542 |   8622 |
Matrix 64    |     22184 |     15168 |  13757 |  12448 |

Full results with compile options.

$ clang++ -m32 -mno-sse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 20837
result = 49995000
Add 64 time in clocks 18633
result = 349965000
Add Mul 32 time in clocks 26785
result = 349965000
Add Mul 64 time in clocks 44701
result = 7137858
Add Div 32 time in clocks 44570
result = 7137858
Add Div 64 time in clocks 405258
result = 49995000
Matrix 32 time in clocks 41470
result = 49995000
Matrix 64 time in clocks 22184

$ clang++ -m32 -msse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 3701
result = 49995000
Add 64 time in clocks 9328
result = 349965000
Add Mul 32 time in clocks 11510
result = 349965000
Add Mul 64 time in clocks 29213
result = 7137858
Add Div 32 time in clocks 17713
result = 7137858
Add Div 64 time in clocks 405150
result = 49995000
Matrix 32 time in clocks 21542
result = 49995000
Matrix 64 time in clocks 13757

$ clang++ -m64 -msse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 3017
result = 49995000
Add 64 time in clocks 9180
result = 349965000
Add Mul 32 time in clocks 11562
result = 349965000
Add Mul 64 time in clocks 16159
result = 7137858
Add Div 32 time in clocks 17523
result = 7137858
Add Div 64 time in clocks 47043
result = 49995000
Matrix 32 time in clocks 8622
result = 49995000
Matrix 64 time in clocks 12448

$ clang++ -m64 -mno-sse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 10221
result = 49995000
Add 64 time in clocks 11270
result = 349965000
Add Mul 32 time in clocks 18342
result = 349965000
Add Mul 64 time in clocks 17693
result = 7137858
Add Div 32 time in clocks 47695
result = 7137858
Add Div 64 time in clocks 52875
result = 49995000
Matrix 32 time in clocks 15811
result = 49995000
Matrix 64 time in clocks 15168

Why class size increases when int64_t changes to int32_t

In your first example

int64_t first : 40;
int64_t second : 24;

Both first and second share the 64 bits of a single 64-bit integer, so the class is the size of one 64-bit integer. In the second example you have

int64_t first : 40;
int32_t second : 24;

These are two separate bit fields stored in two different chunks of memory. You use 40 bits of the 64-bit integer and then 24 bits of another, 32-bit integer. This means you need at least 12 bytes (this example is using 8-bit bytes). Most likely the extra 4 bytes you see are padding to align the class on 64-bit boundaries.
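A quick way to see what your own compiler does is to ask it for both sizes. The struct names SameType and MixedType below are invented for this sketch, and no particular sizes are promised: some implementations start a new storage unit when the declared type changes, while others may still pack both fields into 8 bytes.

```cpp
#include <cstddef>
#include <cstdint>

// Both bit fields are declared with the same 64-bit type.
struct SameType {
    std::int64_t first : 40;
    std::int64_t second : 24;
};

// The second bit field is declared with a 32-bit type instead.
struct MixedType {
    std::int64_t first : 40;
    std::int32_t second : 24;
};

// The layout is implementation-defined, so these values can differ
// between compilers and ABIs.
constexpr std::size_t same_size = sizeof(SameType);
constexpr std::size_t mixed_size = sizeof(MixedType);
```

Printing same_size and mixed_size (for example with std::cout) on each compiler you care about shows whether the layout change described above applies to you.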

As other answers and comments have pointed out this is implementation defined behavior and you can/will see different results on different implementations.

Intel(x86_64) 64 bit vs 32 bit integer arithmetic performance difference

In general, 64-bit arithmetic is as fast as 32-bit arithmetic, if you ignore that larger operands take up more memory and bandwidth, and that on x86-64 addressing the full 64-bit registers requires longer instructions.

However, you have managed to hit one of the few exceptions to this rule, namely the div instruction for calculating divisions.

Accessing hi and low part of int64_t with int32_t

First I should note that int64_t is a C99 feature, but older C89 compilers often already support double-word operations via some extension type like long long or __int64. Check whether that is the case for your old compiler. If not, check whether your compiler has an extension to get the carry flag, like __builtin_addc() or __builtin_add_overflow(). If all of that fails, go to the next step.
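If the compiler does offer __builtin_add_overflow() (a GCC and Clang extension), a two-word addition could be sketched as below. The names Words64 and add_with_builtin are made up for illustration, and the sketch is written as C++ rather than C89.

```cpp
#include <cstdint>

// Hypothetical two-word representation of an unsigned 64-bit value.
struct Words64 {
    std::uint32_t h;  // high 32 bits
    std::uint32_t l;  // low 32 bits
};

// Requires the GCC/Clang extension __builtin_add_overflow.
Words64 add_with_builtin(Words64 x, Words64 y)
{
    Words64 z;
    // The builtin stores the wrapped sum and returns true if it overflowed,
    // which for unsigned operands is exactly the carry out of the low word.
    const bool carry = __builtin_add_overflow(x.l, y.l, &z.l);
    z.h = x.h + y.h + (carry ? 1u : 0u);
    return z;
}
```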

Now %0 = %1 + %2; is not an assembly instruction in any architecture I know, but it looks more readable than the traditional mnemonic syntax. However you don't even need to use assembly for multiword additions/subtractions like this. It's very simple to do directly in C since

  • basic operations in 2's complement don't depend on the signedness of the type, and
  • if an overflow occurs then the (unsigned) result will be smaller than the operands, which we can use to get the carry bit

Regarding the implementation, since your old compiler has no 64-bit type, there's no need to declare the union, and you can't do that either because int64_t wasn't declared before. You can just access the whole thing as a struct.

#if COMPILER_VERSION <= SOME_VERSION

typedef struct UINT64_T {
    uint32_t h;
    uint32_t l;
} uint64_t;

uint64_t add(uint64_t x, uint64_t y)
{
    uint64_t z;
    z.l = x.l + y.l; // add the low parts
    z.h = x.h + y.h + (z.l < x.l); // add the high parts and carry
    return z;
}

// ...

#else
uint64_t add(uint64_t x, uint64_t y)
{
    return x + y;
}
#endif

t = add(2, 3);

If you need a signed type then a small change is needed:

typedef struct INT64_T {
    int32_t h;
    uint32_t l;
} int64_t;

The add/sub/mul functions are still the same as the unsigned version.

A smart modern compiler will recognize the z.l < x.l pattern and turn it into an add/adc pair on architectures that have them, so there's no comparison and/or branch there. If not, then unfortunately you still need to fall back to inline assembly.
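The same comparison trick covers subtraction, where the comparison detects a borrow instead of a carry. Below is a minimal sketch; the names Parts64, sub_parts and to_u64 are made up for illustration, and it is written as C++ with a native 64-bit type available only so the result can be checked.

```cpp
#include <cstdint>

// Hypothetical two-word representation of an unsigned 64-bit value.
struct Parts64 {
    std::uint32_t h;  // high 32 bits
    std::uint32_t l;  // low 32 bits
};

// Two-word subtraction: a borrow is needed exactly when x.l < y.l.
Parts64 sub_parts(Parts64 x, Parts64 y)
{
    Parts64 z;
    z.l = x.l - y.l;                    // subtract the low parts (wraps mod 2^32)
    z.h = x.h - y.h - (x.l < y.l);      // subtract the high parts and the borrow
    return z;
}

// Helper for checking against native 64-bit arithmetic.
std::uint64_t to_u64(Parts64 p)
{
    return (static_cast<std::uint64_t>(p.h) << 32) | p.l;
}
```

For example, subtracting {0, 1} from {1, 0} borrows from the high word and yields {0, 0xFFFFFFFF}.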

See also

  • Multiword addition in C
  • Access the flags without inline assembly?
  • Efficient 128-bit addition using carry flag
  • An efficient way to do basic 128 bit integer calculations in C++?

Does the use of 64-bit variables in 32-bit code cause a performance penalty

Manipulating values that are larger than the registers normally needs more CPU cycles, so yes, this may have a large impact on performance. To be sure whether that is relevant in your case, you have to profile anyway.


