Float vs Double Performance

double or float, which is faster?

Depends on what the native hardware does.

  • If the hardware is (or is like) x86 with legacy x87 math, float and double are both extended (for free) to an internal 80-bit format, so both have the same performance (except for cache footprint / memory bandwidth)

  • If the hardware implements both natively, like most modern ISAs (including x86-64 where SSE2 is the default for scalar FP math), then usually most FPU operations are the same speed for both. Double division and sqrt can be slower than float, as well as of course being significantly slower than multiply or add. (Float being smaller can mean fewer cache misses. And with SIMD, twice as many elements per vector for loops that vectorize).

  • If the hardware implements only double, then float will be slower if conversion to/from the native double format isn't free as part of float-load and float-store instructions.

  • If the hardware implements float only, then emulating double with it will cost even more time. In this case, float will be faster.

  • If the hardware implements neither, then both have to be emulated in software. In this case, both will be slow, but double will be slightly slower (more load and store operations at the least).

The quote you mention is probably referring to the x86 platform, where the first case applies. But this doesn't hold true in general.

Also beware that x * 3.3 + y for float x,y will trigger promotion to double for both variables. This is not the hardware's fault, and you should avoid it by writing 3.3f to let your compiler make efficient asm that actually keeps numbers as floats if that's what you want.
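For example, a minimal C sketch of the two forms:

float x = 1.5f, y = 2.5f;

// 3.3 is a double literal: x is promoted to double, the multiply and add
// happen in double precision, and the result is converted back to float.
float slow = x * 3.3 + y;

// 3.3f keeps the whole expression in single precision.
float fast = x * 3.3f + y;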

Float vs Double Performance

On x86 processors, at least, float and double will each be converted to a 10-byte real by the FPU for processing. The FPU doesn't have separate processing units for the different floating-point types it supports.

The age-old advice that float is faster than double applied 100 years ago when most CPUs didn't have built-in FPUs (and few people had separate FPU chips), so most floating-point manipulation was done in software. On these machines (which were powered by steam generated by the lava pits), it was faster to use floats. Now the only real benefit to floats is that they take up less space (which only matters if you have millions of them).

Is using double faster than float?

There isn't a single "Intel CPU", especially in terms of which operations are optimized relative to others, but most of them, at the CPU level (specifically within the FPU), are such that the answer to your question:

are double operations just as fast or faster than float operations for +, -, *, and /?

is "yes" -- within the CPU, except for division and sqrt which are somewhat slower for double than for float. (Assuming your compiler uses SSE2 for scalar FP math, like all x86-64 compilers do, and some 32-bit compilers depending on options. Legacy x87 doesn't have different widths in registers, only in memory (it converts on load/store), so historically even sqrt and division were just as slow for double).

For example, Haswell has a divsd throughput of one per 8 to 14 cycles (data-dependent), but a divss (scalar single) throughput of one per 7 cycles. x87 fdiv is 8 to 18 cycle throughput. (Numbers from https://agner.org/optimize/. Latency correlates with throughput for division, but is higher than the throughput numbers.)

The float versions of many library functions like logf(float) and sinf(float) will also be faster than log(double) and sin(double), because they have many fewer bits of precision to get right. They can use polynomial approximations with fewer terms to reach full precision for float than for double.
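For instance, C's math library exposes the single-precision versions under f-suffixed names:

#include <math.h>

float  xf = 0.5f;
double xd = 0.5;

float  sf = sinf(xf);  // single-precision routine: fewer polynomial terms
double sd = sin(xd);   // double-precision routine: more bits to get right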


However, taking up twice the memory for each number clearly implies a heavier load on the cache(s) and more memory bandwidth to fill and spill those cache lines from/to RAM; the time you care about the performance of a floating-point operation is when you're doing a lot of them, so the memory and cache considerations are crucial.

@Richard's answer points out that there are also other ways to perform FP operations (the SSE / SSE2 instructions; good old MMX was integers-only), especially suitable for simple operations on a lot of data ("SIMD", single instruction / multiple data), where each vector register can pack 4 single-precision floats or only 2 double-precision ones, so this effect will be even more marked.

In the end, you do have to benchmark, but my prediction is that for reasonable (i.e., large;-) benchmarks, you'll find an advantage in sticking with single precision (assuming of course that you don't need the extra bits of precision!-).
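If you want a starting point for such a benchmark, here is a minimal, unscientific C sketch (the array size is arbitrary; compile with optimization, and note that strict FP semantics may keep the compiler from vectorizing the reductions unless you allow reassociation, e.g. with -ffast-math):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000

int main(void) {
    float  *fa = malloc(N * sizeof *fa);
    double *da = malloc(N * sizeof *da);
    if (!fa || !da) return 1;
    for (int i = 0; i < N; ++i) { fa[i] = 1.0f / (i + 1); da[i] = 1.0 / (i + 1); }

    clock_t t0 = clock();
    float fsum = 0.0f;
    for (int i = 0; i < N; ++i) fsum += fa[i];   // single-precision reduction
    clock_t t1 = clock();
    double dsum = 0.0;
    for (int i = 0; i < N; ++i) dsum += da[i];   // double-precision reduction
    clock_t t2 = clock();

    printf("float:  sum=%g  ticks=%ld\n", fsum, (long)(t1 - t0));
    printf("double: sum=%g  ticks=%ld\n", dsum, (long)(t2 - t1));
    free(fa); free(da);
    return 0;
}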

In what situations is it better to use a float over a double in Java?

Since your question is mostly about performance, this article presents you with some specific calculations (keep in mind though that this article is specific to neural networks, and your calculations may be completely different to what they're doing in the article): http://web.archive.org/web/20150310213841/http://www.heatonresearch.com/content/choosing-between-java%E2%80%99s-float-and-double

Some of the relevant material from the link is reproduced here:

Both double and float can support relatively large numbers. The upper
and lower range are really not a consideration for neural networks.
Float can handle numbers between 1.40129846432481707e-45 to
3.40282346638528860e+38...Basically, float can handle about 7 decimal places. A double can handle about 16 decimal places.

Matrix multiplication is one of the most common mathematical
operations for neural network programming. By no means is it the only
operation, but it will provide a good benchmark. The following program
will be used to benchmark a double.

Skipping all the code, the table on the website shows that for a 100x100 matrix multiplication, they actually gain around 10% in performance by using doubles. For a 500x100 matrix multiplication, the performance loss from using doubles is around 7%, and for a 1000x1000 matrix multiplication that loss grows to around 17%.

For the small 100x100 matrix switching to float may actually decrease
performance. As the size of the matrix increases, the percent gain
increases. With a very large matrix the performance gain increases to
17%. 17% is worth considering.

Why is double preferred over float?

In my opinion the answers so far don't really get the right point across, so here's my crack at it.

The short answer is C++ developers use doubles over floats:

  • To avoid premature optimization when they don't understand the performance trade-offs well ("they have higher precision, why not?" is the thought process)
  • Habit
  • Culture
  • To match library function signatures
  • To match simple-to-write floating point literals (you can write 0.0 instead of 0.0f)

It's true that double may be as fast as float for a single computation, because most FPUs have a wider internal representation than either the 32-bit float or the 64-bit double.

However, that's only a small piece of the picture. Nowadays, operation-level optimizations don't mean anything if you're bottlenecked on cache/memory bandwidth.

Here is why some developers seeking to optimize their code should look into using 32-bit floats over 64-bit doubles:

  • They fit in half the memory, which is like having all your caches be twice as large. (big win!!!)
  • If you really care about performance, you'll use SSE instructions. SSE instructions that operate on floating-point values come in distinct forms for the 32-bit and 64-bit representations: the 32-bit versions fit 4 values in a 128-bit register operand, while the 64-bit versions fit only 2. In this scenario you can likely double your FLOPS by using float over double, because each instruction operates on twice as much data, as sketched below.
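A minimal sketch of that width difference using SSE2 intrinsics (x86-specific; the function name is illustrative):

#include <emmintrin.h>  // SSE2 intrinsics

void width_demo(float *fout, double *dout) {
    __m128  f = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  // 4 floats in one 128-bit register
    __m128d d = _mm_set_pd(2.0, 1.0);                // only 2 doubles fit

    _mm_storeu_ps(fout, _mm_add_ps(f, f));  // one addps adds 4 floats at once
    _mm_storeu_pd(dout, _mm_add_pd(d, d));  // one addpd adds only 2 doubles
}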

In general, there is a real lack of knowledge of how floating point numbers really work in the majority of developers I've encountered. So I'm not really surprised most developers blindly use double.

Are doubles faster than floats in C#?

The short answer is, "use whichever precision is required for acceptable results."

Your one guarantee is that operations performed on floating-point data are done in at least the precision of the highest-precision member of the expression. So multiplying two floats is done with at least the precision of float, and multiplying a float and a double would be done with at least double precision. The standard states that "[floating-point] operations may be performed with higher precision than the result type of the operation."

Given that the JIT for .NET attempts to leave your floating point operations in the precision requested, we can take a look at documentation from Intel for speeding up our operations. On the Intel platform your floating point operations may be done in an intermediate precision of 80 bits, and converted down to the precision requested.

From Intel's guide to C++ Floating-point Operations[1] (sorry, I only have the dead-tree version), they mention:

  • Use a single precision type (for example, float) unless the extra precision obtained through double or long double is required. Greater precision types increase memory size and bandwidth requirements.
    ...
  • Avoid mixed data type arithmetic expressions

That last point is important, as you can slow yourself down with unnecessary casts to/from float and double, which result in JIT'd code that asks the x87 to round away from its 80-bit intermediate format between operations!
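In C/C++ terms (which the cited guide targets), the kind of mixing to avoid looks like this (a minimal sketch; the names are illustrative):

float radius = 2.0f;
double pi = 3.141592653589793;

// Mixed types: radius is converted to double, the arithmetic runs in
// double, and the result is converted back to float on assignment.
float mixed = pi * radius * radius;

// Consistent types: everything stays in single precision.
float pi_f = 3.14159265f;
float consistent = pi_f * radius * radius;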

[1]: Yes, it says C++, but the C# standard plus knowledge of the CLR lets us know the information for C++ should be applicable in this instance.

What is the difference between float and double?

Huge difference.

As the name implies, a double has 2x the precision of float[1]. In general a double has 15 decimal digits of precision, while float has 7.

Here's how the number of digits are calculated:

double has 52 mantissa bits + 1 hidden bit: log(2^53) ÷ log(10) = 15.95 digits

float has 23 mantissa bits + 1 hidden bit: log(2^24) ÷ log(10) = 7.22 digits
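You can reproduce those figures with a quick computation, since log(2^n) ÷ log(10) = n × log10(2):

#include <math.h>
#include <stdio.h>

int main(void) {
    printf("double: %.2f digits\n", 53 * log10(2.0));  // 15.95
    printf("float:  %.2f digits\n", 24 * log10(2.0));  // 7.22
    return 0;
}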

This precision loss could lead to greater truncation errors being accumulated when repeated calculations are done, e.g.

float a = 1.f / 81;
float b = 0;
for (int i = 0; i < 729; ++i)
    b += a;
printf("%.7g\n", b); // prints 9.000023

while

double a = 1.0 / 81;
double b = 0;
for (int i = 0; i < 729; ++i)
    b += a;
printf("%.15g\n", b); // prints 8.99999999999996

Also, the maximum value of float is about 3e38, but double is about 1.7e308, so using float can hit "infinity" (i.e. a special floating-point number) much more easily than double for something simple, e.g. computing the factorial of 60.

During testing, maybe a few test cases contain these huge numbers, which may cause your programs to fail if you use floats.
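For example, computing the factorial of 60 overflows float but not double (a minimal sketch):

float  f = 1.0f;
double d = 1.0;
for (int i = 1; i <= 60; ++i) {
    f *= i;  // overflows past FLT_MAX (~3.4e38) and sticks at inf
    d *= i;
}
printf("%g\n", f);  // inf
printf("%g\n", d);  // 8.32099e+81, well within double's ~1.7e308 range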


Of course, sometimes, even double isn't accurate enough, hence we sometimes have long double[1] (the above example gives 9.000000000000000066 on Mac), but all floating point types suffer from round-off errors, so if precision is very important (e.g. money processing) you should use int or a fraction class.
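For money, a common approach is to store amounts as integers in the smallest unit (a minimal sketch; the names are illustrative):

long long price_cents = 1999;             // $19.99, represented exactly
long long total_cents = price_cents * 3;  // integer math: no rounding error

printf("total: $%lld.%02lld\n", total_cents / 100, total_cents % 100);  // $59.97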


Furthermore, don't use += to sum lots of floating point numbers, as the errors accumulate quickly. If you're using Python, use fsum. Otherwise, try to implement the Kahan summation algorithm.
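A minimal C sketch of Kahan summation (note that aggressive optimizations like -ffast-math can optimize the compensation away):

float kahan_sum(const float *x, int n) {
    float sum = 0.0f, c = 0.0f;  // c carries the low-order bits lost so far
    for (int i = 0; i < n; ++i) {
        float y = x[i] - c;      // apply the running correction
        float t = sum + y;       // big + small: low bits of y can be lost
        c = (t - sum) - y;       // recover what was lost (algebraically zero)
        sum = t;
    }
    return sum;
}

Fed 729 copies of 1.f / 81, as in the loop above, this should land much closer to 9 than the naive += version.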


[1]: The C and C++ standards do not specify the representation of float, double and long double. It is possible that all three are implemented as IEEE double-precision. Nevertheless, on most platforms (gcc and MSVC targeting x86, x64, and ARM) float is indeed an IEEE single-precision floating-point number (binary32), and double is an IEEE double-precision floating-point number (binary64).

Java raytracing float vs double

There is pretty much no difference between float and double when it comes to speed of calculation, at least when desktop processors are the platform. Any difference comes from the increased memory bandwidth requirements, because doubles take twice as much space.

It's different for GPU-based calculations: those are more tailored for float, and e.g. Nvidia GPUs' performance drops considerably for double.

I'd go with a mixed approach: store data like polygons with float precision, but do all calculations in double. Small memory footprint, high precision - win-win.
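In C terms, that split might look like this (a minimal sketch; the struct and function names are illustrative):

struct Vec3f { float x, y, z; };  // compact single-precision storage

// Widen to double for the arithmetic; only the stored data stays small.
double dot(struct Vec3f a, struct Vec3f b) {
    return (double)a.x * b.x + (double)a.y * b.y + (double)a.z * b.z;
}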


