Floating Point vs. Integer Calculations on Modern Hardware

Floating point vs integer calculations on modern hardware

Alas, I can only give you an "it depends" answer...

From my experience, there are many, many variables that affect performance... especially when comparing integer and floating point math. It varies strongly from processor to processor (even within the same family such as x86) because different processors have different "pipeline" lengths. Also, some operations are generally very simple (such as addition) and have an accelerated route through the processor, and others (such as division) take much, much longer.

The other big variable is where the data reside. If you only have a few values to add, then all of the data can reside in cache, where they can be quickly sent to the CPU. A very, very slow floating point operation that already has the data in cache will be many times faster than an integer operation where an integer needs to be copied from system memory.

I assume that you are asking this question because you are working on a performance critical application. If you are developing for the x86 architecture, and you need extra performance, you might want to look into using the SSE extensions. This can greatly speed up single-precision floating point arithmetic, as the same operation can be performed on multiple data at once, plus there is a separate* bank of registers for the SSE operations. (I noticed in your second example you used "float" instead of "double", making me think you are using single-precision math).

*Note: Using the old MMX instructions would actually slow down programs, because those old instructions used the same registers as the FPU does, making it impossible to use both the FPU and MMX at the same time.
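
For illustration, here is a minimal sketch of that SSE idea using the standard intrinsics from <xmmintrin.h> (the function name and the assumption that the array length is a multiple of 4 are mine, not part of the original question):

    #include <xmmintrin.h>

    /* Add two float arrays four elements at a time.  n is assumed to be a
       multiple of 4; unaligned loads/stores are used so the pointers do not
       have to be 16-byte aligned. */
    void add_arrays(float *dst, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
        }
    }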

why are floating point operations considered expensive?

I'm going to assume you're talking about x86, but a lot of the below applies equally well to other architectures.

Floating point operations are expensive because the hardware has to do much more work per operation than it does for integers. It's that simple. The format of integers makes addition and subtraction extremely simple to implement in hardware. Floating point numbers are (almost always) implemented in IEEE 754, which stores a number as a sign, exponent, and mantissa; that allows the representation of very large and very small numbers, but it comes at the cost of operation speed, because the hardware has to align exponents, operate on the mantissas, and then normalize and round the result. If numbers only ever had 3 decimal places, you could store them as integers scaled by 1000 and just divide by 1000 at the end; the wide range of magnitudes floating point supports is what complicates things.
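
As a tiny example of that scaled-integer idea (the factor of 1000 is just the "three decimal places" mentioned above; the values are made up):

    #include <stdio.h>

    int main(void)
    {
        /* Keep three decimal places by storing thousandths as integers:
           the additions are exact, and only the final display divides by 1000. */
        long a = 12345;              /* 12.345 */
        long b =  6789;              /*  6.789 */
        long sum = a + b;            /* 19134  ->  19.134 */
        printf("%ld.%03ld\n", sum / 1000, sum % 1000);
        return 0;
    }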

That being said, modern processors are better with floating point numbers than they used to be. Floating-point math was originally implemented on an optional coprocessor - the Intel 80387 in particular - and it could only be accessed with special instructions. You'd push values onto the x87 stack, perform an operation, and then pop the result back into a hardware register. Very slow, because it had to leave the processor. Even more importantly, those particular operations became "risky" to use because you couldn't be sure the coprocessor existed - if it didn't, your program would still work, but it would use software routines that emulated the coprocessor. If you were a game developer and you couldn't rely on, say, inverse square root being fast, you could do the work yourself, and you'd run equally fast on all systems - not slightly faster on some, and much slower on others.

Nowadays, processors have special floating point operations that are designed for performance, and more importantly, guaranteed to be there. So they're very fast, and while floating point ops are unavoidably slower than integer ops, they're not usually enough of a problem to do anything about it - especially at the expense of bugs and complexity. What's more, other answers here suggest that, in most cases, it's a wash.

In any case, performance is now good enough that the old adage kicks in - programmer time is more important than machine time, and you'll definitely spend a lot more time on programming some fancy algorithm avoiding floating point numbers than you would by just using them.

int vs float arithmetic efficiency in Java

As ever with this sort of thing you should set yourself some performance goals, and then profile the app to see if it meets them.

Oftentimes you may find surprising results: that the time taken is hardly affected by the base numerical type at all, or that your algorithm is suboptimal.

And regarding compiler optimisations - they're a real and valid part of performance optimisation.

If using type A is theoretically faster than using type B, but your compiler can optimise type B to be quicker in a real scenario, then that's a valuable piece of evidence, not a source of disappointment.

Does phone CPU have separate integer and floating point compute units that can operate in parallel?

After the x86 architecture has already been discussed in the comments, now about ARM:

Basically, this also depends on the processor model used. Most ARM processors have only two pipelines for SIMD calculations. Some instructions can only be executed on one of the two pipelines, but most can run on either. This also applies to simple ALU operations such as

  • FADD, FSUB, FMUL for floating-point SIMD
  • ADD, SUB, MUL for integer SIMD

If such an addition already has a (maximum) throughput of 2 instructions per cycle, both pipelines are fully utilized, so simple integer SIMD instructions are just as fast as floating point SIMD instructions. Because of this high throughput, no speed advantage can be gained by switching to integer operations, whether SIMD or even scalar (SISD). This assumes, of course, that there are no dependencies between the instructions.

In addition to the throughput, the latency of the instructions must also be taken into account: the integer SIMD ADD has a maximum latency of 3 cycles, while for the floating-point FADD it is 4 cycles. The non-SIMD add, on the other hand, has only one cycle of latency. The latency indicates the number of cycles after which the result is available at the earliest. If the following instruction depends on the result of the previous one, the achievable throughput drops, and it can be useful to interleave other instructions that use other pipelines, for example the non-SIMD ALU ones.

At least that's the case with the Cortex-A72 and Cortex-A76. With the older Cortex-A55 it's a bit more complicated. You can find information in the respective "Software Optimization Guide", for example:

  • Arm® Cortex®-A55 Software Optimization Guide
  • Arm® Cortex®-A72 Software Optimization Guide
  • Arm® Cortex®-A76 Software Optimization Guide

Clarification after some comments: Scalar operations on SIMD registers (using s0 to s31, d0 to d31, etc.) and vector operations on them (v0 to v31) always take place on the two SIMD pipelines. Only operations on the general-purpose registers (w0 to w30 / x0 to x30, the zero registers wzr/xzr, and the stack pointer sp) run on the two non-SIMD ALU pipelines I0/I1 and the M pipeline. That's why, in some cases, one of the ALU pipelines I0/I1 is also used for the address calculation of SIMD load/store instructions.
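
To make the throughput discussion above concrete, here is a minimal sketch using the standard NEON intrinsics from <arm_neon.h>; vaddq_s32 and vaddq_f32 correspond to the integer and floating-point SIMD ADD/FADD mentioned above (the function names are mine):

    #include <arm_neon.h>

    /* Four 32-bit integer adds and four single-precision float adds per
       instruction; on a Cortex-A72/A76 both issue on the two SIMD pipelines
       with the same throughput, although FADD has the longer (4-cycle) latency. */
    int32x4_t   add_int4(int32x4_t a, int32x4_t b)       { return vaddq_s32(a, b); }
    float32x4_t add_float4(float32x4_t a, float32x4_t b) { return vaddq_f32(a, b); }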

Why do certain floating point calculations turn out the way they do? (e.g. 123456789f + 1 = 123456792)

Typical modern processors and programming languages use IEEE-754 arithmetic (more or less) with 32-bit binary floating-point for float and 64-bit binary floating-point for double. In double, a 53-bit significand is used. This means that, when a decimal numeral is converted to double, it is converted to some number s • f • 2^e, where s is a sign (+1 or −1), f is an unsigned integer that can be represented in 53 bits, and e is an integer between −1074 and 971, inclusive. (Or, if the number being converted is too large, the result can be +infinity or -infinity.) (Those who know the floating-point format may complain that the exponent is properly between −1023 and 1023, but I have shifted the significand to make it an integer. I am describing the mathematical value, not the encoding.)

Converting .1 to double yields 3602879701896397 / 36028797018963968, because, of all the numbers in the required form, that one is closest to .1. The denominator is 2^55, so e is −55.

When we add two of these, we get 7205759403792794 / 36028797018963968. That is fine, the numerator is still less than 2^53, so it fits in the format.

When we add a third 3602879701896397 / 36028797018963968, the mathematical result is 10808639105689191 / 36028797018963968. Unfortunately, the numerator is too large; it is larger than 2^53 (9007199254740992). So the floating-point hardware cannot return that number. It has to make it fit somehow.

If we divide the numerator and the denominator by two, we have 5404319552844595.5 / 18014398509481984. This has the same value, but the numerator is not an integer. To make it fit, the hardware rounds it to an integer. When the fraction is exactly 1/2, the rule is to round to make the result even, so the hardware returns 5404319552844596 / 18014398509481984.

Next, we take the current sum, 5404319552844596 / 18014398509481984, and add 3602879701896397 / 36028797018963968 again. This time, the sum is 7205759403792794.5 / 18014398509481984. In this case, the hardware rounds down, returning 7205759403792794 / 18014398509481984.

Then we add 7205759403792794 / 18014398509481984 and 3602879701896397 / 36028797018963968, and the sum is 9007199254740992.5 / 18014398509481984. Note that the numerator not only has a fraction but is larger than 2^53. So we have to reduce it again, which produces 4503599627370496.25 / 9007199254740992. Rounding the numerator to an integer produces 4503599627370496 / 9007199254740992.

That is exactly 1/2. At this point, the rounding errors have coincidentally canceled; adding .1 five times yields exactly .5.

When we add 4503599627370496 / 9007199254740992 and 3602879701896397 / 36028797018963968, the result is exactly 5404319552844595.25 / 9007199254740992. The hardware rounds down and returns 5404319552844595 / 9007199254740992.

Now you can see we are going to round down repeatedly. To add 3602879701896397 / 36028797018963968 to the accumulating sum, the hardware has to divide its numerator by four to make it match. That means the fraction part is always going to be .25, and it will be rounded down. So the next four sums are also rounded down. We end up with 9007199254740991 / 9007199254740992, which is just less than 1.

With float instead of double, the numerator has to fit in 24 bits, so it has to be less than 2^24 (16777216). So 123456789 is too big even before any arithmetic is done. It has to be expressed as 15432099 • 2^3, which is 123456792. The exact mathematical result of adding 1 is 15432099.125 • 2^3, and rounding that significand to an integer yields 15432099 • 2^3, so there is no change. But, if you add four, the result is 15432099.5 • 2^3, and that rounds to 15432100 • 2^3.
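
If you want to watch both effects happen, a small C program like this (assuming IEEE-754 float and double, which is the usual case) reproduces them:

    #include <stdio.h>

    int main(void)
    {
        double sum = 0.0;
        for (int i = 0; i < 10; i++)
            sum += 0.1;
        printf("%.17g\n", sum);              /* 0.99999999999999989, just under 1 */

        float f = 123456789.0f;              /* stored as 123456792 */
        float g = f + 1.0f;                  /* rounds back down: 123456792 */
        float h = f + 4.0f;                  /* halfway case, rounds to even: 123456800 */
        printf("%.1f %.1f %.1f\n", f, g, h);
        return 0;
    }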

How to do floating point calculations with integers

You are saying emulation is too slow. I guess you mean emulation of floating point. The only remaining alternative, if scaled integers are not sufficient, is fixed point math, but it's not exactly fast either, even though it's much faster than emulated floating point.

Also, you are never going to escape the fact that with both scaled integers and fixed point math, you get less dynamic range than with floating point.

However, if your range is known in advance, the fixed point math implementation can be tuned for the range you need.

Here is an article on fixed point. The gist of the trick is deciding how to split the variable: how many bits go to the integer part of the number and how many to the fractional part.
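
As a minimal sketch of such a split (a 16.16 layout is assumed here purely for illustration; a real library would also deal with rounding, saturation, and signed shifts more carefully):

    #include <stdint.h>

    typedef int32_t fix16_16;                  /* 16 integer bits, 16 fraction bits */

    #define FIX_ONE (1 << 16)                  /* 1.0 in 16.16 */

    /* Addition and subtraction are just ordinary integer + and -. */
    static fix16_16 fix_from_int(int x)        { return (fix16_16)(x * FIX_ONE); }
    static double   fix_to_double(fix16_16 x)  { return x / 65536.0; }

    /* Multiply/divide in a 64-bit intermediate so the extra fraction bits
       are not lost before shifting back to 16.16. */
    static fix16_16 fix_mul(fix16_16 a, fix16_16 b) { return (fix16_16)(((int64_t)a * b) >> 16); }
    static fix16_16 fix_div(fix16_16 a, fix16_16 b) { return (fix16_16)((((int64_t)a) << 16) / b); }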

A full implementation of fixed point for C can be found here. (BSD license.) There are others.

Is fixed point math faster than floating point?

Fixed point is marginally useful on platforms that do not support any kind of decimal type of their own; for example, I implemented a 24-bit fixed point type for the PIC16F series microcontrollers (more on why I chose fixed point later).

However, almost every modern CPU supports floating point at the microcode or hardware level, so there isn't much need for fixed point.

Fixed point numbers are limited in the range they can represent - consider a 64-bit (32.32) fixed point vs. a 64-bit floating point: the 64-bit fixed point number has a fractional resolution of 1/(2^32), while the floating point number has a relative resolution of up to 1/(2^53); the fixed point number can only represent values as high as 2^31, while the floating point number can represent finite values up to nearly 2^1024. And if you need more, most modern x86 CPUs support 80-bit extended-precision floating point values.
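
A quick way to see those limits from C, using the standard constants in <float.h> (a minimal sketch; the 32.32 step size is simply 1/2^32):

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        printf("32.32 fixed point step size: %.20g\n", 1.0 / 4294967296.0); /* 1/2^32 */
        printf("double relative epsilon    : %.20g\n", DBL_EPSILON);        /* 2^-52 */
        printf("double maximum value       : %g\n",    DBL_MAX);            /* ~1.8e308 */
        return 0;
    }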

Of course, the biggest downfall of floating point is limited precision in extreme cases - e.g. in fixed point, it would require fewer bits to represent 9000000000000000000000000000000.00000000000000000000000000000002. Of course, with floating point, you get better precision for average uses of decimal arithmetic, and I have yet to see an application where decimal arithmetic is as extreme as the above example yet also does not overflow the equivalent fixed-point size.

The reason I implemented a fixed-point library for the PIC16F rather than use an existing floating point library was code size, not speed: the 16F88 has 384 bytes of usable RAM and room for 4095 instructions total. To add two fixed point numbers of predefined width, I inlined integer addition with carry-out in my code (the radix point doesn't move for addition anyway); to multiply two fixed point numbers, I used a simple shift-and-add function with an extended 32-bit fixed point intermediate, even though that isn't the fastest multiplication approach, in order to save even more code.
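
In C, the shift-and-add idea looks roughly like this (a hedged sketch, not the actual PIC code; an unsigned 8.8 format is assumed just to keep the widths small):

    #include <stdint.h>

    /* Multiply two unsigned 8.8 fixed point values using only shifts and adds,
       accumulating in a 32-bit (16.16) intermediate and shifting back to 8.8.
       Overflow of the 8.8 result is simply truncated here. */
    uint16_t fix88_mul(uint16_t a, uint16_t b)
    {
        uint32_t acc = 0;
        uint32_t addend = a;
        while (b) {
            if (b & 1)
                acc += addend;
            addend <<= 1;
            b >>= 1;
        }
        return (uint16_t)(acc >> 8);   /* drop the extra 8 fraction bits */
    }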

So, when I needed only one or two basic arithmetic operations, I could add them without using up all of the program storage. For comparison, a freely available floating point library on that platform was about 60% of the total storage on the device. In contrast, software floating point libraries are mostly just wrappers around a few arithmetic operations, and in my experience they are all-or-nothing, so cutting the code size in half because you only need half of the functions doesn't work so well.

Fixed point generally doesn't provide much of an advantage in speed though, because of its limited representation range: how many bits would you need to represent 1.7E+/-308 with 15 digits of precision, the same as a 64-bit double? If my calculations are correct, you'd need somewhere around 2020 bits. I'd bet the performance of that wouldn't be so good.

Thirty years ago, when hardware floating point was relatively rare, very special-purpose fixed-point (or even scaled-integer) arithmetic could provide significant gains in performance over software-based floating point, but only if the allowable range of values could be efficiently represented with scaled-integer arithmetic. (The original Doom used this approach when no coprocessor was available, such as on my 486sx-25 in 1992. Typing this on an overclocked hyperthreaded Core i7 running at 4.0GHz, with a GeForce card that has over 1000 independent floating point compute units, it just seems wrong somehow, although I'm not sure which - the 486, or the i7...)

Floating point is more general purpose due to the range of values it can represent, and with it implemented in hardware on both CPUs and GPUs, it beats fixed point in every way, unless you really need more than 80-bit floating point precision at the expense of huge fixed-point sizes and very slow code.


