Should I Use Double or Float

Should I use double or float?

If you want to know the true answer, you should read What Every Computer Scientist Should Know About Floating-Point Arithmetic.

In short, although double allows for higher precision in its representation, for certain calculations it would produce larger errors. The "right" choice is: use as much precision as you need but not more and choose the right algorithm.

Many compilers do extended floating point math in "non-strict" mode anyway (i.e. use a wider floating point type available in hardware, e.g. 80-bits and 128-bits floating), this should be taken into account as well. In practice, you can hardly see any difference in speed -- they are natives to hardware anyway.

Why would you use float over double, or double over long double?

In nearly all processors, "smaller" floating point numbers take the same or less clock-cycles in execution. Sometimes the difference isn't very big (or nothing), other times it can be literally twice the number of cycles for double vs. float.

Of course, memory foot-print, which is affecting cache-usage, will also be a factor. float takes half the size of double, and long double is bigger yet.

Edit: Another side-effect of smaller size is that the processor's SIMD extensions (3DNow!, SSE, AVX in x86, and similar extensions are available in several other architectures) may either only work with float, or can take twice as many float vs. double (and as far as I know, no SIMD instructions are available for long double in any processor). So this may improve performance if float is used vs. double, by processing twice as much data in one go. End edit.

So, assuming 6-7 digits of precision is good enough for what you need, and the range of +/-10+/-38 is sufficient, then float should be used. If you need either more digits in the number, or a bigger range, move to double, and if that's not good enough, use long double. But for most things, double should be perfectly adequate.

Obviously, the importance of using "the right size" becomes more important when you have either lots of calculations, or lots of data to work with - if there are 5 variables, and you just use each a couple of times in a program that does a million other things, who cares? If you are doing fluid dynamics calculations for how well a Formula 1 car is doing at 200 mph, then you probably have several tens of million datapoints to calculate, and every data point needs to be calculated dozens of times per second of the cars travel, then using up just a few clockcycles extra in each calculation will make the whole simulation take noticeably longer.

Why not use Double or Float to represent currency?

Because floats and doubles cannot accurately represent the base 10 multiples that we use for money. This issue isn't just for Java, it's for any programming language that uses base 2 floating-point types.

In base 10, you can write 10.25 as 1025 * 10-2 (an integer times a power of 10). IEEE-754 floating-point numbers are different, but a very simple way to think about them is to multiply by a power of two instead. For instance, you could be looking at 164 * 2-4 (an integer times a power of two), which is also equal to 10.25. That's not how the numbers are represented in memory, but the math implications are the same.

Even in base 10, this notation cannot accurately represent most simple fractions. For instance, you can't represent 1/3: the decimal representation is repeating (0.3333...), so there is no finite integer that you can multiply by a power of 10 to get 1/3. You could settle on a long sequence of 3's and a small exponent, like 333333333 * 10-10, but it is not accurate: if you multiply that by 3, you won't get 1.

However, for the purpose of counting money, at least for countries whose money is valued within an order of magnitude of the US dollar, usually all you need is to be able to store multiples of 10-2, so it doesn't really matter that 1/3 can't be represented.

The problem with floats and doubles is that the vast majority of money-like numbers don't have an exact representation as an integer times a power of 2. In fact, the only multiples of 0.01 between 0 and 1 (which are significant when dealing with money because they're integer cents) that can be represented exactly as an IEEE-754 binary floating-point number are 0, 0.25, 0.5, 0.75 and 1. All the others are off by a small amount. As an analogy to the 0.333333 example, if you take the floating-point value for 0.01 and you multiply it by 10, you won't get 0.1. Instead you will get something like 0.099999999786...

Representing money as a double or float will probably look good at first as the software rounds off the tiny errors, but as you perform more additions, subtractions, multiplications and divisions on inexact numbers, errors will compound and you'll end up with values that are visibly not accurate. This makes floats and doubles inadequate for dealing with money, where perfect accuracy for multiples of base 10 powers is required.

A solution that works in just about any language is to use integers instead, and count cents. For instance, 1025 would be $10.25. Several languages also have built-in types to deal with money. Among others, Java has the BigDecimal class, and Rust has the rust_decimal crate, and C# has the decimal type.

Float and double datatype in Java

The Wikipedia page on it is a good place to start.

To sum up:

  • float is represented in 32 bits, with 1 sign bit, 8 bits of exponent, and 23 bits of the significand (or what follows from a scientific-notation number: 2.33728*1012; 33728 is the significand).

  • double is represented in 64 bits, with 1 sign bit, 11 bits of exponent, and 52 bits of significand.

By default, Java uses double to represent its floating-point numerals (so a literal 3.14 is typed double). It's also the data type that will give you a much larger number range, so I would strongly encourage its use over float.

There may be certain libraries that actually force your usage of float, but in general - unless you can guarantee that your result will be small enough to fit in float's prescribed range, then it's best to opt with double.

If you require accuracy - for instance, you can't have a decimal value that is inaccurate (like 1/10 + 2/10), or you're doing anything with currency (for example, representing $10.33 in the system), then use a BigDecimal, which can support an arbitrary amount of precision and handle situations like that elegantly.

Why are double preferred over float?

In my opinion the answers so far don't really get the right point across, so here's my crack at it.

The short answer is C++ developers use doubles over floats:

  • To avoid premature optimization when they don't understand the performance trade-offs well ("they have higher precision, why not?" Is the thought process)
  • Habit
  • Culture
  • To match library function signatures
  • To match simple-to-write floating point literals (you can write 0.0 instead of 0.0f)

It's true double may be as fast as a float for a single computation because most FPUs have a wider internal representation than either the 32-bit float or 64-bit double represent.

However that's only a small piece of the picture. Now-days operational optimizations don't mean anything if you're bottle necked on cache/memory bandwidth.

Here is why some developers seeking to optimize their code should look into using 32-bit floats over 64-bit doubles:

  • They fit in half the memory. Which is like having all your caches be twice as large. (big win!!!)
  • If you really care about performance you'll use SSE instructions. SSE instructions that operate on floating point values have different instructions for 32-bit and 64-bit floating point representations. The 32-bit versions can fit 4 values in the 128-bit register operands, but the 64-bit versions can only fit 2 values. In this scenario you can likely double your FLOPS by using floats over double because each instruction operates on twice as much data.

In general, there is a real lack of knowledge of how floating point numbers really work in the majority of developers I've encountered. So I'm not really surprised most developers blindly use double.

What is the difference between float and double?

Huge difference.

As the name implies, a double has 2x the precision of float[1]. In general a double has 15 decimal digits of precision, while float has 7.

Here's how the number of digits are calculated:

double has 52 mantissa bits + 1 hidden bit: log(253)÷log(10) = 15.95 digits

float has 23 mantissa bits + 1 hidden bit: log(224)÷log(10) = 7.22 digits

This precision loss could lead to greater truncation errors being accumulated when repeated calculations are done, e.g.

float a = 1.f / 81;
float b = 0;
for (int i = 0; i < 729; ++ i)
b += a;
printf("%.7g\n", b); // prints 9.000023

while

double a = 1.0 / 81;
double b = 0;
for (int i = 0; i < 729; ++ i)
b += a;
printf("%.15g\n", b); // prints 8.99999999999996

Also, the maximum value of float is about 3e38, but double is about 1.7e308, so using float can hit "infinity" (i.e. a special floating-point number) much more easily than double for something simple, e.g. computing the factorial of 60.

During testing, maybe a few test cases contain these huge numbers, which may cause your programs to fail if you use floats.


Of course, sometimes, even double isn't accurate enough, hence we sometimes have long double[1] (the above example gives 9.000000000000000066 on Mac), but all floating point types suffer from round-off errors, so if precision is very important (e.g. money processing) you should use int or a fraction class.


Furthermore, don't use += to sum lots of floating point numbers, as the errors accumulate quickly. If you're using Python, use fsum. Otherwise, try to implement the Kahan summation algorithm.


[1]: The C and C++ standards do not specify the representation of float, double and long double. It is possible that all three are implemented as IEEE double-precision. Nevertheless, for most architectures (gcc, MSVC; x86, x64, ARM) float is indeed a IEEE single-precision floating point number (binary32), and double is a IEEE double-precision floating point number (binary64).

When should I use double instead of decimal?

I think you've summarised the advantages quite well. You are however missing one point. The decimal type is only more accurate at representing base 10 numbers (e.g. those used in currency/financial calculations). In general, the double type is going to offer at least as great precision (someone correct me if I'm wrong) and definitely greater speed for arbitrary real numbers. The simple conclusion is: when considering which to use, always use double unless you need the base 10 accuracy that decimal offers.

Edit:

Regarding your additional question about the decrease in accuracy of floating-point numbers after operations, this is a slightly more subtle issue. Indeed, precision (I use the term interchangeably for accuracy here) will steadily decrease after each operation is performed. This is due to two reasons:

  1. the fact that certain numbers (most obviously decimals) can't be truly represented in floating point form
  2. rounding errors occur, just as if you were doing the calculation by hand. It depends greatly on the context (how many operations you're performing) whether these errors are significant enough to warrant much thought however.

In all cases, if you want to compare two floating-point numbers that should in theory be equivalent (but were arrived at using different calculations), you need to allow a certain degree of tolerance (how much varies, but is typically very small).

For a more detailed overview of the particular cases where errors in accuracies can be introduced, see the Accuracy section of the Wikipedia article. Finally, if you want a seriously in-depth (and mathematical) discussion of floating-point numbers/operations at machine level, try reading the oft-quoted article What Every Computer Scientist Should Know About Floating-Point Arithmetic.



Related Topics



Leave a reply



Submit