Difference Between Float and Double

What is the difference between float and double?

Huge difference.

As the name implies, a double has 2x the precision of float[1]. In general, a double has 15 decimal digits of precision, while a float has 7.

Here's how the number of digits is calculated:

double has 52 mantissa bits + 1 hidden bit: log(2^53) ÷ log(10) = 15.95 digits

float has 23 mantissa bits + 1 hidden bit: log(2^24) ÷ log(10) = 7.22 digits
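
These figures are easy to reproduce; here is a minimal C sketch (just a sanity check using the standard math library, not part of the original answer):

#include <stdio.h>
#include <math.h>

int main(void) {
    // 53 and 24 significand bits, counting the hidden bit
    printf("double: %.2f decimal digits\n", 53 * log10(2.0)); // 15.95
    printf("float:  %.2f decimal digits\n", 24 * log10(2.0)); // 7.22
    return 0;
}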

This precision loss can lead to larger truncation errors accumulating when repeated calculations are performed, e.g.

float a = 1.f / 81;
float b = 0;
for (int i = 0; i < 729; ++i)
    b += a;
printf("%.7g\n", b); // prints 9.000023

while

double a = 1.0 / 81;
double b = 0;
for (int i = 0; i < 729; ++i)
    b += a;
printf("%.15g\n", b); // prints 8.99999999999996

Also, the maximum value of float is about 3e38, but double is about 1.7e308, so using float can hit "infinity" (i.e. a special floating-point number) much more easily than double for something simple, e.g. computing the factorial of 60.
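
To make that concrete, here is a hedged sketch of the factorial-of-60 computation (the loop and output formats are our own, not from the original answer):

#include <stdio.h>

int main(void) {
    float  f = 1.0f;
    double d = 1.0;
    for (int i = 1; i <= 60; ++i) {
        f *= i;  // overflows to inf around 35!, since 35! > FLT_MAX (~3.4e38)
        d *= i;  // 60! ~ 8.3e81 is still far below DBL_MAX (~1.8e308)
    }
    printf("float:  %g\n", f);  // prints inf
    printf("double: %g\n", d);  // prints roughly 8.32099e+81
    return 0;
}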

During testing, a few test cases may contain these huge numbers, which can cause your program to fail if you use float.


Of course, sometimes even double isn't accurate enough; hence we also have long double[1] (the above example gives 9.000000000000000066 on a Mac). However, all floating point types suffer from round-off errors, so if precision is very important (e.g. money processing) you should use int or a fraction class.
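
As a minimal sketch of the integer approach, assuming amounts are tracked in whole cents (the variable names are ours):

#include <stdio.h>

int main(void) {
    double dollars = 0.0;
    long   cents   = 0;
    for (int i = 0; i < 10; ++i) {
        dollars += 0.10;  // each 0.10 is already a binary approximation
        cents   += 10;    // exact integer arithmetic
    }
    printf("%d\n", dollars == 1.00);  // prints 0: the double sum is not exactly 1
    printf("%d\n", cents == 100);     // prints 1
    return 0;
}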


Furthermore, don't use += to sum lots of floating point numbers, as the errors accumulate quickly. If you're using Python, use math.fsum. Otherwise, try to implement the Kahan summation algorithm.
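
C has no standard equivalent of fsum, so here is a minimal Kahan summation sketch applied to the 1/81 example above (the function name kahan_sum is ours):

#include <stdio.h>

// Compensated (Kahan) summation: c carries the low-order bits
// that a plain += would discard on each step.
float kahan_sum(const float *x, int n) {
    float sum = 0.0f, c = 0.0f;
    for (int i = 0; i < n; ++i) {
        float y = x[i] - c;  // apply the correction from the previous step
        float t = sum + y;   // low-order bits of y are lost here...
        c = (t - sum) - y;   // ...and recovered into c for the next step
        sum = t;
    }
    return sum;
}

int main(void) {
    float a[729];
    for (int i = 0; i < 729; ++i) a[i] = 1.f / 81;
    printf("%.7g\n", kahan_sum(a, 729));  // prints 9 instead of 9.000023
    return 0;
}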


[1]: The C and C++ standards do not specify the representation of float, double and long double. It is possible that all three are implemented as IEEE double-precision. Nevertheless, for most compilers and architectures (gcc, MSVC; x86, x64, ARM) float is indeed an IEEE single-precision floating point number (binary32), and double is an IEEE double-precision floating point number (binary64).

What exactly is the difference between 'float' and 'double' floating point storage types?

Floating point is a way of representing numbers that is used in computers; it is explained in detail elsewhere.
Double just uses more bits than float, so double has more precision and range.

In a CSV file, numbers are stored as text, for example using the five characters "27.37". Floating point is a way of representing numbers that is used internally by computers, so the numbers you have in your CSV file are not floating-point numbers at all. They are neither floats nor doubles.

When these numbers are to be processed by a computer, the text format is (typically) converted to an internal format, usually either float or double. You can't tell which by looking at the text version of the number, since the text itself is neither. You have to decide based on the precision and speed you need. In most cases I would recommend using double, since doubles have higher precision and are very fast on typical modern computers. You can save some space, and sometimes gain some speed, by using float, but that is only needed in exceptional cases.

But, to contradict myself: In some cases you can look at a number written as text, and determine whether it can be stored as a float or a double. For example, if you find the number "0.333333333333333314829616256247390992939472198486328125" written just like that, it isn't a float or a double in itself, but this particular number, with all those decimals, can be stored as a double but not as a float. If stored as a float, with its fewer bits, it would be converted to "0.3333333432674407958984375".
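
A small C illustration of that conversion, assuming correctly rounded strtod/strtof (which mainstream IEEE 754 platforms provide):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const char *text =
        "0.333333333333333314829616256247390992939472198486328125";
    double d = strtod(text, NULL);  // this decimal happens to be exactly a double
    float  f = strtof(text, NULL);  // rounded again, to the nearest float
    printf("%.54f\n", d);  // prints the text back unchanged
    printf("%.25f\n", f);  // prints 0.3333333432674407958984375
    return 0;
}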

MySQL: What's the difference between float and double?

They both represent floating point numbers. A FLOAT is for single-precision, while a DOUBLE is for double-precision numbers.

MySQL uses four bytes for single-precision values and eight bytes for double-precision values.

There is a big difference between floating point numbers and decimal (numeric) numbers, which you can use with the DECIMAL data type. DECIMAL stores exact numeric data values, unlike the floating point types, and is the right choice when it is important to preserve exact precision, for example with monetary data.

What's the difference between float and double?

float is 32-bit while double is 64-bit. A float has fewer significant digits than double.

A float doesn't store enough significant digits to hold the 10 digits of your 10000000.01.
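
A quick check in C (Objective-C accepts the same code; the values and formats here are only for illustration):

#include <stdio.h>

int main(void) {
    float  f = 10000000.01f;  // near 1e7 the spacing between adjacent floats is 1.0
    double d = 10000000.01;
    printf("%.2f\n", f);  // prints 10000000.00 -- the .01 is lost
    printf("%.2f\n", d);  // prints 10000000.01
    return 0;
}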

Also see Difference between float and double for more details. That is about C/C++ but it applies to Objective-C as well.

Difference between decimal, float and double in .NET?

float and double are floating binary point types (float is 32-bit; double is 64-bit). In other words, they represent a number like this:

10001.10010110011

The binary number and the location of the binary point are both encoded within the value.

decimal is a floating decimal point type. In other words, they represent a number like this:

12345.65789

Again, the number and the location of the decimal point are both encoded within the value – that's what makes decimal still a floating point type instead of a fixed point type.

The important thing to note is that humans are used to representing non-integers in a decimal form, and expect exact results in decimal representations; not all decimal numbers are exactly representable in binary floating point – 0.1, for example – so if you use a binary floating point value you'll actually get an approximation to 0.1. You'll still get approximations when using a floating decimal point as well – the result of dividing 1 by 3 can't be exactly represented, for example.
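
Although this answer is about .NET, the binary half of that point can be shown in plain C by printing more digits than the stored approximation really has (the digit counts below are just for illustration):

#include <stdio.h>

int main(void) {
    // The literal 0.1 cannot be represented exactly in binary floating point;
    // printing extra digits exposes the approximation that is actually stored.
    printf("%.20f\n", 0.1f);  // 0.10000000149011611938...
    printf("%.20f\n", 0.1);   // 0.10000000000000000555...
    return 0;
}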

As for what to use when:

  • For values which are "naturally exact decimals" it's good to use decimal. This is usually suitable for any concepts invented by humans: financial values are the most obvious example, but there are others too. Consider the score given to divers or ice skaters, for example.

  • For values which are more artefacts of nature which can't really be measured exactly anyway, float/double are more appropriate. For example, scientific data would usually be represented in this form. Here, the original values won't be "decimally accurate" to start with, so it's not important for the expected results to maintain the "decimal accuracy". Floating binary point types are much faster to work with than decimals.

Float and double datatype in Java

The Wikipedia page on it is a good place to start.

To sum up:

  • float is represented in 32 bits, with 1 sign bit, 8 bits of exponent, and 23 bits of significand (by analogy with scientific notation: in 2.33728*10^12, the digits 2.33728 form the significand).

  • double is represented in 64 bits, with 1 sign bit, 11 bits of exponent, and 52 bits of significand (both layouts can be inspected directly, as the sketch below shows).
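
As a hedged illustration, plain C can pull those fields out of a float; the layout is the same IEEE 754 binary32 that Java's float uses, and the shifts and masks below simply encode the 1/8/23 split:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = 3.14f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  // reinterpret the 32 bits of the float
    printf("sign     = %u\n", (unsigned)(bits >> 31));            // 1 bit
    printf("exponent = %u\n", (unsigned)((bits >> 23) & 0xFFu));  // 8 bits (biased by 127)
    printf("fraction = 0x%06X\n", (unsigned)(bits & 0x7FFFFFu));  // 23 bits
    return 0;
}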

By default, Java uses double to represent its floating-point numerals (so a literal 3.14 is typed double). It's also the data type that will give you a much larger number range, so I would strongly encourage its use over float.

There may be certain libraries that actually force your usage of float, but in general, unless you can guarantee that your result will be small enough to fit in float's prescribed range, it's best to opt for double.

If you require exact results - for instance, you can't have a decimal value that is inaccurate (like 1/10 + 2/10), or you're doing anything with currency (for example, representing $10.33 in the system) - then use a BigDecimal, which supports an arbitrary amount of precision and handles situations like that elegantly.

Should I use double or float?

If you want to know the true answer, you should read What Every Computer Scientist Should Know About Floating-Point Arithmetic.

In short, although double allows for higher precision in its representation, for certain calculations it would produce larger errors. The "right" choice is: use as much precision as you need but not more and choose the right algorithm.

Many compilers do extended floating point math in "non-strict" mode anyway (i.e. they use a wider floating point type available in hardware, e.g. 80-bit or 128-bit formats), and this should be taken into account as well. In practice, you can hardly see any difference in speed: both types are native to the hardware anyway.

Difference between double and float in floating point accuracy

The Cases of 0.8−0.7

In 0.8-0.7 == 0.1, none of the literals are exactly representable in double. The nearest representable values are 0.8000000000000000444089209850062616169452667236328125 for .8, 0.6999999999999999555910790149937383830547332763671875 for .7, and 0.1000000000000000055511151231257827021181583404541015625 for .1. When the first two are subtracted, the result is 0.100000000000000088817841970012523233890533447265625. As this is not equal to the third, 0.8-0.7 == 0.1 evaluates to false.

In (float)(0.8-0.7) == (float)(0.1), the result of 0.8-0.7 and 0.1 are each converted to float. The float value nearest to the former, 0.1000000000000000055511151231257827021181583404541015625, is 0.100000001490116119384765625. The float value nearest to the latter, 0.100000000000000088817841970012523233890533447265625, is 0.100000001490116119384765625. Since these are the same, (float)(0.8-0.7) == (float)(0.1) evaluates to true.

In (double)(0.8-0.7) == (double)(0.1), the result of 0.8-0.7 and 0.1 are each converted to double. Since they are already double, there is no effect, and the result is the same as for 0.8-0.7 == 0.1.
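
For reference, the same three comparisons can be reproduced in C; the results below assume strict IEEE double evaluation with no excess intermediate precision (see the notes that follow):

#include <stdio.h>

int main(void) {
    printf("%d\n", 0.8 - 0.7 == 0.1);                    // 0 (false)
    printf("%d\n", (float)(0.8 - 0.7) == (float)0.1);    // 1 (true)
    printf("%d\n", (double)(0.8 - 0.7) == (double)0.1);  // 0 (false)
    return 0;
}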

Notes

The C# specification, version 5.0 indicates that float and double are the IEEE-754 32-bit and 64-bit floating-point types. I do not see it explicitly state they are the binary floating-point formats rather than decimal formats, but the characteristics described make this evident. The specification also states that IEEE-754 arithmetic is generally used, with round-to-nearest (presumably round-to-nearest-ties-to-even), subject to the exception below.

The C# specification allows floating-point arithmetic to be performed with more precision than the nominal type. Clause 4.1.6 says “… Floating-point operations may be performed with higher precision than the result type of the operation…” This can complicate analysis of floating-point expressions in general, but it does not concern us in the instance of 0.8-0.7 == 0.1 because the only applicable operation is the subtraction of 0.7 from 0.8, and these numbers are in the same binade (have the same power of two in the floating-point representation), so the result of the subtraction is exactly representable and additional precision will not change the result. As long as the conversion of the source texts 0.8, 0.7, and 0.1 to double does not use extra precision and the cast to float produces a float with no extra precision, the results will be as stated above. (The C# standard says in clause 6.2.1 that a conversion from double to float yields a float value, although it does not explicitly state that no extra precision may be used at this point.)

Additional Cases

In 8-0.7 == 7.3, we have 8 for 8, 7.29999999999999982236431605997495353221893310546875 for 7.3, 0.6999999999999999555910790149937383830547332763671875 for 0.7, and 7.29999999999999982236431605997495353221893310546875 for 8-0.7, so the result is true.

Note that the additional precision allowed by the C# specification could affect the result of 8-0.7. A C# implementation that used extra precision for this operation could produce false for this case, as it would get a different result for 8-0.7.

In 18.01-0.7 == 17.31, we have 18.010000000000001563194018672220408916473388671875 for 18.01, 0.6999999999999999555910790149937383830547332763671875 for 0.7, 17.309999999999998721023075631819665431976318359375 for 17.31, and 17.31000000000000227373675443232059478759765625 for 18.01-0.7, so the result is false.

How is subtracting from 8 different from subtracting from 18.01, when the same floating point number is subtracted from both?

18.01 is larger than 8 and requires a greater power of two in its floating-point representation. Similarly, the result of 18.01-0.7 is larger than that of 8-0.7. This means the bits in their significands (the fraction portion of the floating-point representation, which is scaled by the power of two) represent greater values, causing the rounding errors in the floating-point operations to be generally greater. In general, a floating-point format has a fixed span—there is a fixed distance from the high bit retained to the low bit retained. When you change to numbers with more bits on the left (high bits), some bits on the right (low bits) are pushed out, and the results change.
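
One way to see that change in the power of two is C's %a conversion, which prints the binary exponent directly (only the exponents are annotated below, since the exact digits printed may be formatted differently across platforms):

#include <stdio.h>

int main(void) {
    printf("%a\n", 8.0 - 0.7);    // exponent ends in p+2: the result lies in [4, 8)
    printf("%a\n", 18.01 - 0.7);  // exponent ends in p+4: the result lies in [16, 32)
    return 0;
}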


