How to Calculate Float Type Precision and Does It Make Sense

Is floating point math broken?

Binary floating point math is like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be exactly represented.

For 0.1 in the standard binary64 format, the representation can be written exactly as

  • 0.1000000000000000055511151231257827021181583404541015625 in decimal, or
  • 0x1.999999999999ap-4 in C99 hexfloat notation.

In contrast, the rational number 0.1, which is 1/10, can be written exactly as

  • 0.1 in decimal, or
  • 0x1.99999999999999...p-4 in an analogue of C99 hexfloat notation, where the ... represents an unending sequence of 9's.

The constants 0.2 and 0.3 in your program will also be approximations to their true values. It happens that the closest double to 0.2 is larger than the rational number 0.2 but that the closest double to 0.3 is smaller than the rational number 0.3. The sum of 0.1 and 0.2 winds up being larger than the rational number 0.3 and hence disagreeing with the constant in your code.

A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For an easier-to-digest explanation, see floating-point-gui.de.

Side Note: All positional (base-N) number systems share this problem with precision

Plain old decimal (base 10) numbers have the same issues, which is why numbers like 1/3 end up as 0.333333333...

You've just stumbled on a number (3/10) that happens to be easy to represent with the decimal system, but doesn't fit the binary system. It goes both ways (to some small degree) as well: 1/16 is an ugly number in decimal (0.0625), but in binary it looks as neat as a 10,000th does in decimal (0.0001)** - if we were in the habit of using a base-2 number system in our daily lives, you'd even look at that number and instinctively understand you could arrive there by halving something, halving it again, and again and again.

** Of course, that's not exactly how floating-point numbers are stored in memory (they use a form of scientific notation). However, it does illustrate the point that binary floating-point precision errors tend to crop up because the "real world" numbers we are usually interested in working with are so often powers of ten - but only because we use a decimal number system day-to-day. This is also why we'll say things like 71% instead of "5 out of every 7" (71% is an approximation, since 5/7 can't be represented exactly with any decimal number).

So no: binary floating point numbers are not broken, they just happen to be as imperfect as every other base-N number system :)

Side Side Note: Working with Floats in Programming

In practice, this problem of precision means you need to use rounding functions to round your floating point numbers off to however many decimal places you're interested in before you display them.

You also need to replace equality tests with comparisons that allow some amount of tolerance, which means:

Do not do if (x == y) { ... }

Instead do if (abs(x - y) < myToleranceValue) { ... }.

where abs is the absolute value. myToleranceValue needs to be chosen for your particular application - and it will have a lot to do with how much "wiggle room" you are prepared to allow, and what the largest number you are going to be comparing may be (due to loss of precision issues). Beware of "epsilon" style constants in your language of choice. These are not to be used as tolerance values.

Difference between numeric, float and decimal in SQL Server


use the float or real data types only if the precision provided by decimal (up to 38 digits) is insufficient

  • Approximate numeric data types(see table 3.3) do not store the exact values specified for many numbers; they store an extremely close approximation of the value. (Technet)

  • Avoid using float or real columns in WHERE clause search conditions, especially the = and <> operators. It is best to limit float and real columns to > or < comparisons. (Technet)

so generally choosing Decimal as your data type is the best bet if

  • your number can fit in it. Decimal precision is 10E38[~ 38 digits]
  • smaller storage space (and maybe calculation speed) of Float is not important for you
  • exact numeric behavior is required, such as in financial applications, in operations involving rounding, or in equality checks. (Technet)


  1. Exact Numeric Data Types decimal and numeric - MSDN
  • numeric = decimal (5 to 17 bytes)
    • will map to Decimal in .NET
    • both have (18, 0) as default (precision,scale) parameters in SQL server
    • scale = maximum number of decimal digits that can be stored to the right of the decimal point.
    • money(8 byte) and smallmoney(4 byte) are also Exact Data Type and will map to Decimal in .NET and have 4 decimal points (MSDN)

  1. Approximate Numeric Data Types float and real - MSDN
  • real (4 byte)
    • will map to Single in .NET
    • The ISO synonym for real is float(24)
  • float (8 byte)
    • will map to Double in .NET

Exact Numeric Data Types
Approximate Numeric Data Types

  • All exact numeric types always produce the same result, regardless of which kind of processor architecture is being used or the magnitude of the numbers
  • The parameter supplied to the float data type defines the number of bits that are used to store the mantissa of the floating point number.
  • Approximate Numeric Data Type usually uses less storage and have better speed (up to 20x) and you should also consider when they got converted in .NET
  • What is the difference between Decimal, Float and Double in C#
  • Decimal vs Double Speed
  • SQL Server - .NET Data Type Mappings (From MSDN)

main source : MCTS Self-Paced Training Kit (Exam 70-433): Microsoft® SQL Server® 2008 Database Development - Chapter 3 - Tables, Data Types, and Declarative Data Integrity Lesson 1 - Choosing Data Types (Guidelines) - Page 93

Floating point dramatic error (fractional part is completely lost)

float has a precision of 6-9 digits. Your value is just too large to fit into a float without loss.

double has a precision of about 15-17 digits.

As an illustration, inspect the value of (int)43156414f or (double)43156414f - they are both 43156416

Largest value representable by a floating-point type smaller than 1

You can use the std::nextafter function, which, despite its name, can retrieve the next representable value that is arithmetically before a given starting point, by using an appropriate to argument. (Often -Infinity, 0, or +Infinity).

This works portably by definition of nextafter, regardless of what floating-point format your C++ implementation uses. (Binary vs. decimal, or width of mantissa aka significand, or anything else.)

Example: Retrieving the closest value less than 1 for the double type (on Windows, using the clang-cl compiler in Visual Studio 2019), the answer is different from the result of the 1 - ε calculation (which as discussed in comments, is incorrect for IEEE754 numbers; below any power of 2, representable numbers are twice as close together as above it):

#include <iostream>
#include <iomanip>
#include <cmath>
#include <limits>

int main()
{
double naft = std::nextafter(1.0, 0.0);
std::cout << std::fixed << std::setprecision(20);
std::cout << naft << '\n';
double neps = 1.0 - std::numeric_limits<double>::epsilon();
std::cout << neps << '\n';
return 0;
}

Output:

0.99999999999999988898
0.99999999999999977796

With different output formatting, this could print as 0x1.fffffffffffffp-1 and 0x1.ffffffffffffep-1 (1 - ε)


Note that, when using analogous techniques to determine the closest value that is greater than 1, then the nextafter(1.0, 10000.) call gives the same value as the 1 + ε calculation (1.00000000000000022204), as would be expected from the definition of ε.



Performance

C++23 requires std::nextafter to be constexpr, but currently only some compilers support that. GCC does do constant-propagation through it, but clang can't (Godbolt). If you want this to be as fast (with optimization enabled) as a literal constant like 0x1.fffffffffffffp-1; for systems where double is IEEE754 binary64, on some compilers you'll have to wait for that part of C++23 support. (It's likely that once compilers are able to do this, like GCC they'll optimize even without actually using -std=c++23.)

const double DoubleBelowOne = std::nextafter(1.0, 0.); at global scope will at worst run the function once at startup, defeating constant propagation where it's used, but otherwise performing about the same as FP literal constants when used with other runtime variables.

Is floating point precision mutable or invariant?

The precision is fixed, which is exactly 53 binary digits for double-precision (or 52 if we exclude the implicit leading 1). This comes out to about 15 decimal digits.


The OP asked me to elaborate on why having exactly 53 binary digits means "about" 15 decimal digits.

To understand this intuitively, let's consider a less-precise floating-point format: instead of a 52-bit mantissa like double-precision numbers have, we're just going to use a 4-bit mantissa.

So, each number will look like: (-1)s × 2yyy × 1.xxxx (where s is the sign bit, yyy is the exponent, and 1.xxxx is the normalised mantissa). For the immediate discussion, we'll focus only on the mantissa and not the sign or exponent.

Here's a table of what 1.xxxx looks like for all xxxx values (all rounding is half-to-even, just like how the default floating-point rounding mode works):

  xxxx  |  1.xxxx  |  value   |  2dd  |  3dd  
--------+----------+----------+-------+--------
0000 | 1.0000 | 1.0 | 1.0 | 1.00
0001 | 1.0001 | 1.0625 | 1.1 | 1.06
0010 | 1.0010 | 1.125 | 1.1 | 1.12
0011 | 1.0011 | 1.1875 | 1.2 | 1.19
0100 | 1.0100 | 1.25 | 1.2 | 1.25
0101 | 1.0101 | 1.3125 | 1.3 | 1.31
0110 | 1.0110 | 1.375 | 1.4 | 1.38
0111 | 1.0111 | 1.4375 | 1.4 | 1.44
1000 | 1.1000 | 1.5 | 1.5 | 1.50
1001 | 1.1001 | 1.5625 | 1.6 | 1.56
1010 | 1.1010 | 1.625 | 1.6 | 1.62
1011 | 1.1011 | 1.6875 | 1.7 | 1.69
1100 | 1.1100 | 1.75 | 1.8 | 1.75
1101 | 1.1101 | 1.8125 | 1.8 | 1.81
1110 | 1.1110 | 1.875 | 1.9 | 1.88
1111 | 1.1111 | 1.9375 | 1.9 | 1.94

How many decimal digits do you say that provides? You could say 2, in that each value in the two-decimal-digit range is covered, albeit not uniquely; or you could say 3, which covers all unique values, but do not provide coverage for all values in the three-decimal-digit range.

For the sake of argument, we'll say it has 2 decimal digits: the decimal precision will be the number of digits where all values of those decimal digits could be represented.


Okay, then, so what happens if we halve all the numbers (so we're using yyy = -1)?

  xxxx  |  1.xxxx  |  value    |  1dd  |  2dd  
--------+----------+-----------+-------+--------
0000 | 1.0000 | 0.5 | 0.5 | 0.50
0001 | 1.0001 | 0.53125 | 0.5 | 0.53
0010 | 1.0010 | 0.5625 | 0.6 | 0.56
0011 | 1.0011 | 0.59375 | 0.6 | 0.59
0100 | 1.0100 | 0.625 | 0.6 | 0.62
0101 | 1.0101 | 0.65625 | 0.7 | 0.66
0110 | 1.0110 | 0.6875 | 0.7 | 0.69
0111 | 1.0111 | 0.71875 | 0.7 | 0.72
1000 | 1.1000 | 0.75 | 0.8 | 0.75
1001 | 1.1001 | 0.78125 | 0.8 | 0.78
1010 | 1.1010 | 0.8125 | 0.8 | 0.81
1011 | 1.1011 | 0.84375 | 0.8 | 0.84
1100 | 1.1100 | 0.875 | 0.9 | 0.88
1101 | 1.1101 | 0.90625 | 0.9 | 0.91
1110 | 1.1110 | 0.9375 | 0.9 | 0.94
1111 | 1.1111 | 0.96875 | 1. | 0.97

By the same criteria as before, we're now dealing with 1 decimal digit. So you can see how, depending on the exponent, you can have more or less decimal digits, because binary and decimal floating-point numbers do not map cleanly to each other.

The same argument applies to double-precision floating point numbers (with the 52-bit mantissa), only in that case you're getting either 15 or 16 decimal digits depending on the exponent.



Related Topics



Leave a reply



Submit