Loss of Precision - Int -> Float or Double

How big is the precision loss converting long to double?

Converting a long (or any other data type representing a number) to double loses precision. This is quite obvious given the representation of floating point numbers.

This is less obvious than it seems, because the precision loss depends on the value of the long. For values between -2^52 and 2^52 there is no precision loss at all.

How big is the loss of precision if I convert a larger number to double? Do I have to expect differences larger than +/-X?

For numbers with magnitude above 2^52 you will experience some precision loss, depending on how far above the 52-bit limit you go. If the absolute value of your long fits in, say, 58 bits, then the magnitude of your precision loss will be 58 - 52 = 6 bits, or +/-64.
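As a hedged sketch of the bound above (the values 2^52 + 1 and 2^57 + 1 and the class name are purely illustrative), the loss can be observed in Java by round-tripping through double:

```java
public class LongDoubleLoss {
    public static void main(String[] args) {
        // Up to 2^52 (in fact up to 2^53) the round trip through double is exact:
        long small = (1L << 52) + 1;
        System.out.println((long) (double) small == small);   // true

        // A value needing 58 bits loses its low bits to rounding:
        long big = (1L << 57) + 1;
        long back = (long) (double) big;
        System.out.println(back == big);                      // false: precision lost
        System.out.println(Math.abs(back - big) <= 64);       // true: bounded by the ULP
    }
}
```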

Would decimal be more appropriate for this task?

decimal has a different representation than double, and it uses a different base. Since you are planning to divide your number by "small numbers", different representations would give you different errors on division. Specifically, double will be better at handling division by powers of two (2, 4, 8, 16, etc.) because such division can be accomplished by subtracting from exponent, without touching the mantissa. Similarly, large decimals would suffer no loss of significant digits when divided by ten, hundred, etc.
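A small Java sketch of this difference (class name and sample values are hypothetical): dividing a double by a power of two only adjusts the exponent and so round-trips exactly, while a BigDecimal divides by ten without rounding:

```java
import java.math.BigDecimal;

public class RadixDivision {
    public static void main(String[] args) {
        double x = 0.3;
        // Dividing by 8 only decrements the exponent, so the round trip is exact:
        System.out.println((x / 8) * 8 == x);                     // true

        // BigDecimal (base 10) divides by ten with no loss of digits:
        BigDecimal d = new BigDecimal("1.23").divide(BigDecimal.TEN);
        System.out.println(d.compareTo(new BigDecimal("0.123"))); // 0, i.e. equal

        // The double literal 0.1 is not the mathematical 0.1 -- base 2 cannot express it:
        System.out.println(new BigDecimal(0.1).equals(new BigDecimal("0.1"))); // false
    }
}
```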

Loss of precision for int to float conversion

is_cast_safe can be implemented with:

// I is the integer type of value; F is the destination floating-point type.
static const F One = 1;
F ULP = std::scalbn(One, std::ilogb(value) - std::numeric_limits<F>::digits + 1);
I U = std::max(ULP, One);
return value % U == 0;

This sets ULP to the value of the least digit position in the result of converting value to F. ilogb returns the position (as an exponent of the floating-point radix) for the highest digit position, and subtracting one less than the number of digits adjusts to the lowest digit position. Then scalbn gives us the value of that position, which is the ULP.

Then value can be represented exactly in F if and only if it is a multiple of the ULP. To test that, we convert the ULP to I (but substitute 1 if it is less than 1), and then take the remainder of value divided by the ULP (or 1).

Also, if one is concerned the conversion to F might overflow, code can be inserted to handle this as well.
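For comparison, here is a rough Java transliteration of the same idea, assuming long as I and double as F; Math.getExponent and Math.scalb play the roles of ilogb and scalbn, and the method name isCastSafe is hypothetical:

```java
public class CastCheck {
    // Returns true if value survives a round trip through double unchanged.
    static boolean isCastSafe(long value) {
        if (value == 0) return true;
        // Value of the lowest digit position after conversion to double
        // (double has 53 significand bits, so the ULP exponent is e - 52):
        double ulp = Math.scalb(1.0, Math.getExponent((double) value) - 52);
        long u = (long) Math.max(ulp, 1.0);
        return value % u == 0;   // exact iff value is a multiple of the ULP
    }

    public static void main(String[] args) {
        System.out.println(isCastSafe(1L << 53));        // true
        System.out.println(isCastSafe((1L << 53) + 1));  // false
    }
}
```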

Calculating the actual amount of the change is trickier. The conversion to floating-point could round up or down, and the rule for choosing is implementation-defined, although round-to-nearest-ties-to-even is common. So the actual change cannot be calculated from the floating-point properties we are given in numeric_limits. It must involve performing the conversion and doing some work in floating-point. This definitely can be done, but it is a nuisance. I think an approach that should work is:

  • Assume value is non-negative. (Negative values can be handled similarly but are omitted for now for simplicity.)
  • First, test for overflow in conversion to F. This in itself is tricky, as the behavior is undefined if the value is too large. Some similar considerations were addressed in this answer to a question about safely converting from floating-point to integer (in C).
  • If the value does not overflow, then convert it. Let the result be x. Divide x by the floating-point radix r, producing y. If y is not an integer (which can be tested using fmod or trunc), the conversion was exact.
  • Otherwise, convert y to I, producing z. This is safe because y is less than the original value, so it must fit in I.
  • Then the error due to conversion is (z - value/r)*r - value%r (the converted value x = z*r minus the original value, computed without overflow).
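In Java, the result of these steps can be cross-checked more directly, because new BigDecimal(double) converts a double exactly; the example value below is arbitrary:

```java
import java.math.BigDecimal;

public class ConversionError {
    public static void main(String[] args) {
        long value = (1L << 57) + 1;            // arbitrary example above 2^52
        double x = (double) value;              // rounds to the nearest double
        // new BigDecimal(double) is exact, so the subtraction gives the
        // conversion error directly:
        BigDecimal error = new BigDecimal(x).subtract(BigDecimal.valueOf(value));
        System.out.println(error);              // -1: conversion rounded down by one
    }
}
```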

Precision loss with java.lang.Double

Let's use a decimal floating-point arithmetic with a precision of three decimal digits and (roughly) the same features as typical binary floating-point arithmetic. Say you have 123.0 and 4.56. These numbers are represented by a mantissa (0 <= m < 1) and an exponent: 0.123*10^3 and 0.456*10^1, which I'll write as <.123e3> and <.456e1>. Adding two such numbers isn't immediately possible unless the exponents are equal, and that's why the addition proceeds according to:

  <.123e3>        <.123e3>
+ <.456e1>  =>  + <.004e3>
                ----------
                  <.127e3>

You see that the necessary alignment of the decimal digits according to a common exponent produces a loss of precision. In the extreme case, the entire addend could be shifted into nothingness. (Think of summing an infinite series where the terms get smaller and smaller but would still contribute considerably to the sum being computed.)

Other sources of imprecision result from differences between binary and decimal fractions, where an exact fraction in one base cannot be represented without error using the other one.

So, in short, addition and subtraction between numbers from rather different orders of magnitude are bound to cause a loss of precision.
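The same absorption effect can be seen with Java doubles (the class name and values are illustrative; 1e16 lies just past 2^53, where the ULP is 2):

```java
public class AbsorptionDemo {
    public static void main(String[] args) {
        double big = 1e16;                         // just past 2^53; ULP here is 2.0
        System.out.println(big + 1.0 == big);      // true: the addend is absorbed
        System.out.println(big + 2.0 == big);      // false: 2.0 survives as one ULP
    }
}
```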

'Lossy Conversion' vs. 'Loss of precision'

The difference is the end of the number which gets chopped off:

  • Lossy conversion keeps the least-significant bits and discards the most-significant ones. It is described in JLS Sec 5.1.3:

    A narrowing conversion of a signed integer to an integral type T simply discards all but the n lowest order bits, where n is the number of bits used to represent type T. In addition to a possible loss of information about the magnitude of the numeric value, this may cause the sign of the resulting value to differ from the sign of the input value.

    It is something like converting an int to a byte: you simply get the 8 least-significant bits in this case:

    System.out.println((byte) 258); // 2
  • Loss of precision keeps the most-significant bits and discards the least-significant ones. It is described in JLS Sec 5.1.2:

    A widening primitive conversion from int to float, or from long to float, or from long to double, may result in loss of precision - that is, the result may lose some of the least significant bits of the value.

    It is something like storing an int in a float which is too large to be represented accurately:

    int i = (1 << 24) + 1;
    float f = i;
    System.out.println((int) f == i); // false, because precision is lost.

Why not use Double or Float to represent currency?

Because floats and doubles cannot accurately represent the base 10 multiples that we use for money. This issue isn't just for Java, it's for any programming language that uses base 2 floating-point types.

In base 10, you can write 10.25 as 1025 * 10^-2 (an integer times a power of 10). IEEE-754 floating-point numbers are different, but a very simple way to think about them is to multiply by a power of two instead. For instance, you could be looking at 164 * 2^-4 (an integer times a power of two), which is also equal to 10.25. That's not how the numbers are represented in memory, but the math implications are the same.

Even in base 10, this notation cannot accurately represent most simple fractions. For instance, you can't represent 1/3: the decimal representation is repeating (0.3333...), so there is no finite integer that you can multiply by a power of 10 to get 1/3. You could settle on a long sequence of 3's and a small exponent, like 3333333333 * 10^-10, but it is not accurate: if you multiply that by 3, you won't get 1.

However, for the purpose of counting money, at least for countries whose money is valued within an order of magnitude of the US dollar, usually all you need is to be able to store multiples of 10^-2, so it doesn't really matter that 1/3 can't be represented.

The problem with floats and doubles is that the vast majority of money-like numbers don't have an exact representation as an integer times a power of 2. In fact, the only multiples of 0.01 between 0 and 1 (which are significant when dealing with money because they're integer cents) that can be represented exactly as an IEEE-754 binary floating-point number are 0, 0.25, 0.5, 0.75 and 1. All the others are off by a small amount. As an analogy to the 0.333333 example, if you take the floating-point value for 0.01 and you multiply it by 10, you won't get 0.1. Instead you will get something like 0.099999999786...

Representing money as a double or float will probably look good at first as the software rounds off the tiny errors, but as you perform more additions, subtractions, multiplications and divisions on inexact numbers, errors will compound and you'll end up with values that are visibly not accurate. This makes floats and doubles inadequate for dealing with money, where perfect accuracy for multiples of base 10 powers is required.
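A minimal Java illustration of how these errors show up (nothing here is specific to Java; any base-2 floating-point type behaves the same, and the class name is just for the example):

```java
public class MoneyDrift {
    public static void main(String[] args) {
        // Already off after a single addition:
        System.out.println(0.1 + 0.2 == 0.3);      // false

        // Errors compound: ten dimes do not make a dollar.
        double total = 0.0;
        for (int i = 0; i < 10; i++) total += 0.1;
        System.out.println(total == 1.0);          // false
        System.out.println(total);                 // 0.9999999999999999
    }
}
```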

A solution that works in just about any language is to use integers instead, and count cents. For instance, 1025 would be $10.25. Several languages also have built-in types to deal with money: among others, Java has the BigDecimal class, Rust has the rust_decimal crate, and C# has the decimal type.
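A hedged sketch of both approaches in Java (class name and amounts are illustrative):

```java
import java.math.BigDecimal;

public class ExactMoney {
    public static void main(String[] args) {
        // Option 1: count cents in a long -- pure integer arithmetic, no rounding.
        long cents = 1025;                           // $10.25
        cents += 999;                                // add $9.99
        System.out.println(cents);                   // 2024, i.e. $20.24 exactly

        // Option 2: BigDecimal, constructed from strings to stay exact.
        BigDecimal price = new BigDecimal("10.25").add(new BigDecimal("9.99"));
        System.out.println(price);                   // 20.24
    }
}
```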

C++ - Converting long to float or double rounds the value

As already mentioned by Mark Ransom in the comments, the output that you are seeing is just a short form for the actual value that is stored in your float or double value. You can see more digits by using e.g. std::setprecision(15).

Standard single-precision floating point values have 32 bits with 23 bits reserved for the mantissa. The most significant bit of the mantissa is assumed to be one (but not stored) when the floating point value is not zero. That means you have 24 bits for storage and the maximum value the mantissa can hold is 2^24 or 16777216. As you can see, you can store about 7 digits without losing precision. I say 'about' because not all decimal representations of a floating point value can be expressed with the same precision in binary format.

Here is an interesting experiment:

long n0 = 16777210;
for (int i = 0; i < 10; i++)
{
    long n = n0 + i;
    std::cout << "n=" << n << " / ((float)n)=" << std::setprecision(15) << ((float)n) << std::endl;
}

The output is:

n=16777210 / ((float)n)=16777210
n=16777211 / ((float)n)=16777211
n=16777212 / ((float)n)=16777212
n=16777213 / ((float)n)=16777213
n=16777214 / ((float)n)=16777214
n=16777215 / ((float)n)=16777215
n=16777216 / ((float)n)=16777216
n=16777217 / ((float)n)=16777216
n=16777218 / ((float)n)=16777218
n=16777219 / ((float)n)=16777220

The number 3012916000 is too large to be held exactly in a single precision floating point value. When you output your number like so:

std::cout << "x float = " << std::setprecision(15) << xFloat << std::endl;

Then the output is:

x float = 3012915968

Double values have a 52+1 bit mantissa and your number can therefore be stored exactly:

std::cout << "xDouble = " << std::setprecision(15) << xDouble << std::endl;

Output:

xDouble = 3012916000
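For reference, the same experiment expressed in Java behaves identically, since Java's float and double are the same IEEE-754 single and double formats (class name illustrative):

```java
public class FloatLimit {
    public static void main(String[] args) {
        long x = 3012916000L;
        // float: rounded to a 24-bit significand, so the low bits are lost:
        System.out.println((double) (float) x);      // 3.012915968E9
        // double: 53 significand bits hold the value exactly:
        System.out.println((double) x);              // 3.012916E9
        // 2^24 + 1 is the first integer a float cannot represent:
        System.out.println((float) 16777217L);       // 1.6777216E7
    }
}
```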

Retain precision with double in Java

As others have mentioned, you'll probably want to use the BigDecimal class, if you want to have an exact representation of 11.4.

Now, a little explanation into why this is happening:

The float and double primitive types in Java are floating point numbers, where the number is stored as a binary representation of a fraction and an exponent.

More specifically, a double-precision floating point value such as the double type is a 64-bit value, where:

  • 1 bit denotes the sign (positive or negative).
  • 11 bits for the exponent.
  • 52 bits for the significant digits (the significand, stored as a binary fraction).

These parts are combined to produce a double representation of a value.

(Source: Wikipedia: Double precision)

For a detailed description of how floating point values are handled in Java, see the Section 4.2.3: Floating-Point Types, Formats, and Values of the Java Language Specification.

The byte, char, int and long types are fixed-point numbers, which are exact representations of numbers. Unlike fixed-point numbers, floating point numbers will sometimes (it is safe to assume "most of the time") not be able to return an exact representation of a number. This is the reason why you end up with 11.399999999999 as the result of 5.6 + 5.8.

When requiring a value that is exact, such as 1.5 or 150.1005, you'll want to use one of the fixed-point types, which will be able to represent the number exactly.

As has been mentioned several times already, Java has a BigDecimal class which will handle very large numbers and very small numbers.

From the Java API Reference for the BigDecimal class:

Immutable, arbitrary-precision signed decimal numbers. A BigDecimal consists of an arbitrary precision integer unscaled value and a 32-bit integer scale. If zero or positive, the scale is the number of digits to the right of the decimal point. If negative, the unscaled value of the number is multiplied by ten to the power of the negation of the scale. The value of the number represented by the BigDecimal is therefore (unscaledValue × 10^-scale).
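Putting the two together, a short sketch (class name hypothetical) contrasting the drifting double sum with the exact BigDecimal one:

```java
import java.math.BigDecimal;

public class RetainPrecision {
    public static void main(String[] args) {
        // The double sum drifts away from 11.4:
        double d = 5.6 + 5.8;
        System.out.println(d == 11.4);              // false

        // BigDecimal, built from strings, adds 5.6 and 5.8 exactly:
        BigDecimal b = new BigDecimal("5.6").add(new BigDecimal("5.8"));
        System.out.println(b);                      // 11.4
    }
}
```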

There have been many questions on Stack Overflow relating to floating point numbers and their precision. Here is a list of related questions that may be of interest:

  • Why do I see a double variable initialized to some value like 21.4 as 21.399999618530273?
  • How to print really big numbers in C++
  • How is floating point stored? When does it matter?
  • Use Float or Decimal for Accounting Application Dollar Amount?

If you really want to get down to the nitty gritty details of floating point numbers, take a look at What Every Computer Scientist Should Know About Floating-Point Arithmetic.


