Printing Double Without Losing Precision

Don't print floating-point values in decimal if you don't want to lose precision. Even if you print enough digits to represent the number exactly, not all implementations have correctly-rounded conversions to/from decimal strings over the entire floating-point range, so you may still lose precision.

Use hexadecimal floating point instead. In C:

printf("%a\n", yourNumber);

C++11 provides the std::hexfloat manipulator for iostreams, which does the same thing (on some platforms the std::hex modifier has the same effect, but that is not a portable assumption).

Using hex floating point is preferred for several reasons.

First, the printed value is always exact. No rounding occurs in writing or reading a value formatted in this way. Beyond the accuracy benefits, this means that reading and writing such values can be faster with a well tuned I/O library. They also require fewer digits to represent values exactly.
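A minimal sketch of the round trip in C, assuming a C99 library whose printf supports %a and whose strtod accepts hexadecimal input. The helper names format_hex_double and parse_hex_double are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: writes d as hexadecimal floating-point text.
 * Returns the text length, or -1 if buf is too small. */
int format_hex_double(double d, char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    int n = snprintf(buf, size, "%a", d);
    return (n >= 0 && (size_t)n < size) ? n : -1;
}

/* Hypothetical helper: reads the text back; strtod accepts the "0x...p..." form. */
double parse_hex_double(const char *text)
{
    assert(text != NULL);
    return strtod(text, NULL);
}
```

Formatting 0.1 with format_hex_double and passing the result to parse_hex_double should give back exactly 0.1. The text typically looks like 0x1.999999999999ap-4, although the choice of leading digit is left to the implementation.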

Format a double without losing precision but with a minimum number of digits

Following Frodyne's comment, I was able to figure out a very simple and fast solution.
By default, the C++17 std::to_chars function formats floating-point numbers so that they satisfy the shortest round-trip requirement. That means all distinct floating-point numbers remain distinct after serialization, and the number of characters in the output is minimized.
So the conversion can be written like this in standard C++17.

#include <charconv>
#include <string>

std::string doubleToString(double number)
{
    // 24 characters are enough for the shortest round-trip form of any double
    char buffer[24];
    std::to_chars_result result = std::to_chars(buffer, buffer + sizeof(buffer), number);
    return std::string(buffer, result.ptr);
}

The great news from the Microsoft lecture is that, in addition to solving the shortest round-trip problem, the implementation in MSVC is blazing fast! It is based on the incredible Ryu algorithm.

The bad news is that, at the time of writing, std::to_chars is only available for floating-point numbers in the Microsoft toolchain. The implementations in Clang libc++ and GCC libstdc++ are for the moment limited to integers.
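Where floating-point std::to_chars is unavailable, one fallback is to try increasing digit counts until the text reads back unchanged. The sketch below does that in plain C. It is not Ryu and is far slower, it assumes printf and strtod are correctly rounded, and the name shortest_roundtrip is made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, not part of any library.
 * Tries 1..17 significant digits and stops at the first text that strtod
 * reads back as exactly d. Returns the digit count used, or -1 on error. */
int shortest_roundtrip(double d, char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    for (int digits = 1; digits <= 17; digits++) {
        int n = snprintf(buf, size, "%.*g", digits, d);
        if (n < 0 || (size_t)n >= size)
            return -1;          /* buffer too small or encoding error */
        if (strtod(buf, NULL) == d)
            return digits;      /* first (shortest) text that round-trips */
    }
    return -1;                  /* NaN never compares equal, so it ends here */
}
```

With a 32-byte buffer, 0.1 should come out as "0.1" after a single iteration, while 1.0 / 3.0 needs 16 digits.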

How do I print a double value with full precision using cout?

You can set the precision directly on std::cout and use the std::fixed manipulator.

double d = 3.14159265358979;
cout.precision(17);
cout << "Pi: " << fixed << d << endl;

You can #include <limits> and use max_digits10 to get the number of decimal digits needed to round-trip a float or double.

#include <limits>

typedef std::numeric_limits< double > dbl;

double d = 3.14159265358979;
cout.precision(dbl::max_digits10);
cout << "Pi: " << d << endl;
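The same idea can be sketched in C with printf. This assumes a C11 <float.h> that defines DBL_DECIMAL_DIG, the C counterpart of max_digits10, and falls back to 17 otherwise. The helper names are made up for illustration.

```c
#include <assert.h>
#include <float.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef DBL_DECIMAL_DIG
#define DBL_DECIMAL_DIG 17   /* C11 macro; assumed 17, the IEEE-754 binary64 value */
#endif

/* Hypothetical helper: formats d with DBL_DECIMAL_DIG significant digits.
 * Returns the text length, or -1 if buf is too small. */
int format_max_digits(double d, char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    int n = snprintf(buf, size, "%.*g", DBL_DECIMAL_DIG, d);
    return (n >= 0 && (size_t)n < size) ? n : -1;
}

/* Returns 1 if the formatted text reads back as exactly d. */
int survives_roundtrip(double d)
{
    char buf[64];
    if (format_max_digits(d, buf, sizeof buf) < 0)
        return 0;
    return strtod(buf, NULL) == d;
}
```

The %g conversion counts significant digits rather than digits after the decimal point, so it keeps the full precision for very small and very large values as well.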

C dynamically printf double, no loss of precision and no trailing zeroes

There's probably no easier way. It's a quite involved problem.

Your code isn't solving it right for several reasons:

  • Most practical implementations of floating-point arithmetic aren't decimal, they are binary. So, when you multiply a floating-point number by 10 or divide it by 10, you may lose precision (this depends on the number).
  • Even though the standard 64-bit IEEE-754 floating-point format reserves 53 bits for the mantissa, which is equivalent to floor(log10(2 ^ 53)) = 15 decimal digits, a valid number in this format may need up to 1074 decimal digits in the fractional part when printed exactly (the smallest subnormal is 2 ^ -1074), which is what you appear to be asking about.

One way of solving this is to use the %a conversion specifier in snprintf(). It prints the floating-point value using hexadecimal digits for the mantissa, and the C standard from 1999 guarantees that this prints all significant digits if the floating-point format is radix-2 (AKA base-2 or simply binary). With this you can obtain all the binary digits of the mantissa of the number, and from there you can figure out how many decimal digits are in the fractional part.

Now, observe that:

1.00000 = 2^0  = 1.00000 (binary)
0.50000 = 2^-1 = 0.10000
0.25000 = 2^-2 = 0.01000
0.12500 = 2^-3 = 0.00100
0.06250 = 2^-4 = 0.00010
0.03125 = 2^-5 = 0.00001

and so on.

You can clearly see here that a binary digit in the i-th position to the right of the point in the binary representation produces a last non-zero decimal digit that is also in the i-th position to the right of the point in the decimal representation.

So, if you know where the least significant non-zero bit is in a binary floating-point number, you can figure out how many decimal digits are needed to print the fractional part of the number exactly.

And this is what my program is doing.

Code:

// file: PrintFullFraction.c
//
// compile with gcc 4.6.2 or better:
// gcc -Wall -Wextra -std=c99 -O2 PrintFullFraction.c -o PrintFullFraction.exe
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <math.h>
#include <float.h>
#include <assert.h>

#if FLT_RADIX != 2
#error currently supported only FLT_RADIX = 2
#endif

int FractionalDigits(double d)
{
    char buf[
        1 + // sign, '-' or '+'
        (sizeof(d) * CHAR_BIT + 3) / 4 + // mantissa hex digits max
        1 + // decimal point, '.'
        1 + // mantissa-exponent separator, 'p'
        1 + // exponent sign, '-' or '+'
        (sizeof(d) * CHAR_BIT + 2) / 3 + // exponent decimal digits max
        1 // string terminator, '\0'
    ];
    int n;
    char *pp, *p;
    int e, lsbFound, lsbPos;

    // convert d into "+/- 0x h.hhhh p +/- ddd" representation and check for errors
    if ((n = snprintf(buf, sizeof(buf), "%+a", d)) < 0 ||
        (unsigned)n >= sizeof(buf))
        return -1;

    //printf("{%s}", buf);

    // make sure the conversion didn't produce something like "nan" or "inf"
    // instead of "+/- 0x h.hhhh p +/- ddd"
    if (strstr(buf, "0x") != buf + 1 ||
        (pp = strchr(buf, 'p')) == NULL)
        return 0;

    // extract the base-2 exponent manually, checking for overflows
    e = 0;
    p = pp + 1 + (pp[1] == '-' || pp[1] == '+'); // skip the exponent sign at first
    for (; *p != '\0'; p++)
    {
        if (e > INT_MAX / 10)
            return -2;
        e *= 10;
        if (e > INT_MAX - (*p - '0'))
            return -2;
        e += *p - '0';
    }
    if (pp[1] == '-') // apply the sign to the exponent
        e = -e;

    //printf("[%s|%d]", buf, e);

    // find the position of the least significant non-zero bit
    lsbFound = lsbPos = 0;
    for (p = pp - 1; *p != 'x'; p--)
    {
        if (*p == '.')
            continue;
        if (!lsbFound)
        {
            int hdigit = (*p >= 'a') ? (*p - 'a' + 10) : (*p - '0'); // assuming ASCII chars
            if (hdigit)
            {
                static const int lsbPosInNibble[16] = { 0,4,3,4, 2,4,3,4, 1,4,3,4, 2,4,3,4 };
                lsbFound = 1;
                lsbPos = -lsbPosInNibble[hdigit];
            }
        }
        else
        {
            lsbPos -= 4;
        }
    }
    lsbPos += 4;

    if (!lsbFound)
        return 0; // d is 0 (integer)

    // adjust the least significant non-zero bit position
    // by the base-2 exponent (just add them), checking
    // for overflows

    if (lsbPos >= 0 && e >= 0)
        return 0; // lsbPos + e >= 0, d is integer

    if (lsbPos < 0 && e < 0)
        if (lsbPos < INT_MIN - e)
            return -2; // d isn't integer and needs too many fractional digits

    if ((lsbPos += e) >= 0)
        return 0; // d is integer

    if (lsbPos == INT_MIN && -INT_MAX != INT_MIN)
        return -2; // d isn't integer and needs too many fractional digits

    return -lsbPos;
}

const double testData[] =
{
    0,
    1, // 2 ^ 0
    0.5, // 2 ^ -1
    0.25, // 2 ^ -2
    0.125,
    0.0625, // ...
    0.03125,
    0.015625,
    0.0078125, // 2 ^ -7
    1.0/256, // 2 ^ -8
    1.0/256/256, // 2 ^ -16
    1.0/256/256/256, // 2 ^ -24
    1.0/256/256/256/256, // 2 ^ -32
    1.0/256/256/256/256/256/256/256/256, // 2 ^ -64
    3.14159265358979323846264338327950288419716939937510582097494459,
    0.1,
    INFINITY,
#ifdef NAN
    NAN,
#endif
    DBL_MIN
};

int main(void)
{
    unsigned i;
    for (i = 0; i < sizeof(testData) / sizeof(testData[0]); i++)
    {
        int digits = FractionalDigits(testData[i]);
        assert(digits >= 0);
        printf("%f %e %.*f\n", testData[i], testData[i], digits, testData[i]);
    }
    return 0;
}

Output (ideone):

0.000000 0.000000e+00 0
1.000000 1.000000e+00 1
0.500000 5.000000e-01 0.5
0.250000 2.500000e-01 0.25
0.125000 1.250000e-01 0.125
0.062500 6.250000e-02 0.0625
0.031250 3.125000e-02 0.03125
0.015625 1.562500e-02 0.015625
0.007812 7.812500e-03 0.0078125
0.003906 3.906250e-03 0.00390625
0.000015 1.525879e-05 0.0000152587890625
0.000000 5.960464e-08 0.000000059604644775390625
0.000000 2.328306e-10 0.00000000023283064365386962890625
0.000000 5.421011e-20 0.0000000000000000000542101086242752217003726400434970855712890625
3.141593 3.141593e+00 3.141592653589793115997963468544185161590576171875
0.100000 1.000000e-01 0.1000000000000000055511151231257827021181583404541015625
inf inf inf
nan nan nan
0.000000 2.225074e

You can see that π and 0.1 are correct only to roughly 15 to 17 significant decimal digits. The remaining digits show what the numbers were really rounded to, since these numbers cannot be represented exactly in a binary floating-point format.

You can also see that DBL_MIN, the smallest positive normalized double value, has 1022 digits in the fractional part and of those there are 715 significant digits.

Possible issues with this solution:

  • Your compiler's printf() functions do not support %a or do not correctly print all digits requested by the precision (this is quite possible).
  • Your computer uses non-binary floating-point formats (this is extremely rare).
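If printf() support for %a is the concern, the position of the least significant set bit can also be read straight from the bit pattern. The sketch below is an alternative to the program above, not a replacement for it. It assumes double is IEEE-754 binary64, and the name fractional_digits_from_bits is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: number of fractional decimal digits needed to print d
 * exactly, assuming double is IEEE-754 binary64. Returns 0 for integers,
 * zero, infinities and NaN. */
int fractional_digits_from_bits(double d)
{
    uint64_t bits;
    assert(sizeof bits == sizeof d);
    memcpy(&bits, &d, sizeof bits);          /* reinterpret without aliasing issues */

    uint64_t fraction = bits & ((UINT64_C(1) << 52) - 1);
    int biased = (int)((bits >> 52) & 0x7FF);
    if (biased == 0x7FF)
        return 0;                            /* inf or NaN */

    uint64_t significand;
    int exp2;                                /* |d| = significand * 2^exp2 */
    if (biased == 0) {                       /* zero or subnormal */
        significand = fraction;
        exp2 = -1074;
    } else {                                 /* normal: restore the implicit 1 bit */
        significand = fraction | (UINT64_C(1) << 52);
        exp2 = biased - 1075;
    }
    if (significand == 0)
        return 0;                            /* d is +0 or -0 */

    while ((significand & 1) == 0) {         /* drop trailing zero bits */
        significand >>= 1;
        exp2++;
    }
    /* significand is now odd, so the lowest set bit has weight 2^exp2 */
    return exp2 < 0 ? -exp2 : 0;
}
```

For 0.1 this returns 55, and for DBL_MIN it returns 1022, which agrees with the digit counts discussed above.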

Convert float to double without losing precision

It's not that you're actually getting extra precision - it's that the float didn't accurately represent the number you were aiming for originally. The double is representing the original float accurately; toString is showing the "extra" data which was already present.

For example (and these numbers aren't right, I'm just making things up) suppose you had:

float f = 0.1F;
double d = f;

Then the value of f might be exactly 0.100000234523. d will have exactly the same value, but when you convert it to a string it will "trust" that it's accurate to a higher precision, so won't round off as early, and you'll see the "extra digits" which were already there, but hidden from you.

When you convert to a string and back, you're ending up with a double value which is closer to the string value than the original float was - but that's only good if you really believe that the string value is what you really wanted.

Are you sure that float/double are the appropriate types to use here instead of BigDecimal? If you're trying to use numbers which have precise decimal values (e.g. money), then BigDecimal is a more appropriate type IMO.

Safely convert `float` to `double` without loss of precision

Based on my reading this is a bug in the implementation?

No. It's a bug in your expectations. The double value you're seeing is exactly the same value as the float value. The precise value is 13.8999996185302734375.

That's not the same as "the closest double value to 13.9" which is 13.9000000000000003552713678800500929355621337890625.

You're assigning the value 13.8999996185302734375 to a double value, and then printing the string representation - which is 13.899999618530273 as that's enough precision to completely distinguish it from other double values. If it were to print 13.9, that would be a bug, as there's a double value that's closer to 13.9, namely 13.9000000000000003552713678800500929355621337890625.
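The difference between the widened float and the nearest double can be sketched in C rather than in the language of the question. The sketch assumes IEEE-754 float and double with correctly rounded strtof and strtod, and widen_via_decimal is a made-up helper that goes through the shortest decimal text identifying the float.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns the double nearest to the shortest decimal
 * text that still identifies f as a float. This deliberately changes the
 * value; a plain (double)f keeps it exactly. */
double widen_via_decimal(float f)
{
    char buf[32];
    for (int digits = 1; digits <= 9; digits++) {
        int n = snprintf(buf, sizeof buf, "%.*g", digits, (double)f);
        assert(n > 0 && (size_t)n < sizeof buf);
        if (strtof(buf, NULL) == f)
            return strtod(buf, NULL);   /* e.g. "13.9" parsed as a double */
    }
    return (double)f;                   /* NaN: fall back to plain widening */
}
```

Under those assumptions, (double)13.9f compares equal to 13.8999996185302734375 and unequal to 13.9, while widen_via_decimal(13.9f) compares equal to 13.9. Which of the two is wanted depends on whether the float or the decimal text is treated as the true value.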

Is there any way to not lose the precision and still get the value?

float only guarantees 6 decimal digits of precision, so any computation with a float (even if the other operands are double, even if you're storing the result to a double) will only be precise to 6 digits.

If you need greater precision, then limit yourself to double or long double. The C standard only guarantees 10 decimal digits of precision for those types (an IEEE-754 double gives 15 in practice), so if you need more than that, you'll need to use something other than the native floating-point types and library functions. You'll either need to roll your own, or use an arbitrary-precision math library like GNU MP.

convert char * to double,without losing precision in c

First of all, you only need %f format specifier to print a double.

Then, you're not losing precision by using strtod(). The problem is the output representation, that is, how you use printf() with the %f format specifier.

As per C11, chapter §7.21.6.1, fprintf()

[...] If the precision is missing, it is taken as 6; [...]

Next, when you did

printf("%6.7f",ld);

the precision became 7 and it outputs the value you expect to see.
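A small sketch of the difference, using a made-up input string and a made-up helper name parse_and_format. The same parsed value is formatted once with the default precision and once with an explicit one.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parses text with strtod and formats it with %f.
 * A negative precision means "use the default of 6".
 * Returns the output length, or -1 on a parse or buffer error. */
int parse_and_format(const char *text, int precision, char *out, size_t size)
{
    assert(text != NULL && out != NULL && size > 0);
    char *end;
    double value = strtod(text, &end);
    if (end == text)
        return -1;                      /* nothing could be parsed */
    int n = (precision < 0)
        ? snprintf(out, size, "%f", value)
        : snprintf(out, size, "%.*f", precision, value);
    return (n >= 0 && (size_t)n < size) ? n : -1;
}
```

For the input "3.1234567", the default precision yields 3.123457 and a precision of 7 yields 3.1234567. The stored double is the same in both cases; only the printed text differs.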


