C++ Calculating More Precise Than Double or Long Double

C++ calculating more precise than double or long double

You will need to perform the calculation using some other method than floating point. There are libraries for doing "long math" such as GMP.

If that's not what you're looking for, you can also write code to do this yourself. The simplest way is to just use a string, and store a digit per character. Do the math just like you would do if you did it by hand on paper. Adding numbers together is relatively easy, so is subtracting. Doing multiplication and division is a little harder.

For non-integer numbers, you'll need to make sure you line up the decimal point for add/subtract...

It's a good learning experience to write that, but don't expect it to be something you can knock together in half an hour without much thought (add and subtract, perhaps!).

More Precise Floating point Data Types than double?

According to Wikipedia, the 80-bit "Intel" IEEE 754 extended-precision long double (80 bits, typically padded to 12 or 16 bytes in memory) has a 64-bit mantissa with no implicit bit, which gets you 19.26 decimal digits. This has been the almost universal standard for long double for ages, but recently things have started to change.

The newer 128-bit quad-precision format has 112 mantissa bits plus an implicit bit, which gets you 34 decimal digits. GCC implements this as the __float128 type, and there is a compiler option (-mlong-double-128 on x86 targets) to make long double refer to it.

c++ double more accuracy/precision

You should take a look at third party libraries like boost.multiprecision or even GMP.

You can also do it "by hand" but that would be a lot of work. You would have to keep numbers as their string representation and manually make the arithmetic operations yourself.

how to use more precision in c++

There's more precision in that number than you're displaying. You just have to ask for it:

cout << std::setprecision(40) << pi << endl;

That gives me 3.141592653543740176758092275122180581093 when run on Codepad.

A double should have way more than enough precision for basic calculations. If you need to compute millions of places, there isn't a standard floating point representation big enough to handle that.

A more accurate data type than float or double? C++

In some compilers, and on some architectures, "long double" will give you more precision than double. If you are on an x86 platform, the x87 FPU has an "extended" 80-bit floating point format. GCC and Borland compilers give you an 80-bit float value when you use the "long double" type. Note that Visual Studio does not support this (the maximum supported by MSVC is double precision, 64 bits).

There is something called a "double double" which is a software technique for implementing quad-precision 128-bit floating point. You can find libraries that implement it.

You could also investigate libraries for arbitrary precision arithmetic.

For some calculations a 64 bit integer is a better choice than a 64 bit floating point value.

But if your question is about built-in types in current C++ compilers on common platforms then the answer is that you're limited to double (64 bit floating point), and on 64 bit platforms you have 64 bit ints. If you can stick to x86 and use the right compiler you can also have long double (80-bit extended precision).

You might be interested in this question:

long double (GCC specific) and __float128

Float is more precise than double?

float is not more precise than double, and your float computation has not given you the exact result of pow(4,-59)/3.

What's going on is that your recurrence is designed to take a tiny rounding error and amplify it every iteration. In exact math, each value should be exactly one quarter of the previous value, but if it's not exactly a quarter due to rounding error, the difference gets magnified on every step.

Since a quarter of a representable value is always representable (until you hit subnormal numbers and underflow issues), the recurrence has an additional property: if the computation is performed in a precision sufficiently in excess of the precision with which the results are stored, then rounding the results to lower precision for storage will round to exactly a quarter of the previous value. (The choice of the 9./4 and 1./2 factors gives the recurrence an even stronger version of this property, where the result is exactly a quarter of the old value even before rounding for storage.)
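Why the error is amplified by exactly a factor of two each step can be read off the recurrence's characteristic equation; this is a short derivation sketch:

```latex
% The recurrence x_n = \tfrac{9}{4} x_{n-1} - \tfrac{1}{2} x_{n-2}
% has characteristic equation
r^2 = \tfrac{9}{4} r - \tfrac{1}{2}
  \quad\Longrightarrow\quad r = \tfrac{1}{4} \ \text{or}\ r = 2,
% so every solution has the form
x_n = A \left(\tfrac{1}{4}\right)^n + B \cdot 2^n .
% The starting values 1/3 and 1/12 select A = 1/3, B = 0, but any rounding
% error introduces a tiny nonzero B, and the 2^n term then doubles each
% step until it dominates the shrinking (1/4)^n term.
```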


With doubles, with the compiler and compiler settings you're using, the rounding error occurs and gets amplified. With floats, the computations are performed in double precision, eliminating rounding error in the recurrence steps due to the properties described above, so there is nothing to amplify. If the computation for doubles had been performed at long double precision, the same thing would have happened.


Let's take a closer look at the exact values produced, using the %a format specifier to print floating-point numbers in hexadecimal notation. A value printed this way looks like 0x1.5555555555558p-6, where the part between 0x and p is a hexadecimal number and the part after the p is a decimal number giving a power of two to multiply the hexadecimal part by. Here, 0x1.5555555555558p-6 represents 0x1.5555555555558 times 2^-6. The %a format always prints the exact value of a float or double, unlike %g, which rounds.

We'll also show a third computation, storing results as doubles, but doing the math in long double precision.

Our altered program looks like this:

#include <stdio.h>
#include <math.h>

int main(void)
{
    float x[60];
    x[0] = 1./3;
    x[1] = 1./12;
    for (int i = 2; i < 60; i++) {
        x[i] = 9./4*x[i-1] - 1./2*x[i-2];
    }

    double y[60];
    y[0] = 1./3;
    y[1] = 1./12;
    for (int i = 2; i < 60; i++) {
        y[i] = 9./4*y[i-1] - 1./2*y[i-2];
    }

    double z[60];
    z[0] = 1./3;
    z[1] = 1./12;
    for (int i = 2; i < 60; i++) {
        z[i] = (long double) 9./4*z[i-1] - (long double) 1./2*z[i-2];
    }

    printf("float:%a, double:%a, double2:%a, formula:%a\n",
           x[59], y[59], z[59], pow(4,-59)/3);
    for (int i = 0; i < 60; i++) {
        printf("%d %a %a %a\n", i, x[i], y[i], z[i]);
    }
    return 0;
}

And here's the output. I was going to abridge this, but it turns out it's hard to do that without obscuring interesting parts of the pattern:

float:0x1.555556p-120, double:0x1.b6db6db6db6dap+0, double2:0x1.5555555555555p-120, formula:0x1.5555555555555p-120
0 0x1.555556p-2 0x1.5555555555555p-2 0x1.5555555555555p-2
1 0x1.555556p-4 0x1.5555555555555p-4 0x1.5555555555555p-4
2 0x1.555556p-6 0x1.5555555555558p-6 0x1.5555555555555p-6
3 0x1.555556p-8 0x1.555555555557p-8 0x1.5555555555555p-8
4 0x1.555556p-10 0x1.555555555563p-10 0x1.5555555555555p-10
5 0x1.555556p-12 0x1.5555555555c3p-12 0x1.5555555555555p-12
6 0x1.555556p-14 0x1.5555555558c3p-14 0x1.5555555555555p-14
7 0x1.555556p-16 0x1.5555555570c3p-16 0x1.5555555555555p-16
8 0x1.555556p-18 0x1.5555555630c3p-18 0x1.5555555555555p-18
9 0x1.555556p-20 0x1.5555555c30c3p-20 0x1.5555555555555p-20
10 0x1.555556p-22 0x1.5555558c30c3p-22 0x1.5555555555555p-22
11 0x1.555556p-24 0x1.5555570c30c3p-24 0x1.5555555555555p-24
12 0x1.555556p-26 0x1.5555630c30c3p-26 0x1.5555555555555p-26
13 0x1.555556p-28 0x1.5555c30c30c3p-28 0x1.5555555555555p-28
14 0x1.555556p-30 0x1.5558c30c30c3p-30 0x1.5555555555555p-30
15 0x1.555556p-32 0x1.5570c30c30c3p-32 0x1.5555555555555p-32
16 0x1.555556p-34 0x1.5630c30c30c3p-34 0x1.5555555555555p-34
17 0x1.555556p-36 0x1.5c30c30c30c3p-36 0x1.5555555555555p-36
18 0x1.555556p-38 0x1.8c30c30c30c3p-38 0x1.5555555555555p-38
19 0x1.555556p-40 0x1.8618618618618p-39 0x1.5555555555555p-40
20 0x1.555556p-42 0x1.e186186186186p-39 0x1.5555555555555p-42
21 0x1.555556p-44 0x1.bc30c30c30c3p-38 0x1.5555555555555p-44
22 0x1.555556p-46 0x1.b786186186185p-37 0x1.5555555555555p-46
23 0x1.555556p-48 0x1.b6f0c30c30c3p-36 0x1.5555555555555p-48
24 0x1.555556p-50 0x1.b6de186186185p-35 0x1.5555555555555p-50
25 0x1.555556p-52 0x1.b6dbc30c30c3p-34 0x1.5555555555555p-52
26 0x1.555556p-54 0x1.b6db786186185p-33 0x1.5555555555555p-54
27 0x1.555556p-56 0x1.b6db6f0c30c3p-32 0x1.5555555555555p-56
28 0x1.555556p-58 0x1.b6db6de186185p-31 0x1.5555555555555p-58
29 0x1.555556p-60 0x1.b6db6dbc30c3p-30 0x1.5555555555555p-60
30 0x1.555556p-62 0x1.b6db6db786185p-29 0x1.5555555555555p-62
31 0x1.555556p-64 0x1.b6db6db6f0c3p-28 0x1.5555555555555p-64
32 0x1.555556p-66 0x1.b6db6db6de185p-27 0x1.5555555555555p-66
33 0x1.555556p-68 0x1.b6db6db6dbc3p-26 0x1.5555555555555p-68
34 0x1.555556p-70 0x1.b6db6db6db785p-25 0x1.5555555555555p-70
35 0x1.555556p-72 0x1.b6db6db6db6fp-24 0x1.5555555555555p-72
36 0x1.555556p-74 0x1.b6db6db6db6ddp-23 0x1.5555555555555p-74
37 0x1.555556p-76 0x1.b6db6db6db6dbp-22 0x1.5555555555555p-76
38 0x1.555556p-78 0x1.b6db6db6db6dap-21 0x1.5555555555555p-78
39 0x1.555556p-80 0x1.b6db6db6db6dap-20 0x1.5555555555555p-80
40 0x1.555556p-82 0x1.b6db6db6db6dap-19 0x1.5555555555555p-82
41 0x1.555556p-84 0x1.b6db6db6db6dap-18 0x1.5555555555555p-84
42 0x1.555556p-86 0x1.b6db6db6db6dap-17 0x1.5555555555555p-86
43 0x1.555556p-88 0x1.b6db6db6db6dap-16 0x1.5555555555555p-88
44 0x1.555556p-90 0x1.b6db6db6db6dap-15 0x1.5555555555555p-90
45 0x1.555556p-92 0x1.b6db6db6db6dap-14 0x1.5555555555555p-92
46 0x1.555556p-94 0x1.b6db6db6db6dap-13 0x1.5555555555555p-94
47 0x1.555556p-96 0x1.b6db6db6db6dap-12 0x1.5555555555555p-96
48 0x1.555556p-98 0x1.b6db6db6db6dap-11 0x1.5555555555555p-98
49 0x1.555556p-100 0x1.b6db6db6db6dap-10 0x1.5555555555555p-100
50 0x1.555556p-102 0x1.b6db6db6db6dap-9 0x1.5555555555555p-102
51 0x1.555556p-104 0x1.b6db6db6db6dap-8 0x1.5555555555555p-104
52 0x1.555556p-106 0x1.b6db6db6db6dap-7 0x1.5555555555555p-106
53 0x1.555556p-108 0x1.b6db6db6db6dap-6 0x1.5555555555555p-108
54 0x1.555556p-110 0x1.b6db6db6db6dap-5 0x1.5555555555555p-110
55 0x1.555556p-112 0x1.b6db6db6db6dap-4 0x1.5555555555555p-112
56 0x1.555556p-114 0x1.b6db6db6db6dap-3 0x1.5555555555555p-114
57 0x1.555556p-116 0x1.b6db6db6db6dap-2 0x1.5555555555555p-116
58 0x1.555556p-118 0x1.b6db6db6db6dap-1 0x1.5555555555555p-118
59 0x1.555556p-120 0x1.b6db6db6db6dap+0 0x1.5555555555555p-120

Here, we see first that the float computation didn't produce the exact value the pow formula gave (it doesn't have enough precision for that), but it was close enough that the difference was hidden by %g's rounding. We also see that the float values are decreasing by exactly a factor of 4 each time, as are the values from the altered double computation. The double values from the original double version start out almost doing that and then diverge once the amplified error overwhelms the computation. The values eventually start increasing by a factor of 2 instead of decreasing by a factor of 4.

Why is decimal more precise than double if it has a shorter range? C#

what I'm understanding here is that decimal takes more space but provides a shorter range?

Correct. It provides higher precision and smaller range. Plainly if you have a limited number of bits, you can increase precision only by decreasing range!

everyone agrees that decimal should be use when precision is required

Since that statement is false -- in particular, I do not agree with it -- any conclusion you draw from it is not sound.

The purpose of using decimal is not higher precision. It is smaller representation error. Higher precision is one way to achieve smaller representation error, but decimal does not achieve its smaller representation error by being higher precision. It achieves its smaller representation error by exactly representing decimal fractions.

Decimal is for those scenarios where the representation error of a decimal fraction must be zero, such as a financial computation.

Also when doing a calculation like (1/3)*3, the desired result is 1, but only float and double give me 1

You got lucky. There are lots of fractions where the representation error of that computation is non-zero for both floats and doubles.

Let's do a quick check to see how many there are. We'll just make a million rationals and see:

    var q = from x in Enumerable.Range(1, 1000)
            from y in Enumerable.Range(1, 1000)
            where ((double)x) / y * y != x
            select x + " " + y;
    Console.WriteLine(q.Count()); // 101791

Over 10% of all small-number rationals are represented as doubles with sufficiently large representation error that they do not turn back into whole numbers when multiplied by their denominator!

If your desire is to do exact arithmetic on arbitrary rationals then neither double nor decimal are the appropriate type to use. Use a big-rational library if you need to exactly represent rationals.

why is decimal more precise?

Decimal is more precise than double because it has more bits of precision.

But again, precision is not actually that relevant. What is relevant is that decimal has smaller representation error than double for many common fractions.

It has smaller representation error than double for representing fractions with a small power of ten in the denominator because it was designed specifically to have zero representation error for all fractions with a small power of ten in the denominator.

That's why it is called "decimal", because it represents fractions with powers of ten. It represents the decimal system, which is the system we commonly use for arithmetic.

Double, in contrast, was explicitly not designed to have small representation error. Double was designed to have the range, precision, representation error and performance that is appropriate for physics computations.

There is no bias towards exact decimal quantities in physics. There is such a bias in finance. Use decimals for finance. Use doubles for physics.

What is the precision of long double in C++?

You can find out with std::numeric_limits:

#include <iostream>  // std::cout
#include <limits>    // std::numeric_limits

int main()
{
    std::cout << std::numeric_limits<long double>::digits10 << std::endl;
}

Can C# store more precise data than doubles?

Yes, decimal is designed for just that.

However, do be aware that the range of the decimal type is smaller than that of a double. That is, double can hold a larger value, but it does so by losing precision. Or, as stated on MSDN:

The decimal keyword denotes a 128-bit data type. Compared to floating-point types, the decimal type has a greater precision and a smaller range, which makes it suitable for financial and monetary calculations. The approximate range and precision for the decimal type are shown in the following table.

The primary difference between decimal and double is that decimal uses a base-10 representation while double uses base-2. That means decimal can store common decimal fractions (like 0.1) exactly, while double has to approximate them with a binary fraction. A decimal is 128 bits, so it takes double the space of a double to store, and calculations on decimal are also slower (measure!).

If you need even larger precision, then BigInteger can be used from .NET 4. (You will need to handle decimal points yourself.) Be aware that BigInteger is immutable, so any arithmetic operation on it creates a new instance; if the numbers are large, this can be crippling for performance.

I suggest you look into exactly how precise you need to be. Perhaps your algorithm can work with normalized values that can be smaller? If performance is an issue, one of the built-in floating point types is likely to be faster.

Difference between long double and double in C and C++

To quote the C++ standard, §3.9.1 ¶8:

There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined. Integral and floating types are collectively called arithmetic types. Specializations of the standard template std::numeric_limits (18.3) shall specify the maximum and minimum values of each arithmetic type for an implementation.

That is to say that double takes at least as much memory for its representation as float, and long double at least as much as double. That extra memory is used for a more precise representation of a number.

On x86 systems, float is typically 4 bytes long and can store numbers as large as about 3×10³⁸ and about as small as 1.4×10⁻⁴⁵. It is an IEEE 754 single-precision number that stores about 7 decimal digits of a fractional number.

Also on x86 systems, double is 8 bytes long and can store numbers in the IEEE 754 double-precision format, which has a much larger range and stores numbers with more precision, about 15 decimal digits. On some other platforms, double may not be 8 bytes long and may indeed be the same as a single-precision float.

The standard only requires that long double is at least as precise as double, so some compilers will simply treat long double as if it were the same as double. But on most x86 chips, the 80-bit (10-byte) extended-precision format is available through the CPU's floating-point unit, which provides even more precision than the 64-bit double, with about 19 decimal digits of precision.

Some compilers instead support a 16-byte (128-bit) IEEE 754 quadruple precision number format with yet more precise representations and a larger range.


