What Is the Size of Float and Double in C and C++

What is the difference between float and double?

Huge difference.

As the name implies, a double has 2x the precision of float[1]. In general a double has 15 decimal digits of precision, while float has 7.

Here's how the number of digits are calculated:

double has 52 mantissa bits + 1 hidden bit: log(253)÷log(10) = 15.95 digits

float has 23 mantissa bits + 1 hidden bit: log(224)÷log(10) = 7.22 digits

This precision loss could lead to greater truncation errors being accumulated when repeated calculations are done, e.g.

float a = 1.f / 81;
float b = 0;
for (int i = 0; i < 729; ++ i)
b += a;
printf("%.7g\n", b); // prints 9.000023

while

double a = 1.0 / 81;
double b = 0;
for (int i = 0; i < 729; ++ i)
b += a;
printf("%.15g\n", b); // prints 8.99999999999996

Also, the maximum value of float is about 3e38, but double is about 1.7e308, so using float can hit "infinity" (i.e. a special floating-point number) much more easily than double for something simple, e.g. computing the factorial of 60.

During testing, maybe a few test cases contain these huge numbers, which may cause your programs to fail if you use floats.


Of course, sometimes, even double isn't accurate enough, hence we sometimes have long double[1] (the above example gives 9.000000000000000066 on Mac), but all floating point types suffer from round-off errors, so if precision is very important (e.g. money processing) you should use int or a fraction class.


Furthermore, don't use += to sum lots of floating point numbers, as the errors accumulate quickly. If you're using Python, use fsum. Otherwise, try to implement the Kahan summation algorithm.


[1]: The C and C++ standards do not specify the representation of float, double and long double. It is possible that all three are implemented as IEEE double-precision. Nevertheless, for most architectures (gcc, MSVC; x86, x64, ARM) float is indeed a IEEE single-precision floating point number (binary32), and double is a IEEE double-precision floating point number (binary64).

what is the exact range of long double in c++?

Why not ask your compiler? std::numeric_limits<long double> is defined.

long double smallest = std::numeric_limits<long double>::min();
long double largest = std::numeric_limits<long double>::max();
long double lowest = std::numeric_limits<long double>::lowest();
std::cout << "lowest " << lowest << std::endl;
std::cout << "largest " << largest << std::endl;
std::cout << "smallest " << smallest << std::endl;

Running this code on godbolt.org with x86-64 GCC 11.2 gives me:

lowest -1.18973e+4932
largest 1.18973e+4932
smallest 3.3621e-4932

It might vary for your platform, of course.

float' vs. 'double' precision

Floating point numbers in C use IEEE 754 encoding.

This type of encoding uses a sign, a significand, and an exponent.

Because of this encoding, many numbers will have small changes to allow them to be stored.

Also, the number of significant digits can change slightly since it is a binary representation, not a decimal one.

Single precision (float) gives you 23 bits of significand, 8 bits of exponent, and 1 sign bit.

Double precision (double) gives you 52 bits of significand, 11 bits of exponent, and 1 sign bit.

Difference between long double and double in C and C++

To quote the C++ standard, §3.9.1 ¶8:

There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined. Integral and floating types are collectively called arithmetic types. Specializations of the standard template std::numeric_limits (18.3) shall specify the maximum and minimum values of each arithmetic type for an implementation.

That is to say that double takes at least as much memory for its representation as float and long double at least as much as double. That extra memory is used for more precise representation of a number.

On x86 systems, float is typically 4 bytes long and can store numbers as large as about 3×10³⁸ and about as small as 1.4×10⁻⁴⁵. It is an IEEE 754 single-precision number that stores about 7 decimal digits of a fractional number.

Also on x86 systems, double is 8 bytes long and can store numbers in the IEEE 754 double-precision format, which has a much larger range and stores numbers with more precision, about 15 decimal digits. On some other platforms, double may not be 8 bytes long and may indeed be the same as a single-precision float.

The standard only requires that long double is at least as precise as double, so some compilers will simply treat long double as if it is the same as double. But, on most x86 chips, the 10-byte extended precision format 80-bit number is available through the CPU's floating-point unit, which provides even more precision than 64-bit double, with about 21 decimal digits of precision.

Some compilers instead support a 16-byte (128-bit) IEEE 754 quadruple precision number format with yet more precise representations and a larger range.

How to guarantee exact size of double in C?

How to guarantee exact size of double in C?

Use _Static_assert()

#include <limits.h>

int main(void) {
_Static_assert(sizeof (double)*CHAR_BIT == 64, "Unexpected double size");
return 0;
}

_Static_assert available since C11. Otherwise code could use a run-time assert.

#include <assert.h>
#include <limits.h>

int main(void) {
assert(sizeof (double)*CHAR_BIT == 64);
return 0;
}

Although this will insure the size of a double is 64, it does not insure IEEE 754 double-precision binary floating-point format adherence.

Code could use __STDC_IEC_559__

An implementation that defines __STDC_IEC_559__ shall conform to the specifications in this annex` C11 Annex F IEC 60559 floating-point arithmetic

Yet that may be too strict. Many implementations adhere to most of that standard, yet still do no set the macro.


would there be some standard way to guarantee size of a floating point type or double?

The best guaranteed is to write the FP value as its hex representation or as an exponential with sufficient decimal digits. See Printf width specifier to maintain precision of floating-point value

Ranges of floating point datatype in C?

These numbers come from the IEEE-754 standard, which defines the standard representation of floating point numbers. Wikipedia article at the link explains how to arrive at these ranges knowing the number of bits used for the signs, mantissa, and the exponent.

Are there float and double types with fixed sizes in C99?

I found the answer in Any guaranteed minimum sizes for types in C?

Quoting Jed Smith (with corrected link to C99 standard):

Yes, the values in float.h and limits.h are system dependent. You should never make assumptions about the width of a type, but the standard does lay down some minimums. See §6.2.5 and §5.2.4.2.1 in the C99 standard.

For example, the standard only says that a char should be large enough to hold every character in the execution character set. It doesn't say how wide it is.

For the floating-point case, the standard hints at the order in which the widths of the types are given:

§6.2.5.10

There are three real floating types, designated as float, double, and long
double
. 32) The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.

They implicitly defined which is wider than the other, but not specifically how wide they are. "Subset" itself is vague, because a long double can have the exact same range of a double and satisfy this clause.

This is pretty typical of how C goes, and a lot is left to each individual environment. You can't assume, you have to ask the compiler.



Related Topics



Leave a reply



Submit