Does the C++ Standard Specify Anything on the Representation of Floating Point Numbers

Does the C++ standard specify anything on the representation of floating point numbers?

From N3337:

[basic.fundamental/8]: There are three floating point types: float, double, and long double. The type double
provides at least as much precision as float, and the type long double provides at least as much precision as
double. The set of values of the type float is a subset of the set of values of the type double; the set of
values of the type double is a subset of the set of values of the type long double. The value representation
of floating-point types is implementation-defined. Integral and floating types are collectively called
arithmetic types. Specializations of the standard template std::numeric_limits (18.3) shall specify the
maximum and minimum values of each arithmetic type for an implementation.

If you want to check whether your implementation uses IEEE-754, you can test std::numeric_limits<T>::is_iec559:

static_assert(std::numeric_limits<double>::is_iec559,
              "This code requires IEEE-754 doubles");

There are a number of other helper traits in this area, such as has_infinity, quiet_NaN and more.
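
For example, here is a minimal sketch that queries a few of these traits; what they report is, of course, implementation-specific:

#include <iostream>
#include <limits>

int main()
{
    using lim = std::numeric_limits<double>;
    std::cout << std::boolalpha;
    std::cout << "is_iec559:     " << lim::is_iec559 << '\n';       // true on IEEE-754 platforms
    std::cout << "has_infinity:  " << lim::has_infinity << '\n';
    std::cout << "has_quiet_NaN: " << lim::has_quiet_NaN << '\n';
    if (lim::has_quiet_NaN)
        std::cout << "quiet_NaN():   " << lim::quiet_NaN() << '\n'; // typically prints "nan"
}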

C/C++: Are IEEE 754 float addition/multiplication/... and int-to-float conversion standardized?

No, unless the macro __STDC_IEC_559__ is defined.

Basically the standard does not require IEEE 754 compatible floating point, so most compilers will use whatever floating-point support the hardware provides. If the hardware provides IEEE-compatible floating point, most compilers for that target will use it and predefine the __STDC_IEC_559__ macro.
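
As a minimal sketch (whether the macro is actually predefined is entirely up to the implementation), the check can be done in the preprocessor:

#include <cstdio>

int main()
{
#ifdef __STDC_IEC_559__
    // The implementation claims IEC 60559 (IEEE 754) conformance.
    std::puts("__STDC_IEC_559__ is defined");
#else
    std::puts("__STDC_IEC_559__ is not defined");
#endif
}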

If the macro is defined, then float and double are guaranteed to have the bit representation (but not the byte order) of 32-bit and 64-bit IEEE 754, respectively. This in turn guarantees bit-exact results for double arithmetic (but note that the C standard allows float arithmetic to happen at either 32-bit or 64-bit precision).

The C standard requires that float-to-int conversion behave the same as the trunc function when the result is in range for the destination type, but unfortunately IEEE 754 only defines the behavior of basic arithmetic, not of library functions. The C standard also allows the compiler to reorder operations in violation of IEEE 754 (which might affect precision), but most compilers that support IEEE 754 will not do that without a command-line option.
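
For instance, assuming the values are in range for int, the conversion truncates toward zero, matching std::trunc:

#include <cassert>
#include <cmath>

int main()
{
    double d = 3.7;
    // Converting to an integer type discards the fractional part (truncates toward zero).
    assert(static_cast<int>(d)  ==  3);
    assert(static_cast<int>(-d) == -3);
    assert(static_cast<int>(d)  == static_cast<int>(std::trunc(d)));
}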

Anecdotal evidence also suggests that some compilers do not define the macro even though they should, while other compilers define it when they should not (i.e. they do not strictly follow all the requirements of IEEE 754). These cases should probably be considered compiler bugs.

How are floating point numbers stored in memory?

To understand how they are stored, you must first understand what they are and what kind of values they are intended to handle.

Unlike integers, a floating-point value is intended to represent extremely small values as well as extremely large ones. For normal 32-bit floating-point values, this corresponds to values in the range from about 1.175494351 * 10^-38 to 3.40282347 * 10^+38.

Clearly, using only 32 bits, it's not possible to store every digit in such numbers.

When it comes to the representation, you can view every normal floating-point number as a value in the range 1.0 to (almost) 2.0, scaled with a power of two (see the sketch after this list). So:

  • 1.0 is simply 1.0 * 2^0,
  • 2.0 is 1.0 * 2^1, and
  • -5.0 is -1.25 * 2^2.
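
A brief sketch using std::frexp shows this decomposition in code; note that frexp normalizes the mantissa into the range [0.5, 1.0) rather than [1.0, 2.0), so its exponent comes out one larger than in the examples above:

#include <cmath>
#include <cstdio>

int main()
{
    for (double x : {1.0, 2.0, -5.0, 0.375})
    {
        int exp = 0;
        double m = std::frexp(x, &exp);   // x == m * 2^exp, with |m| in [0.5, 1.0)
        std::printf("%g = %g * 2^%d\n", x, m, exp);
    }
}

For -5.0 this prints -5 = -0.625 * 2^3, which is the same value as the -1.25 * 2^2 shown above.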

So, what is needed to encode this, as efficiently as possible? What do we really need?

  • The sign of the expression.
  • The exponent.
  • The value in the range 1.0 to (almost) 2.0. This is known as the "mantissa" or the significand.

This is encoded as follows, according to the IEEE-754 floating-point standard.

  • The sign is a single bit.
  • The exponent is stored as an unsigned integer; for 32-bit floating-point values, this field is 8 bits wide. The value 1 represents the smallest exponent and "all ones minus 1" the largest. (0 and "all ones" are used to encode special values, see below.) The value in the middle (127, in the 32-bit case) represents an exponent of zero; this offset is also known as the bias.
  • When looking at the mantissa (the value between 1.0 and (almost) 2.0), one sees that all possible values start with a "1" (both in the decimal and the binary representation), so there is no point in storing that leading bit. The rest of the binary digits are stored in an integer field; in the 32-bit case this field is 23 bits.
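
A minimal sketch of pulling those three fields out of a 32-bit float, assuming float is IEEE-754 binary32 (which std::numeric_limits<float>::is_iec559 can confirm):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    float f = -5.0f;

    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);            // copy out the object representation

    std::uint32_t sign     = bits >> 31;            // 1 bit
    std::uint32_t exponent = (bits >> 23) & 0xFF;   // 8 bits, biased by 127
    std::uint32_t mantissa = bits & 0x7FFFFF;       // 23 stored bits (implicit leading 1)

    // Prints: sign=1 exponent=129 (unbiased 2) mantissa=0x200000
    std::printf("sign=%u exponent=%u (unbiased %d) mantissa=0x%06X\n",
                static_cast<unsigned>(sign),
                static_cast<unsigned>(exponent),
                static_cast<int>(exponent) - 127,
                static_cast<unsigned>(mantissa));
}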

In addition to the normal floating-point values, there are a number of special values:

  • Zero is encoded with both the exponent and the mantissa as zero. The sign bit is used to distinguish "plus zero" from "minus zero". A minus zero is useful when the result of an operation is extremely small, but it is still important to know from which direction the operation came.
  • Plus and minus infinity -- represented using an "all ones" exponent and a zero mantissa field.
  • Not a Number (NaN) -- represented using an "all ones" exponent and a non-zero mantissa.
  • Denormalized numbers -- numbers smaller than the smallest normal number, represented using a zero exponent field and a non-zero mantissa. The special thing about these numbers is that the precision (i.e. the number of digits a value can hold) drops as the value gets smaller, simply because there is no room left for those digits in the mantissa.
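
These cases can be recognized portably with the classification helpers in <cmath>; a minimal sketch:

#include <cmath>
#include <cstdio>
#include <limits>

int main()
{
    double values[] = {
        0.0, -0.0,
        std::numeric_limits<double>::infinity(),
        std::numeric_limits<double>::quiet_NaN(),
        std::numeric_limits<double>::denorm_min()   // smallest positive denormalized value
    };

    for (double v : values)
    {
        const char* kind =
            std::isnan(v)                      ? "NaN"          :
            std::isinf(v)                      ? "infinity"     :
            std::fpclassify(v) == FP_SUBNORMAL ? "denormalized" :
            v == 0.0                           ? "zero"         : "normal";
        std::printf("%-12s %g\n", kind, v);
    }
}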

Finally, the following is a handful of concrete examples (all values are in hex):

  • 1.0 : 3f800000
  • -1234.0 : c49a4000
  • 100000000000000000000000.0: 65a96816
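
A short sketch reproducing those hex patterns, again assuming a 32-bit IEEE-754 float:

#include <cstdint>
#include <cstdio>
#include <cstring>

static std::uint32_t to_bits(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);   // copy out the object representation
    return bits;
}

int main()
{
    std::printf("%08x\n", static_cast<unsigned>(to_bits(1.0f)));      // 3f800000
    std::printf("%08x\n", static_cast<unsigned>(to_bits(-1234.0f)));  // c49a4000
    std::printf("%08x\n", static_cast<unsigned>(to_bits(1e23f)));     // 65a96816
}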

How are floats and doubles interpreted by the compiler and eventually represented in memory?

The core idea of floating-point representations is that a number x is written as m*b^e, where m is a mantissa or fractional part, b is a base, and e is an exponent.

e.g.

0.375 = 1.5*(2^(-2))

The IEEE-754 floating-point standard

The IEEE-754 floating-point standard is a standard for representing and manipulating floating-point quantities that is followed by virtually all modern computer systems. It defines several standard representations of floating-point numbers, all of which have the following basic pattern (the specific layout here is for 32-bit floats):

[Figure: single-precision layout -- bit 31: sign | bits 30-23: exponent | bits 22-0: mantissa]

The bit numbers count from the least-significant bit. The most significant bit is the sign (0 for positive, 1 for negative). The following 8 bits are the exponent in excess-127 binary notation; this means that the bit pattern 01111111 = 127 represents an exponent of 0, 10000000 = 128 represents 1, 01111110 = 126 represents -1, and so forth. The mantissa occupies the remaining 23 bits, with its leading 1 stripped off as described above (so it still provides 24 bits of precision).

Certain numbers have a special representation. Because 0 cannot be represented in the standard form (there is no 1 before the decimal point), it is given the special representation 0 00000000 00000000000000000000000. (There is also a -0 = 1 00000000 00000000000000000000000, which compares equal to +0 but may print differently.) Numbers with an exponent field of 11111111 = 255 (which would nominally mean 2^128) represent non-numeric quantities such as "not a number" (NaN), returned by operations like (0.0/0.0), and positive or negative infinity.

example:

     0 =                      0 = 0 00000000 00000000000000000000000
    -0 =                     -0 = 1 00000000 00000000000000000000000
 0.125 =                  0.125 = 0 01111100 00000000000000000000000
  0.25 =                   0.25 = 0 01111101 00000000000000000000000
   0.5 =                    0.5 = 0 01111110 00000000000000000000000
     1 =                      1 = 0 01111111 00000000000000000000000
     2 =                      2 = 0 10000000 00000000000000000000000
     4 =                      4 = 0 10000001 00000000000000000000000
     8 =                      8 = 0 10000010 00000000000000000000000
 0.375 =                  0.375 = 0 01111101 10000000000000000000000
  0.75 =                   0.75 = 0 01111110 10000000000000000000000
   1.5 =                    1.5 = 0 01111111 10000000000000000000000
     3 =                      3 = 0 10000000 10000000000000000000000
     6 =                      6 = 0 10000001 10000000000000000000000
   0.1 =    0.10000000149011612 = 0 01111011 10011001100110011001101
   0.2 =    0.20000000298023224 = 0 01111100 10011001100110011001101
   0.4 =    0.40000000596046448 = 0 01111101 10011001100110011001101
   0.8 =    0.80000001192092896 = 0 01111110 10011001100110011001101
 1e+12 =           999999995904 = 0 10100110 11010001101010010100101
 1e+24 = 1.0000000138484279e+24 = 0 11001110 10100111100001000011100
 1e+36 = 9.9999996169031625e+35 = 0 11110110 10000001001011111001110
   inf =                    inf = 0 11111111 00000000000000000000000
  -inf =                   -inf = 1 11111111 00000000000000000000000
   nan =                    nan = 0 11111111 10000000000000000000000

For a 64-bit double, both the exponent and the mantissa fields are larger:

  • sign - 1 bit
  • exponent - 11 bits
  • mantissa - 52 bits
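
Assuming IEEE-754 doubles, these field widths can be cross-checked against std::numeric_limits; a minimal sketch:

#include <iostream>
#include <limits>

int main()
{
    using lim = std::numeric_limits<double>;
    std::cout << "total bits:       " << sizeof(double) * 8 << '\n';  // typically 64
    std::cout << "significand bits: " << lim::digits << '\n';         // 53 = 52 stored + 1 implicit
    std::cout << "max exponent:     " << lim::max_exponent << '\n';   // 1024, matching an 11-bit exponent field
}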

The legacy x87 unit in Intel (and compatible) processors internally uses an even larger 80-bit extended floating-point format:

  • sign - 1 bit
  • exponent - 15 bits
  • mantissa - 64 bits (with an explicit, rather than implicit, leading bit)


Largest value representable by a floating-point type smaller than 1

You can use the std::nextafter function, which, despite its name, can retrieve the next representable value that is arithmetically before a given starting point, by using an appropriate to argument (often -Infinity, 0, or +Infinity).

This works portably by definition of nextafter, regardless of what floating-point format your C++ implementation uses. (Binary vs. decimal, or width of mantissa aka significand, or anything else.)

Example: retrieving the closest value less than 1 for the double type (here on Windows, using the clang-cl compiler in Visual Studio 2019). The answer differs from the result of the 1 - ε calculation, which (as discussed in the comments) does not yield the closest value for IEEE-754 numbers: below any power of 2, representable numbers are twice as close together as they are above it.

#include <iostream>
#include <iomanip>
#include <cmath>
#include <limits>

int main()
{
    double naft = std::nextafter(1.0, 0.0);
    std::cout << std::fixed << std::setprecision(20);
    std::cout << naft << '\n';

    double neps = 1.0 - std::numeric_limits<double>::epsilon();
    std::cout << neps << '\n';
    return 0;
}

Output:

0.99999999999999988898
0.99999999999999977796

With different output formatting, these could print as 0x1.fffffffffffffp-1 and 0x1.ffffffffffffep-1 (1 - ε), respectively.


Note that, when using the analogous technique to determine the closest value greater than 1, the std::nextafter(1.0, 10000.) call gives the same value as the 1 + ε calculation (1.00000000000000022204), as would be expected from the definition of ε.
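
That asymmetry around 1.0 can be checked directly; a small sketch assuming IEEE-754 doubles:

#include <cassert>
#include <cmath>
#include <limits>

int main()
{
    // Above 1.0 the spacing is exactly epsilon, so nextafter and 1 + epsilon agree...
    assert(std::nextafter(1.0, 2.0) == 1.0 + std::numeric_limits<double>::epsilon());

    // ...but below 1.0 the spacing halves, so 1 - epsilon skips one representable value.
    assert(std::nextafter(1.0, 0.0) != 1.0 - std::numeric_limits<double>::epsilon());
    assert(std::nextafter(1.0, 0.0) == 1.0 - std::numeric_limits<double>::epsilon() / 2);
}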



Performance

C++23 requires std::nextafter to be constexpr, but currently only some compilers support that. GCC does constant-propagate through it, but Clang cannot (see Godbolt). If you want this to be as fast (with optimization enabled) as a literal constant like 0x1.fffffffffffffp-1 on systems where double is IEEE-754 binary64, on some compilers you will have to wait for that part of C++23 support. (It is likely that once compilers are able to do this, they will, like GCC, optimize it even without actually using -std=c++23.)

Writing const double DoubleBelowOne = std::nextafter(1.0, 0.); at global scope will at worst run the function once at startup. That defeats constant propagation where the value is used, but otherwise it performs about the same as an FP literal constant when used with other runtime variables.
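
If a true compile-time constant is needed today, and the code already assumes IEEE-754 binary64 (which can be guarded with is_iec559), a hedged alternative is to spell out the hex-float literal mentioned above; hexadecimal floating-point literals require C++17:

#include <limits>

static_assert(std::numeric_limits<double>::is_iec559,
              "assumes IEEE-754 binary64 doubles");

// Equal to std::nextafter(1.0, 0.0) on binary64, but usable as a constant expression everywhere.
constexpr double DoubleBelowOne = 0x1.fffffffffffffp-1;

static_assert(DoubleBelowOne < 1.0);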


