Why Double Can Store Bigger Numbers Than Unsigned Long Long

Why double can store bigger numbers than unsigned long long?

The reason is that unsigned long long will store exact integers whereas double stores a mantissa (with limited 52-bit precision) and an exponent.

This allows double to store very large numbers (around 10^308), but not exactly. You have about 15 (almost 16) valid decimal digits in a double, and the rest of the 308 possible digits are zeroes (actually undefined, but you can assume "zero" for better understanding).

An unsigned long long only has about 19 digits, but every single one of them is exactly defined.

EDIT:
In reply to the comment below asking "how does this exactly work": you have 1 bit for the sign, 11 bits for the exponent, and 52 bits for the mantissa. The mantissa has an implied "1" bit at the beginning, which is not stored, so effectively you have 53 mantissa bits. 2^53 is about 9.007E15, so you have 15, almost 16 decimal digits to work with.

The exponent is effectively signed (stored with a bias rather than a separate sign bit) and can range from -1022 to +1023; it is used to scale (binary shift left or right) the mantissa (2^1023 is around 10^307, hence the limits on range), so very small and very large numbers are equally possible with this format.

But, of course, all numbers that you can represent only have as much precision as will fit into the mantissa.
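
A small C sketch (assuming the common IEEE-754 binary64 format for double) that prints these limits from <float.h> and shows where exact integer representation ends:

#include <stdio.h>
#include <float.h>

int main(void) {
    /* On IEEE-754 binary64 systems: DBL_MANT_DIG == 53, DBL_DIG == 15, DBL_MAX_EXP == 1024 */
    printf("mantissa bits: %d, decimal digits: %d, max exponent: %d\n",
           DBL_MANT_DIG, DBL_DIG, DBL_MAX_EXP);

    double exact = 9007199254740992.0;   /* 2^53: the last stretch of unit precision */
    printf("2^53     == 2^53 + 1 ? %d\n", exact == exact + 1.0);  /* 1: the +1 is lost */
    printf("2^53 - 1 == 2^53     ? %d\n", exact - 1.0 == exact);  /* 0: still distinct */
    return 0;
}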

All in all, floating point numbers are not very intuitive, since "easy" decimal numbers are not necessarily representable as floating point numbers at all. This is due to the fact that the mantissa is binary. For example, it is possible (and easy) to represent any positive integer up to a few billion (in fact, up to 2^53), or numbers like 0.5 or 0.25 or 0.125, with perfect precision.

On the other hand, it is also possible to represent a number like 10^250, but only approximately. In fact, you will find that 10^250 and 10^250+1 are the same number (wait, what???). That is because although you can easily have 250 digits, you do not have that many significant digits (read "significant" as "known" or "defined").
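
You can watch that happen directly (again assuming IEEE-754 doubles):

#include <stdio.h>

int main(void) {
    double big = 1e250;
    /* ulp(1e250) is about 1.6e234, so adding 1 cannot change the value */
    printf("%d\n", big + 1.0 == big);   /* prints 1 */
    return 0;
}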

Representing something seemingly simple like 0.3 is also only possible approximately, even though 0.3 isn't a "big" number. The problem is that 0.3 has no finite binary representation: no matter what binary exponent you attach to it, you will not find any binary mantissa that results in exactly 0.3 (but you can get very close).
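
A quick check makes this visible (the exact digits printed may vary slightly by implementation):

#include <stdio.h>

int main(void) {
    /* The nearest double to 0.3 is slightly below it */
    printf("%.20f\n", 0.3);              /* e.g. 0.29999999999999998890 */
    printf("%d\n", 0.1 + 0.2 == 0.3);    /* prints 0: the two sides round differently */
    return 0;
}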

Some "special values" are reserved for "infinity" (both positive and negative) as well as "not a number", so you have very slightly less than the total theoretical range.

unsigned long long, on the other hand, does not interpret the bit pattern in any special way. Every number that you can represent is simply the exact integer encoded by the bit pattern. Every digit of every number is exactly defined; no scaling happens.

how can something be bigger than (unsigned long long) LONG_MAX?

How can something be bigger than (unsigned long long) LONG_MAX?

Easily. LONG_MAX is the maximum value that can be represented as a long int. Converting that to unsigned long long does not change its value, only its data type. The maximum value that can be represented as an unsigned long int is larger on every C implementation you're likely to meet. The maximum value of long long int is larger on some, and the maximum value of unsigned long long int is, again, larger on every C implementation you're likely to meet (much larger on some).
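
A minimal program that prints the limits in question (the actual values depend on the implementation's data model, e.g. LP64 Linux vs. LLP64 Windows):

#include <stdio.h>
#include <limits.h>

int main(void) {
    printf("LONG_MAX   = %ld\n",  LONG_MAX);
    printf("ULONG_MAX  = %lu\n",  ULONG_MAX);
    printf("LLONG_MAX  = %lld\n", LLONG_MAX);
    printf("ULLONG_MAX = %llu\n", ULLONG_MAX);
    return 0;
}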

However, this ...

unsigned long long ull = (unsigned long long) LONG_MAX;
printf("%lu", ull);

... is not a conforming way to investigate the value in question because %lu is a formatting directive for type unsigned long, not unsigned long long. The printf call presented therefore has undefined behavior.
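
A conforming variant uses %llu, which is the directive that matches unsigned long long:

#include <stdio.h>
#include <limits.h>

int main(void) {
    unsigned long long ull = (unsigned long long) LONG_MAX;
    printf("%llu\n", ull);   /* %llu matches unsigned long long */
    return 0;
}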

I found this code in an algorithm I need to update:

 if (value > (unsigned long long) LONG_MAX)

EDIT: value is the result of a division of two uint64_t numbers.

[...]

In what case will this if statement evaluate to true?

Supposing that value has type uint64_t, which is probably the same as unsigned long long in your implementation, the condition in that if statement will evaluate to true at least when the most-significant bit of value is set. If your long int is only 32 bits wide, however, then the condition will evaluate to true much more widely than that, because there are many 64-bit integers that are larger than the largest value representable as a signed 32-bit integer.

I would be inclined to guess that the code was indeed written under the assumption that long int is 32 bits wide, so that the if statement asks a very natural question: "can the result of the previous uint64_t division be represented as a long?" In fact, that's what the if statement is evaluating in any case, but it makes more sense if long is only 32 bits wide, which is typical of 32-bit computers and standard on both 32- and 64-bit Windows.
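
For illustration, a small sketch of that boundary; exceeds_long_max is just a hypothetical wrapper around the same comparison, and which of the test values trip it depends on how wide long is on your platform:

#include <stdio.h>
#include <stdint.h>
#include <limits.h>

/* Mirrors the check from the algorithm: true when value does not fit in a long */
static int exceeds_long_max(uint64_t value) {
    return value > (unsigned long long) LONG_MAX;
}

int main(void) {
    /* With a 32-bit long, the second test already prints 1; with a 64-bit long,
       only the third one (most-significant bit set) does. */
    printf("%d\n", exceeds_long_max(42));
    printf("%d\n", exceeds_long_max(UINT64_C(0x80000000)));          /* 2^31 */
    printf("%d\n", exceeds_long_max(UINT64_C(0x8000000000000000)));  /* 2^63 */
    return 0;
}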

Largest integer that can be stored in long double

Inasmuch as you express in comments that you want to use long double as a substitute for long long to obtain increased range, I assume that you also require unit precision. Thus, you are asking for the largest number representable by the available number of mantissa digits (LDBL_MANT_DIG) in the radix of the floating-point representation (FLT_RADIX). In the very likely event that FLT_RADIX == 2, you can compute that value like so:

#include <float.h>
#include <math.h>

long double get_max_integer_equivalent() {
    long double max_bit = ldexpl(1, LDBL_MANT_DIG - 1);
    return max_bit + (max_bit - 1);
}

The ldexp family of functions scale floating-point values by powers of 2, analogous to what the bit-shift operators (<< and >>) do for integers, so the above is similar to

// not reliable for the purpose!
unsigned long long max_bit = 1ULL << (LDBL_MANT_DIG - 1);
return max_bit + (max_bit - 1);

Since you suppose that your long double provides more mantissa digits than your long long has value bits, however, you must assume that the bit shift would overflow (strictly speaking, shifting by that many bits has undefined behavior).

There are, of course, much larger values that your long double can express, all of them integers. But they do not have unit precision, and thus the behavior of your long double will diverge from the expected behavior of integers once its values get that large. For example, if a long double variable d contains such a larger value, then at least one of d + 1 == d and d - 1 == d will likely evaluate to true.
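
For illustration, a possible way to exercise get_max_integer_equivalent and observe where unit precision ends (the printed value depends on LDBL_MANT_DIG on your platform):

#include <stdio.h>
#include <float.h>
#include <math.h>

long double get_max_integer_equivalent(void) {
    long double max_bit = ldexpl(1, LDBL_MANT_DIG - 1);
    return max_bit + (max_bit - 1);
}

int main(void) {
    long double m = get_max_integer_equivalent();
    printf("largest unit-precision integer: %.0Lf\n", m);

    long double beyond = m * 4;             /* well past unit precision */
    printf("%d\n", beyond + 1 == beyond);   /* prints 1: adding 1 is lost */
    return 0;
}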

Why (c++) casting from long long unsigned int to long double and back produces 0

b can't necessarily contain a, only an approximation of it. long double has a larger range than unsigned long long (hence a larger value of max) but might have fewer mantissa bits to hold the most significant bits of the value, giving less precision for large values.

The maximum unsigned long long value is 2^N-1 where N is the number of bits; probably 64.

If long double has fewer than N mantissa bits, then conversion will round this to one of the two nearest representable values, perhaps 2^N. This is outside the range of unsigned long long, so converting back gives undefined behaviour. Perhaps it's being reduced using modular arithmetic to zero (as would happen when converting from an integer type), but in principle anything could happen.
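
A hedged C illustration of the round trip: whether it survives depends on how many mantissa bits long double has (64 on typical x86 implementations, only 53 where long double is the same as double, as with MSVC), so the back conversion is guarded to avoid the undefined case. It assumes unsigned long long has no padding bits.

#include <stdio.h>
#include <float.h>
#include <limits.h>

int main(void) {
    unsigned long long a = ULLONG_MAX;   /* typically 2^64 - 1 */
    long double b = a;                   /* exact only if the mantissa can hold all value bits */

    printf("LDBL_MANT_DIG = %d\n", LDBL_MANT_DIG);
    printf("a = %llu\n", a);
    printf("b = %.0Lf\n", b);   /* ...551615 if stored exactly, ...551616 if rounded up to 2^64 */

    if (LDBL_MANT_DIG >= (int) (CHAR_BIT * sizeof a)) {
        /* All value bits fit in the mantissa, so converting back is well defined */
        printf("round trip matches: %d\n", (unsigned long long) b == a);
    } else {
        /* b may have rounded up past ULLONG_MAX; converting it back would be undefined */
        printf("not converting back: value may be out of range\n");
    }
    return 0;
}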

C++ Addition of very large unsigned long and double

The question is: what are the exact C++ rules for evaluating such an expression, and in which data format is it evaluated, that lead to this unfortunate result?

Let's inspect the line:

ul += d;

Where d has type double and ul has type unsigned long.

From 7.6.19 Assignment and compound assignment operators :

The behavior of an expression of the form E1 op= E2 is equivalent to E1 = E1 op E2 except that E1 is evaluated only once

So ul += d is equivalent to ul = ul + d (with ul evaluated only once).

From 7.6.6 Additive operators :

The additive operators + and - group left-to-right.
The usual arithmetic conversions are performed for operands of arithmetic or enumeration type.

So the usual arithmetic conversions are applied to the operands of ul + d.

From 7.4 Usual arithmetic conversions :

[...] This pattern is called the usual arithmetic conversions, which are defined as follows:

  • [...]

  • Otherwise, if either operand is double, the other shall be converted to double.

  • [...]

So ul is converted to double in ul + d.

From 7.3.11 Floating-integral conversions (emphasis mine):

A prvalue of an integer type or of an unscoped enumeration type can be converted to a prvalue of a floating-point type.
The result is exact if possible.
If the value being converted is in the range of values that can be represented but the value cannot be represented exactly, it is an implementation-defined choice of either the next lower or higher representable value.

If the value being converted is outside the range of values that can be represented, the behavior is undefined.

So if the value of ul can't be represented exactly in double, it is implementation-defined which of the two neighbouring representable values is used.

And then, after the calculation, the double result is converted back to unsigned long by the assignment to ul, so again from 7.3.11 Floating-integral conversions (emphasis mine):

A prvalue of a floating-point type can be converted to a prvalue of an integer type.
The conversion truncates; that is, the fractional part is discarded.
The behavior is undefined if the truncated value cannot be represented in the destination type.
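
To see the first conversion concretely: with a 64-bit unsigned long, ULONG_MAX - 10 has no exact double representation, and under the usual round-to-nearest choice it rounds up to 2^64, which is already outside unsigned long's range before the 1.0 is even added. A small sketch (the rounding choice is implementation-defined, so the printed value could in principle differ):

#include <stdio.h>
#include <limits.h>

int main(void) {
    unsigned long ul = ULONG_MAX - 10;   /* 2^64 - 11 with a 64-bit unsigned long */
    double d = (double) ul;              /* nearest doubles are 2^64 - 2048 and 2^64 */

    printf("ul = %lu\n", ul);
    printf("d  = %.1f\n", d);            /* typically 18446744073709551616.0, i.e. 2^64 */
    /* d (and d + 1.0) is already larger than ULONG_MAX, so converting the sum
       back to unsigned long is undefined behaviour -- hence the surprising 0. */
    return 0;
}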



The output of this code is 0 for all values n < 1024. Why?

GCC documents that it follows C99 Annex F when converting floating-point values to integers and back (see the GCC 11.1.0 documentation of implementation-defined behavior, 4.6 Floating point). In C99 Annex F the result of such an out-of-range conversion is unspecified, but a floating-point exception is required to be raised. The following code, with the show_fe_exceptions function copied from the cppreference feexceptflag example,

#include <iostream>
#include <limits>
#include <cfenv>
#include <cstdio>

void show_fe_exceptions(void)
{
    printf("current exceptions raised: ");
    if (fetestexcept(FE_DIVBYZERO)) printf(" FE_DIVBYZERO");
    if (fetestexcept(FE_INEXACT))   printf(" FE_INEXACT");
    if (fetestexcept(FE_INVALID))   printf(" FE_INVALID");
    if (fetestexcept(FE_OVERFLOW))  printf(" FE_OVERFLOW");
    if (fetestexcept(FE_UNDERFLOW)) printf(" FE_UNDERFLOW");
    if (fetestexcept(FE_ALL_EXCEPT) == 0) printf(" none");
    printf("\n");
}

int main(int argc, char **argv) {
    unsigned long n = 10ul;
    unsigned long ul = std::numeric_limits<unsigned long>::max() - n;
    double d = 1.;
    show_fe_exceptions();
    ul += d;
    show_fe_exceptions();
    std::cout << ul << std::endl;
}

outputs the following on godbolt and confirms that the exception (FE_INVALID) is indeed raised:

current exceptions raised:  none
current exceptions raised: FE_INEXACT FE_INVALID
0

Larger than Unsigned Long Long

Short Answer
Go for a 3rd party library.

Long Answer
When dealing with large numbers, probably one of the most fundamental design decisions is how am I going to represent the large number?

Will it be a string, an array, a list, or a custom (homegrown) storage class?

After that decision is made, the actual math operations can be broken down into smaller parts and then executed with native language types such as int.
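
As a toy illustration of that approach, here is a minimal C sketch (a hypothetical layout chosen just for this example) that stores each number as an array of decimal digits, least significant first, and adds two such numbers digit by digit with a carry:

#include <stdio.h>

#define DIGITS 64   /* capacity of this toy representation */

/* a and b hold decimal digits, least significant first; result gets a + b */
static void big_add(const int a[DIGITS], const int b[DIGITS], int result[DIGITS]) {
    int carry = 0;
    for (int i = 0; i < DIGITS; ++i) {
        int sum = a[i] + b[i] + carry;
        result[i] = sum % 10;
        carry = sum / 10;
    }
}

static void big_print(const int n[DIGITS]) {
    int started = 0;
    for (int i = DIGITS - 1; i >= 0; --i) {
        if (n[i] != 0) started = 1;
        if (started) putchar('0' + n[i]);
    }
    if (!started) putchar('0');
    putchar('\n');
}

int main(void) {
    /* 999...9 (20 nines) + 1: already past ULLONG_MAX on most platforms */
    int a[DIGITS] = {0}, b[DIGITS] = {0}, sum[DIGITS] = {0};
    for (int i = 0; i < 20; ++i) a[i] = 9;
    b[0] = 1;
    big_add(a, b, sum);
    big_print(sum);   /* prints 1 followed by 20 zeros */
    return 0;
}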

Even with strings there is a limit on the number of characters, i.e. digits, in the number, as indicated here:

What is the maximum possible length of a .NET string?

You might also want to check: Arbitrary-precision Arithmetic

C - Unsigned long long to double on 32-bit machine

uint64_t vs double, which has a higher range limit for covering positive numbers?

uint64_t, where supported, has 64 value bits, no padding bits, and no sign bit. It can represent all integers between 0 and 2^64 - 1, inclusive.

Substantially all modern C implementations represent double in IEEE-754 64-bit binary format, but C neither requires nor even endorses that format. It is so common, however, that it is fairly safe to assume it, and perhaps to put in some compile-time checks against the macros defining floating-point characteristics. I will assume for the balance of this answer that the C implementation indeed uses that representation.

IEEE-754 binary double precision provides 53 bits of mantissa, therefore it can represent all integers between 0 and 2^53 - 1. It is a floating-point format, however, with an 11-bit binary exponent. The largest finite value it can represent is (2^53 - 1) * 2^971, which is nearly 2^1024 (about 1.8 * 10^308). In this sense, double has a much greater range than uint64_t, but the vast majority of integers between 0 and its maximum value cannot be represented exactly as doubles, including almost all of the numbers that can be represented exactly by uint64_t.
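
A short illustration of that trade-off (assuming IEEE-754 binary64 double and a 64-bit uint64_t):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <float.h>

int main(void) {
    uint64_t u = UINT64_MAX;             /* 18446744073709551615, held exactly */
    double   d = (double) u;             /* rounds: only 53 mantissa bits available */

    printf("uint64_t: %" PRIu64 "\n", u);
    printf("double  : %.1f\n", d);       /* typically 18446744073709551616.0 */

    printf("DBL_MAX : %g\n", DBL_MAX);   /* about 1.8e308: far beyond uint64_t's range */
    return 0;
}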

How to convert double into uint64_t if only the whole number part of double is needed

You can simply assign (conversion is implicit), or you can explicitly cast if you want to make it clear that a conversion takes place:

double my_double = 1.2345678e15;    /* within uint64_t's range, so the conversion is defined */
uint64_t my_uint;
uint64_t my_other_uint;

my_uint = my_double;                    /* implicit conversion */
my_other_uint = (uint64_t) my_double;   /* explicit cast, same result */

Any fractional part of the double's value will be truncated. The integer part will be preserved exactly if it is representable as a uint64_t; otherwise, the behavior is undefined.

The code you presented uses a union to overlay storage of a double and a uint64_t. That's not inherently wrong, but it's not a useful technique for converting between the two types. Casts are C's mechanism for all non-implicit value conversions.


