Efficient way to round double precision numbers to a lower precision given in number of bits
Dekker’s algorithm will split a floating-point number into high and low parts. If there are s bits in the significand (53 in IEEE 754 64-bit binary), then *x0 receives the high s-b bits, which is what you requested, and *x1 receives the remaining bits, which you may discard. In the code below, Scale should have the value 2^b. If b is known at compile time, e.g., the constant 43, you can replace Scale with 0x1p43. Otherwise, you must produce 2^b in some way.
This requires round-to-nearest mode. IEEE 754 arithmetic suffices, but other reasonable arithmetic may be okay too. It rounds ties to even, which is not what you requested (ties upward). Is that necessary?
This assumes that x * (Scale + 1) does not overflow. The operations must be evaluated in double precision (not greater).
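/* Scale must equal 2^b; 0x1p43 is the b = 43 example from the text. */
static const double Scale = 0x1p43;

/* Split x into *x0 (high s-b significand bits) and *x1 (the remainder). */
void Split(double *x0, double *x1, double x)
{
    double d = x * (Scale + 1);   /* must not overflow */
    double t = d - x;
    *x0 = d - t;                  /* x rounded to s-b bits */
    *x1 = x - *x0;                /* remaining low bits */
}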
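When b is only known at run time, one way to produce 2^b is the standard ldexp function. The following is a minimal sketch under that assumption, not part of the original answer: SplitBits is a hypothetical name, and the code presumes IEEE 754 binary64 doubles, round-to-nearest mode, and that each operation is rounded individually to double precision.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical variant of Split that takes b at run time.
   Assumes 0 < b < 53, round-to-nearest, double-precision evaluation,
   and that x * (2^b + 1) does not overflow. */
void SplitBits(double *x0, double *x1, double x, int b)
{
    assert(b > 0 && b < 53);
    double scale = ldexp(1.0, b);   /* exactly 2^b */
    double d = x * (scale + 1);
    double t = d - x;
    *x0 = d - t;                    /* high 53-b significand bits */
    *x1 = x - *x0;                  /* remaining low bits */
}
```

For example, with x = 1.0 + 0x1p-52 and b = 43, this should leave 1.0 in *x0 and 0x1p-52 in *x1.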
How to round a double/float to BINARY precision?
Yes, rounding off binary digits makes more sense than going through BigDecimal and can be implemented very efficiently if you are not worried about being within a small factor of Double.MAX_VALUE.
You can round a floating-point double value x with the following sequence in Java (untested):
double t = 9 * x; // beware: this overflows if x is too close to Double.MAX_VALUE
double y = x - t + t;
After this sequence, y should contain the rounded value. Adjust the distance between the two set bits in the constant 9 in order to adjust the number of bits that are rounded off. The value 3 rounds off one bit. The value 5 rounds off two bits. The value 9 used here rounds off three bits. The value 17 rounds off four bits, and so on: in general, 2^k + 1 rounds off k bits.
This sequence of instructions is attributed to Veltkamp and is typically used in “Dekker multiplication”. This page has some references.
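To make the choice of constant concrete, here is a small C sketch (a translation of the Java sequence, not code from the answer). The hypothetical helpers VeltkampConstant and RoundOffLowBits build the multiplier 2^k + 1 for a given number k of bits to round off, so k = 1, 2, 3, 4 give 3, 5, 9, 17. It assumes IEEE 754 doubles, round-to-nearest mode, individually rounded operations, and no overflow in the multiplication.

```c
#include <math.h>

/* Hypothetical helper: the multiplier whose two set bits are k positions apart. */
double VeltkampConstant(int k)
{
    return ldexp(1.0, k) + 1.0;           /* 2^k + 1: 3, 5, 9, 17, ... */
}

/* Round off the k lowest significand bits of x (assumes 0 < k < 53). */
double RoundOffLowBits(double x, int k)
{
    double t = VeltkampConstant(k) * x;   /* overflows if x is near DBL_MAX */
    return (x - t) + t;                   /* same evaluation order as x - t + t in Java */
}
```

As a quick check, RoundOffLowBits(1.0 + 3 * 0x1p-52, 2) should give 1.0 + 0x1p-50, since the two lowest bits are rounded up to the next multiple of 0x1p-50.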
Rounding to specific digits fails with this double-precision value
Double is a binary floating-point type. Its values are represented in the binary system (like 11010.00110). A decimal value stored in a double is often only an approximation, because not all decimal numbers have an exact representation in binary. Try for example this operation:
double d = 3.65d + 0.05d;
It will not result in 3.7 but in 3.6999999999999997. This is because the variable contains the closest available double.
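The effect is not specific to C#; it can be reproduced in any language that uses IEEE 754 doubles. As a rough illustration, here is the same sum in C (a translation, since the snippet above is C#). SumExample and PrintDouble are hypothetical helper names, and %.17g is used because 17 significant digits are enough to tell any two distinct doubles apart.

```c
#include <stdio.h>

/* The sum from the example; each literal is first rounded to the nearest double. */
double SumExample(void)
{
    return 3.65 + 0.05;
}

/* Print a double with 17 significant digits. */
void PrintDouble(double v)
{
    printf("%.17g\n", v);
}
```

Calling PrintDouble(SumExample()) and then PrintDouble(3.7) should print two different values, the first being 3.6999999999999997.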
The same happens in your case. Your variable contains the closest available double.
For precise operations, double/float is not the most fortunate choice.
Use double/float when you need fast performance or want to operate on a larger range of numbers, but where high precision is not required. For instance, it is a perfect type for calculations in physics.
For precise decimal operations use, well, decimal.
Here is an article about float/decimal: http://csharpindepth.com/Articles/General/FloatingPoint.aspx