Is Floating Point Arbitrary Precision Available

Is floating point arbitrary precision available?

In the standard library, the decimal module may be what you're looking for. I have also found mpmath to be quite helpful. Its documentation has many great examples as well (unfortunately, my office computer does not have mpmath installed; otherwise I would verify a few examples and post them).
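
For reference, basic mpmath usage looks something like this (a minimal, unverified sketch per the caveat above; mp.dps sets the working precision in decimal digits):

from mpmath import mp, mpf, sqrt

mp.dps = 30                  # work with 30 significant decimal digits
print(sqrt(mpf(1) / 7))      # ~0.377964473009227227214516536234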

One caveat about the decimal module, though. The module contains several built-in functions for simple mathematical operations (e.g. sqrt), but the results from these functions may not always match the corresponding functions in math or other modules at higher precisions (although they may be more accurate). For example:

from decimal import Decimal, getcontext
import math

getcontext().prec = 30          # 30 significant digits for Decimal arithmetic
num = Decimal(1) / Decimal(7)

# math.sqrt converts its argument to a double, so its result carries only
# ~16 correct digits; Decimal.sqrt honours the context precision.
print("   math.sqrt: {0}".format(Decimal(math.sqrt(num))))
print("decimal.sqrt: {0}".format(num.sqrt()))

In Python 3.2.3, this outputs the first two lines below; the third line shows the true value for comparison:

   math.sqrt: 0.37796447300922719758631274089566431939601898193359375
decimal.sqrt: 0.377964473009227227214516536234
actual value: 0.3779644730092272272145165362341800608157513118689214

which, as stated, isn't exactly what you would expect, and you can see that the higher the precision, the less the results match. Note that the decimal module has more accuracy in this example, since its result more closely matches the actual value.
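
The decimal module itself can supply reference digits like the "actual value" line; just recompute at a higher precision (continuing the imports above):

getcontext().prec = 60
print((Decimal(1) / Decimal(7)).sqrt())   # sqrt(1/7) to 60 significant digits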

Python: arbitrary precision with floats

I don't think it's possible to approximate exp() with integers. If you use 3**n instead of 2.71828182845905**n, your calculations will be completely useless.

One possible solution would be to use Sympy. According to the documentation:

There is essentially no upper precision limit

>>> from sympy import *
>>> exp(9500)
exp(9500)
>>> exp(9500).evalf()
6.27448493490172e+4125

You can also specify the desired precision:

>>> exp(9500).evalf(1000)
6.274484934901720177929867046175406311474380389941415760684209191232450360090766458256588885184199320756050569665785657269735313171886975309933254563488343491718198237894473901620914303565550450204805537225888529509352754121292701357622411614860860409639719786022989336837263283678476008817556351031696366815467221836948040042378034720460820127399855873232167818091083005170669482845098735176209372328114732133251096196535355946589133977397512846130629857604295369747597459602137604440011394793443041829253598478244189078131130488653468669559814695095974271938947640276013215753183113041899037415404445478806695965167014404297848725756879184380559837391976534521522360723388582608454995349380217499779247330557664230806254642768796486899322646423713763772064068933790640394967085887914192401473425799354391464743910233873602389444180426155866237536459654917521713769608318128404177877383203786348495822099924812081683286880293701785567962687838594752986160305764297117036426951203418854463404773701882e+4125

With exp(9500).evalf(5000), you even get the integer closest to exp(9500).
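
Note that evalf(n) asks for n significant digits; sympy.N(expr, n) is an equivalent spelling, which may read more naturally (output shown to 30 digits):

>>> from sympy import exp, N
>>> N(exp(9500), 30)
6.27448493490172017792986704618e+4125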

Which programming languages have arbitrary precision floating-point literals?

Haskell has Rational, backed by arbitrary-precision Integer, and it has overloaded numeric literals, so most often your literals have exactly the type you want.

Floats vs rationals in arbitrary precision fractional arithmetic (C/C++)

The question is what you mean by the "arbitrary precision" you mention in the title. Does it mean "arbitrary, but pre-determined at compile-time and fixed at run-time"? Or does it mean "infinite, i.e. extendable at run-time to represent any rational number"?

In the former case (precision customizable at compile-time, but fixed afterwards) I'd say that one of the most efficient solutions would actually be fixed-point arithmetic (i.e. none of the two you mentioned).

Firstly, fixed-point arithmetic does not require any dedicated library for basic arithmetic operations. It is just a concept overlaid on integer arithmetic. This means that if you really need a lot of digits after the dot, you can take any big-integer library, multiply all your data by, say, 2^64, and you basically immediately get fixed-point arithmetic with 64 binary digits after the dot (at least as far as arithmetic operations are concerned, with some extra adjustments for multiplication and division). This is typically significantly more efficient than floating-point or rational representations.
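
As an illustration of that overlay idea, here is a minimal Python sketch (Python's built-in ints are already arbitrary precision, so they stand in for a big-integer library; the names and the 2^64 scale are just for this example):

SCALE_BITS = 64
SCALE = 1 << SCALE_BITS            # 64 binary digits after the point

def to_fixed(x):
    return round(x * SCALE)        # enter the fixed-point domain

def fixed_mul(a, b):
    return (a * b) >> SCALE_BITS   # product carries 2x the fractional bits; rescale

def fixed_div(a, b):
    return (a << SCALE_BITS) // b  # pre-shift the dividend to keep precision

# addition and subtraction need no adjustment at all
x, y = to_fixed(1.5), to_fixed(0.25)
print(fixed_mul(x, y) / SCALE)     # 0.375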

Note also that in many practical applications, multiplication operations are often accompanied by division operations (as in x = y * a / b) that "compensate" for each other, meaning that it is often unnecessary to perform any adjustments for such multiplications and divisions. This also contributes to the efficiency of fixed-point arithmetic.
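
Continuing the sketch above: with all three operands at the same scale, the scale factor introduced by the multiplication is exactly removed by the division, so no shift is needed:

def fixed_mul_div(y, a, b):
    # (y*S) * (a*S) / (b*S) == (y*a/b) * S: the scales cancel, no adjustment
    return y * a // b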

Secondly, fixed-point arithmetic provides uniform precision across the entire range. This is not true for either floating-point or rational representations, which in some applications could be a significant drawback for the latter two approaches (or a benefit, depending on what you need).

So, again, why are you considering only floating-point and rational representations? Is there something that prevents you from considering fixed-point representation?

How to deal with floating point number precision in JavaScript?

From the Floating-Point Guide:

What can I do to avoid this problem?

That depends on what kind of calculations you're doing.

  • If you really need your results to add up exactly, especially when you work with money: use a special decimal datatype.
  • If you just don't want to see all those extra decimal places: simply format your result rounded to a fixed number of decimal places when displaying it.
  • If you have no decimal datatype available, an alternative is to work with integers, e.g. do money calculations entirely in cents (see the sketch after this list). But this is more work and has some drawbacks.
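
The integer-cents idea is language-agnostic; here is a minimal Python sketch of it (the helper name cents is made up for this example):

def cents(amount: str) -> int:
    """Parse a decimal string like '0.10' into an exact integer count of cents."""
    dollars, _, frac = amount.partition(".")
    return int(dollars) * 100 + int((frac + "00")[:2])

total = cents("0.10") + cents("0.20")        # exactly 30, no rounding error
print(f"{total // 100}.{total % 100:02d}")   # 0.30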

Note that the first point only applies if you really need specific precise decimal behaviour. Most people don't need that; they're just irritated that their programs don't work correctly with numbers like 1/10, without realizing that they wouldn't even blink at the same error if it occurred with 1/3.

If the first point really applies to you, use BigDecimal for JavaScript, which is not elegant at all, but actually solves the problem rather than providing an imperfect workaround.

Floating Point Numbers

Basically, the computer's floating point calculations have rounding errors. So if you perform a calculation like 1. * 1000. / 1000., you can end up with something like 1.0000004. That is what the computer stores in memory. However, you probably don't want to see 1.0000004 as the result of that calculation. So when you print the result, the computer rounds it, and you get simply 1. But you have to know that this is not the real value in the computer's memory - it is only a comfortable visualization of your actual floating point number.
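
The classic 0.1 + 0.2 case makes the gap between the stored value and the displayed value visible (shown here in Python):

x = 0.1 + 0.2
print(x)           # 0.30000000000000004 - the value actually stored
print(f"{x:.2f}")  # 0.30 - rounding happens only when formatting for display
print(x == 0.3)    # False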

Arbitrary-precision arithmetic Explanation

It's all a matter of adequate storage and algorithms to treat numbers as smaller parts. Let's assume you have a compiler in which an int can only be 0 through 99 and you want to handle numbers up to 999999 (we'll only worry about positive numbers here to keep it simple).

You do that by giving each number three ints and using the same rules you (should have) learned back in primary school for addition, subtraction and the other basic operations.

In an arbitrary precision library, there's no fixed limit on the number of base types used to represent our numbers, just whatever memory can hold.

Addition, for example: 123456 + 78:

12 34 56
      78
-- -- --
12 35 34

Working from the least significant end:

  • initial carry = 0.
  • 56 + 78 + 0 carry = 134 = 34 with 1 carry
  • 34 + 00 + 1 carry = 35 = 35 with 0 carry
  • 12 + 00 + 0 carry = 12 = 12 with 0 carry

This is, in fact, how addition generally works at the bit level inside your CPU.
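
A minimal Python sketch of that base-100 ripple-carry addition (digits stored least significant first, a convention chosen just for this example):

def add_base100(a, b):
    """Add digit lists in base 100, least significant digit first:
    123456 -> [56, 34, 12]."""
    result, carry = [], 0
    for i in range(max(len(a), len(b))):
        s = (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) + carry
        result.append(s % 100)    # keep the low "digit"
        carry = s // 100          # propagate the carry
    if carry:
        result.append(carry)
    return result

print(add_base100([56, 34, 12], [78]))   # [34, 35, 12], i.e. 123534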

Subtraction is similar (using subtraction of the base type and borrow instead of carry), multiplication can be done with repeated additions (very slow) or cross-products (faster), and division is trickier but can be done by shifting and subtraction of the numbers involved (the long division you would have learned as a kid).

I've actually written libraries to do this sort of stuff, using the maximum power of ten that can fit into an integer when squared (to prevent overflow when multiplying two digits together - for example, a 16-bit int is limited to digits 0 through 99, since 99 squared is 9,801 (< 32,768), while a 32-bit int can use 0 through 9,999, since 9,999 squared is 99,980,001 (< 2,147,483,648)), which greatly eased the algorithms.
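
That base selection can be sketched as picking the largest power-of-ten base whose largest digit still squares safely (pick_base is a name invented for this illustration):

def pick_base(max_int):
    base = 10
    while (base * 10 - 1) ** 2 < max_int:   # would a wider digit still square safely?
        base *= 10
    return base                             # digits run 0 .. base-1

print(pick_base(2**15))   # 100   -> 16-bit digits 0..99,   99**2   = 9,801
print(pick_base(2**31))   # 10000 -> 32-bit digits 0..9999, 9999**2 = 99,980,001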

Some tricks to watch out for.

1/ When adding or multiplying numbers, pre-allocate the maximum space needed, then reduce later if you find it's too much. For example, adding two 100-"digit" numbers (where a "digit" is an int) will never give you more than 101 digits. Multiplying a 12-digit number by a 3-digit number will never generate more than 15 digits (add the digit counts).

2/ For added speed, normalise (reduce the storage required for) the numbers only if absolutely necessary - my library had this as a separate call so the user could decide between speed and storage concerns.

3/ Addition of a positive and a negative number is subtraction, and subtracting a negative number is the same as adding the equivalent positive. You can save quite a bit of code by having the add and subtract methods call each other after adjusting signs.

4/ Avoid subtracting big numbers from small ones since you invariably end up with numbers like:

00 00 00 10
00 00 00 11 -
-----------
99 99 99 99   (and you still have a borrow)

Instead, subtract 10 from 11, then negate it:

11
10 -
--
 1   (then negate to get -1)

Here are the comments (turned into text) from one of the libraries I had to do this for. The code itself is, unfortunately, copyrighted, but you may be able to pick out enough information to handle the four basic operations. Assume in the following that -a and -b represent negative numbers and a and b are zero or positive numbers.

For addition, if signs are different, use subtraction of the negation:

-a +  b  becomes  b - a
 a + -b  becomes  a - b

For subtraction, if signs are different, use addition of the negation:

 a - -b  becomes    a + b
-a -  b  becomes  -(a + b)

There is also special handling to ensure we're subtracting small numbers from large ones:

small - big becomes -(big - small)
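
Those rules translate into a small sign/magnitude dispatch; here is a Python sketch where plain non-negative ints stand in for the magnitudes (add_mag and sub_mag are hypothetical stand-ins for the library's digit-level routines):

def add_mag(a, b): return a + b      # magnitude addition
def sub_mag(a, b): return a - b      # magnitude subtraction; requires a >= b

def signed_add(sa, a, sb, b):
    """sa, sb are +1 or -1; a, b are non-negative magnitudes."""
    if sa == sb:
        return sa, add_mag(a, b)     # same sign: add magnitudes, keep the sign
    if a >= b:
        return sa, sub_mag(a, b)     # -a + b -> -(a - b);  a + -b -> a - b
    return sb, sub_mag(b, a)         # small - big becomes -(big - small)

def signed_sub(sa, a, sb, b):
    return signed_add(sa, a, -sb, b) # a - b is a + (-b)

print(signed_add(+1, 10, -1, 11))    # (-1, 1): 10 + (-11) = -1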

Multiplication uses entry-level math as follows:

475(a) x 32(b) = 475 x (30 + 2)
               = 475 x 30 + 475 x 2
               = 4750 x 3 + 475 x 2
               = 4750 + 4750 + 4750 + 475 + 475

The way in which this is achieved involves extracting each of the digits of 32 one at a time (backwards) then using add to calculate a value to be added to the result (initially zero).

ShiftLeft and ShiftRight operations are used to quickly multiply or divide a LongInt by the wrap value (10 for "real" math). In the example above, we add 475 to zero 2 times (the last digit of 32) to get 950 (result = 0 + 950 = 950).

Then we left shift 475 to get 4750 and right shift 32 to get 3. Add 4750 to zero 3 times to get 14250, then add that to the result of 950 to get 15200.

Left shift 4750 to get 47500, right shift 3 to get 0. Since the right shifted 32 is now zero, we're finished and, in fact 475 x 32 does equal 15200.
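
A runnable Python sketch of that shift-and-add loop (plain ints as magnitudes, so a "shift" is just multiplying or dividing by the base):

def long_multiply(a, b, base=10):
    result = 0
    while b > 0:
        for _ in range(b % base):   # add `a` once per unit of b's low digit
            result += a
        a *= base                   # shift left:  475 -> 4750 -> 47500
        b //= base                  # shift right:  32 -> 3    -> 0
    return result

print(long_multiply(475, 32))       # 15200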

Division is also tricky but based on early arithmetic (the "gazinta" method for "goes into"). Consider the following long division for 12345 / 27:

       457
   +-------
27 | 12345     27 is larger than 1 or 12, so we first use 123.
     108       27 goes into 123 4 times, 4 x 27 = 108, 123 - 108 = 15.
     ---
      154      Bring down 4.
      135      27 goes into 154 5 times, 5 x 27 = 135, 154 - 135 = 19.
      ---
       195     Bring down 5.
       189     27 goes into 195 7 times, 7 x 27 = 189, 195 - 189 = 6.
       ---
         6     Nothing more to bring down, so stop.

Therefore 12345 / 27 is 457 with remainder 6. Verify:

  457 x 27 + 6
= 12339 + 6
= 12345

This is implemented by using a draw-down variable (initially zero) to bring down the segments of 12345, one at a time, until it's greater than or equal to 27.

Then we simply subtract 27 from that until we get below 27 - the number of subtractions becomes the next digit added to the top line.

When there are no more segments to bring down, we have our result.
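
A Python sketch of that draw-down loop (digits handled most significant first; repeated subtraction plays the "gazinta" role):

def long_divide(n, d):
    quotient_digits, draw_down = [], 0
    for digit in (int(c) for c in str(n)):
        draw_down = draw_down * 10 + digit   # bring down the next segment
        count = 0
        while draw_down >= d:                # subtract until below the divisor
            draw_down -= d
            count += 1
        quotient_digits.append(count)        # that count is the next top-line digit
    return int("".join(map(str, quotient_digits))), draw_down

print(long_divide(12345, 27))                # (457, 6)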


Keep in mind these are pretty basic algorithms. There are far better ways to do complex arithmetic if your numbers are going to be particularly large. You can look into something like the GNU Multiple Precision Arithmetic Library (GMP) - it's substantially better and faster than my own libraries.

It does have the rather unfortunate misfeature of simply exiting if it runs out of memory (a rather fatal flaw for a general-purpose library, in my opinion) but, if you can look past that, it's pretty good at what it does.

If you cannot use it for licensing reasons (or because you don't want your application just exiting for no apparent reason), you could at least get the algorithms from there for integrating into your own code.

I've also found that the bods over at MPIR (a fork of GMP) are more amenable to discussions on potential changes - they seem a more developer-friendly bunch.


