How Computers Do Floating-Point Arithmetic

How does a computer do floating-point arithmetic?

The problem is that the floating point format represents fractions in base 2.

The first fraction bit is ½, the second ¼, and it continues as 1/2^n.

The problem with that is that not every rational number (a number that can be expressed as the ratio of two integers) has a finite representation in this base-2 format.

(This makes the floating-point format difficult to use for monetary values. Although these values are always rational numbers (n/100), only .00, .25, .50, and .75 actually have exact representations in any number of digits of a base-two fraction.)

Anyway, when you add such values back together, the system eventually gets a chance to round the result to a number that it can represent exactly.

At some point, it finds itself adding the .666... number to the .333... one, like so:

  00111110 1  .o10101010 10101010 10101011
+ 00111111 0  .10101010 10101010 10101011o
-------------------------------------------
  00111111 1 (1).0000000 00000000 0000000x   # the x isn't in the final result

The leftmost bit is the sign, the next eight are the exponent, and the remaining bits are the fraction. In between the exponent and the fraction is an assumed "1" that is always present, and therefore not actually stored, as the normalized leftmost fraction bit. Zeroes that take part in the operation but aren't actually stored are written as o.

A lot has happened here. At each step, the FPU has taken rather heroic measures to round the result. Two extra digits of precision (beyond what will fit in the result) have been kept, and the FPU knows in many cases whether any, or at least one, of the remaining rightmost bits were one. If so, that part of the fraction is more than 0.5 (scaled), and so it rounds up. The intermediate rounded values allow the FPU to carry the rightmost bit all the way over to the integer part and finally round to the correct answer.

This didn't happen because anyone added 0.5; the FPU just did the best it could within the limitations of the format. Floating point is not, actually, inaccurate. It's perfectly accurate, but most of the numbers we expect to see in our base-10, rational-number world-view are not representable by the base-2 fraction of the format. In fact, very few are.
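For instance, here is a minimal C sketch of the effect described above; it assumes IEEE-754 single precision with round-to-nearest, which is what float gives you on virtually all current hardware:

    #include <stdio.h>

    int main(void) {
        float third      = 1.0f / 3.0f;  /* stored as roughly 0.33333334 */
        float two_thirds = 2.0f / 3.0f;  /* stored as roughly 0.66666669 */

        /* The exact sum of the two stored values is slightly more than 1,
           but the FPU rounds the result to the nearest representable
           float, which happens to be exactly 1.0f. */
        float sum = third + two_thirds;

        printf("%.8f\n", sum);        /* prints 1.00000000 */
        printf("%d\n", sum == 1.0f);  /* prints 1 */
        return 0;
    }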

How do computers store floating-point numbers?

There are multiple levels of representation of floating-point numbers. Here is information about the two levels people are most often interested in.

A floating-point number is ±significand • base^exponent with a fixed base and certain requirements on significand and exponent. The exponent is an integer within a range defined by the format, and the significand is a number representable by a numeral using some number of digits in that base, where the number of digits is defined by the format.

There may be variations on this basic format. For example, the significand may include a radix point (the generalization of a decimal point), so that a format might have significands that are all integers (137., 954., and so on) or that have the radix point at some other fixed location (often just after the first digit, so 1.37, 9.54, and so on). These variations are mathematically equivalent, with the exponent range adjusted to compensate.

Thus, +1.23456•10^13 is a decimal floating-point number with six decimal digits. The point “floats” because we multiply by a power of the base that effectively moves the radix point.

At this level of representation, a floating-point format may include some special values, notably +∞, −∞, and NaNs (Not a Number “values” that indicate no number is represented).

The other level of most interest is encoding a floating-point number into bit strings. With IEEE-754 double precision, it is done this way (a code sketch that extracts the resulting fields follows the list):

  • Write the number in the format ±significand•2^exponent, where significand is represented as a 53-bit binary numeral with the radix point after the first digit and the exponent is in [−1022, +1023]. If the exponent is not −1022, the first digit of significand must be 1. (If it is not, subtract one from the exponent and shift the significand bits left one position, until either the exponent is −1022 or the first digit of significand is 1. Any number that cannot be put into this format is not representable in IEEE-754 double precision.)
  • For +, write “0”. For “−”, write “1”.
  • If the first digit of significand is zero, write “00000000000”. This is a special exponent code that represents subnormal numbers, meaning numbers that cannot be shifted to have a first bit of 1. If the first digit is not zero, add 1023 to the exponent, convert it to an eleven-digit binary numeral, and write that numeral. For example, the exponent −3 is biased to become 1020, whose eleven-digit binary numeral is “01111111100”, so “01111111100” is written.
  • Write the 52 bits of the significand after the first digit.
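As a rough illustration, here is a small C sketch that pulls those three fields back out of a double. It assumes the C implementation uses IEEE-754 binary64 for double, which is nearly universal:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        double d = -3.0;  /* example value */
        uint64_t bits;
        memcpy(&bits, &d, sizeof bits);  /* reinterpret the 64 bits */

        uint64_t sign     = bits >> 63;                 /* 1 bit */
        uint64_t exponent = (bits >> 52) & 0x7FF;       /* 11 bits, biased by 1023 for normal numbers */
        uint64_t fraction = bits & 0xFFFFFFFFFFFFFULL;  /* the 52 bits after the leading 1 */

        /* For -3.0 this prints: sign=1 exponent=1024 (unbiased 1) fraction=0x8000000000000,
           i.e. -1.1000...₂ • 2^1 = -3. */
        printf("sign=%llu exponent=%llu (unbiased %lld) fraction=0x%013llx\n",
               (unsigned long long)sign,
               (unsigned long long)exponent,
               (long long)exponent - 1023,
               (unsigned long long)fraction);
        return 0;
    }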

But how do real computers store and work with it internally?

Once we have encoded floating-point numbers as above, computers work with the parts. The significands behave largely like integers, and the exponents tell us about shifting them.

When two numbers of the same sign are added, their exponents are compared. The significand of one number is shifted to adjust its position relative to the other according to the difference in exponents. Then the two significands are added, and the result is adjusted (rounded if necessary) to fit into the floating-point format. When two numbers are multiplied, their significands are multiplied, and their exponents are added together. Subtraction, division, and other operations proceed in the same way, by operating on the parts of the representations.
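Here is a toy C sketch of that alignment idea for addition, written with the standard frexp and ldexp functions instead of bit manipulation. It illustrates the algorithm only (and assumes a's exponent is the larger one); real hardware works on the raw significand and exponent fields directly:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double a = 6.0, b = 0.75;
        int ea, eb;
        double sa = frexp(a, &ea);  /* a = sa * 2^ea, with sa in [0.5, 1) */
        double sb = frexp(b, &eb);  /* b = sb * 2^eb */

        /* Align: shift the smaller operand's significand right by the exponent gap. */
        double sb_aligned = ldexp(sb, -(ea - eb));

        /* Add the aligned significands, then restore the common scale. */
        double sum = ldexp(sa + sb_aligned, ea);
        printf("%g + %g = %g\n", a, b, sum);  /* prints 6 + 0.75 = 6.75 */
        return 0;
    }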

There are various complications, such as having to deal with bounds on the exponents, needing to shift significands back to the normal form (leading digit not zero) after arithmetic, and so on.

Is floating point math broken?

Binary floating point math is like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be exactly represented.

For 0.1 in the standard binary64 format, the representation can be written exactly as

  • 0.1000000000000000055511151231257827021181583404541015625 in decimal, or
  • 0x1.999999999999ap-4 in C99 hexfloat notation.

In contrast, the rational number 0.1, which is 1/10, can be written exactly as

  • 0.1 in decimal, or
  • 0x1.99999999999999...p-4 in an analogue of C99 hexfloat notation, where the ... represents an unending sequence of 9's.

The constants 0.2 and 0.3 in your program will also be approximations to their true values. It happens that the closest double to 0.2 is larger than the rational number 0.2 but that the closest double to 0.3 is smaller than the rational number 0.3. The sum of 0.1 and 0.2 winds up being larger than the rational number 0.3 and hence disagreeing with the constant in your code.
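You can see this directly with a minimal C sketch (assuming double is IEEE-754 binary64):

    #include <stdio.h>

    int main(void) {
        double sum = 0.1 + 0.2;

        printf("%.17g\n", sum);      /* prints 0.30000000000000004 */
        printf("%d\n", sum == 0.3);  /* prints 0: the sum is not the double nearest 0.3 */
        return 0;
    }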

A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For an easier-to-digest explanation, see floating-point-gui.de.

Side Note: All positional (base-N) number systems share this problem with precision

Plain old decimal (base 10) numbers have the same issues, which is why numbers like 1/3 end up as 0.333333333...

You've just stumbled on a number (3/10) that happens to be easy to represent with the decimal system, but doesn't fit the binary system. It goes both ways (to some small degree) as well: 1/16 is an ugly number in decimal (0.0625), but in binary it looks as neat as a 10,000th does in decimal (0.0001) - if we were in the habit of using a base-2 number system in our daily lives, you'd even look at that number and instinctively understand you could arrive there by halving something, halving it again, and again and again.

Of course, that's not exactly how floating-point numbers are stored in memory (they use a form of scientific notation). However, it does illustrate the point that binary floating-point precision errors tend to crop up because the "real world" numbers we are usually interested in working with are so often powers of ten - but only because we use a decimal number system day-to-day. This is also why we'll say things like 71% instead of "5 out of every 7" (71% is an approximation, since 5/7 can't be represented exactly with any decimal number).

So no: binary floating point numbers are not broken, they just happen to be as imperfect as every other base-N number system :)

Side Side Note: Working with Floats in Programming

In practice, this problem of precision means you need to use rounding functions to round your floating point numbers off to however many decimal places you're interested in before you display them.
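In C, for example, that can be as simple as formatting with a fixed precision, which rounds as it converts the value to text (a minimal sketch):

    #include <stdio.h>

    int main(void) {
        double total = 0.1 + 0.2;         /* actually 0.30000000000000004... */
        printf("total = %.2f\n", total);  /* prints total = 0.30 */
        return 0;
    }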

You also need to replace equality tests with comparisons that allow some amount of tolerance, which means:

Do not do if (x == y) { ... }

Instead do if (abs(x - y) < myToleranceValue) { ... }.

where abs is the absolute value. myToleranceValue needs to be chosen for your particular application - and it will have a lot to do with how much "wiggle room" you are prepared to allow, and what the largest number you are going to be comparing may be (due to loss of precision issues). Beware of "epsilon" style constants in your language of choice. These are not to be used as tolerance values.
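As a concrete sketch in C (the relative tolerance used here, and the helper name nearly_equal, are illustrative choices, not a standard API):

    #include <stdio.h>
    #include <math.h>
    #include <stdbool.h>

    /* Compare doubles with a tolerance scaled to the operands' magnitude.
       The rel_tol value is application-specific; 1e-9 below is only an example. */
    bool nearly_equal(double x, double y, double rel_tol) {
        double diff  = fabs(x - y);
        double scale = fmax(fabs(x), fabs(y));
        return diff <= rel_tol * scale;
    }

    int main(void) {
        double sum = 0.1 + 0.2;
        printf("%d\n", sum == 0.3);                    /* prints 0 */
        printf("%d\n", nearly_equal(sum, 0.3, 1e-9));  /* prints 1 */
        return 0;
    }

Scaling the tolerance by the larger operand matters because the absolute spacing between adjacent doubles grows with magnitude; a fixed absolute tolerance that works near 1.0 is useless near 1e12.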

Does the computer round the numbers in an operation first, or round the result?

Generally, the computer will convert the numerals in 9.4 - 9.0 - 0.4 to numbers in an internal form, and then it will perform the arithmetic operations. These conversions generally round their results.

Consider the text in source code 9.4 - 9.0 - 0.4. Nothing in there is a number. That text is a string composed of characters. It contains the characters “9”, ”.”, “4”, “ ”, “-”, and so on. Generally, a computer converts this text to other forms for processing. You could write software that works with numbers in a text format, but this is rare. Generally, when we are using a programming language, either compiled or interpreted, the numerals in this text will be converted to some internal form. (A “numeral” is a sequence of symbols representing a number. So “9.4” is a numeral representing 9.4.)

IEEE-754 binary64 is a very common floating-point format. In this format, each representable number is expressed in units of some power of two. For example, the numbers .125, .250, .375, and .500 are representable because they are multiples of 1/8, which is 2^−3. However, 9.4 is not a multiple of any power of two, so it cannot be represented in IEEE-754 binary64.

When 9.4 is converted to binary64, the nearest representable value is 9.4000000000000003552713678800500929355621337890625. (This is a multiple of 2^−49, which is the power of two used when representing numbers near 9.4, specifically numbers from 8 [inclusive] to 16 [exclusive].)

9 is representable in binary64, so 9 is converted to 9.

0.4 is not representable in binary64. When 0.4 is converted to binary64, the nearest representable value is 0.40000000000000002220446049250313080847263336181640625. This is a multiple of 2^−54, which is the power of two used for numbers from ¼ to ½.

In 9.4 - 9.0 - 0.4, the result of the first subtraction is 0.4000000000000003552713678800500929355621337890625. This is exactly representable, so there is no rounding at this point. Then, when 0.4 is subtracted, after it has been converted to the value above, the result is 0.00000000000000033306690738754696212708950042724609375. This is also exactly representable, so there is again no rounding at this point.
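A short C sketch that reproduces these numbers (assuming double is IEEE-754 binary64 and the implementation does not use wider intermediate precision):

    #include <stdio.h>

    int main(void) {
        printf("%.17g\n", 9.4 - 9.0);        /* prints 0.40000000000000036 */
        printf("%.17g\n", 9.4 - 9.0 - 0.4);  /* prints 3.3306690738754696e-16 */
        return 0;
    }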

The above describes what happens if binary64 is used throughout. Many programming languages, or specific implementations of them, use binary64. Some may use other formats. Some languages permit implementations to use a mix of formats—they may use a wider format than binary64 for doing calculations and convert to binary64 for the final result. This can cause you to see different results than the above.

So the answer to your question is that, with floating-point arithmetic, each operation produces a result that is equal to the number you would get by computing the exact real-number result and then rounding that real-number result to the nearest value representable in the floating-point format. (Rounding is most often done by rounding to the nearest representable value, with ties resolved by one of several methods, but other rounding choices are possible, such as rounding down.)

The operations generally do not round their operands. (There are exceptions, such as that some processors may convert subnormal inputs to zero.) However, those operands must be produced first, such as by converting source text to a representable number. Those conversions are separate operations from the subtraction or other operations that follow.

How does floating-point arithmetic work when one is added to a big number?

Step by step:

IEEE-754 32-bit binary floating-point format:


sign        1 bit
exponent    8 bits
significand 24 bits (23 stored explicitly, after the implicit leading 1)

I) float a = 23400000000.f;

Convert 23400000000.f to float:


23,400,000,000 = 101 0111 0010 1011 1111 1010 1010 0000 0000₂
               = 1.0101110010101111111010101000000000₂ • 2^34.

But the significand can store only 23 bits after the point. So we must round:


1.01011100101011111110101 01000000000₂ • 2^34
≈ 1.01011100101011111110101₂ • 2^34

(The discarded bits, 01000000000₂, amount to less than half of the lowest kept bit, so the value rounds down.)

So, after:

float a = 23400000000.f;

a is equal to 23,399,999,488.

II) float b = a + 1;


a = 10101110010101111111010100000000000₂.
b = 10101110010101111111010100000000001₂
  = 1.0101110010101111111010100000000001₂ • 2^34.

But, again, the significand can store only 23 binary digits after the point. So we must round:


1.01011100101011111110101 00000000001₂ • 2^34
≈ 1.01011100101011111110101₂ • 2^34

So, after:

float b = a + 1;

b is equal to 23,399,999,488.

III) float c = b - a;


10101110010101111111010100000000000₂ - 10101110010101111111010100000000000₂ = 0

This value can be stored in a float without rounding.

So, after:

float c = b - a;

c is equal to 0.
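The whole walkthrough can be reproduced with a few lines of C (assuming float is IEEE-754 single precision):

    #include <stdio.h>

    int main(void) {
        float a = 23400000000.f;  /* rounds to 23,399,999,488 */
        float b = a + 1;          /* the added 1 is lost when the sum is rounded, so b == a */
        float c = b - a;

        printf("a = %.1f\n", a);  /* prints a = 23399999488.0 */
        printf("b = %.1f\n", b);  /* prints b = 23399999488.0 */
        printf("c = %.1f\n", c);  /* prints c = 0.0 */
        return 0;
    }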

How do floating-point numbers get stored in memory?

As far as I know, floating-point numbers (for single precision) are stored in memory as follows:

  • sign s (denoting whether it's positive or negative) - 1 bit
  • exponent e - 8 bits
  • mantissa m (essentially the digits of your number) - 24 bits, of which 23 are explicitly stored

For example:

3.14159 would be represented like this:

0 10000000 10010010000111111010000
^     ^               ^
|     |               |
|     |               +--- mantissa bits = 10010010000111111010000;
|     |                    with the implicit leading 1, the significand
|     |                    is 1.10010010000111111010000₂ ≈ 1.5707950
|     |
|     +------------------- exponent field = 128, meaning 2^(128−127) = 2^1
|
+------------------------- sign = 0 (positive)

so the stored value is 1.10010010000111111010000₂ • 2^1 ≈ 3.14159.

Do note that the radix point is not stored in memory at all; it is implied by the format.
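To check the layout yourself, here is a small C sketch that prints the three fields of a float (it assumes float is IEEE-754 single precision):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        float f = 3.14159f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);  /* reinterpret the 32 bits */

        /* For 3.14159f this prints: sign=0 exponent=128 (unbiased 1) mantissa=0x490fd0 */
        printf("sign=%u exponent=%u (unbiased %d) mantissa=0x%06x\n",
               (unsigned)(bits >> 31),
               (unsigned)((bits >> 23) & 0xFF),
               (int)((bits >> 23) & 0xFF) - 127,
               (unsigned)(bits & 0x7FFFFF));
        return 0;
    }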

As a good reference, read What Every Computer Scientist Should Know About Floating-Point Arithmetic and Floating Point.

Why are floating point bases even?

The author may have made an overstatement, but given that bases 2 and 10 (binary and decimal) are the most common in use, it is at worst a philosophic faux pas. Per http://www.eecs.berkeley.edu/~wkahan/ieee754status/why-ieee.pdf: "Almost every machine that provides floating-point arithmetic does so in binary (radix 2), octal (8), decimal (10) or hexadecimal (16). Biological and historical accidents make 10 the preferred radix for machines whose arithmetic will be exposed to frequent scrutiny by humans. Otherwise binary is best. Radices bigger than 2 may offer a minuscule speed advantage during normalization because the leading few significant bits can sometimes remain zeros, but this advantage is more than offset by penalties in the range/precision trade-off and by 'wobbling precision'".

Odd bases do happen - though in computer technology, not so much. See http://mentalfloss.com/article/31879/12-mind-blowing-number-systems-other-languages, for instance. More food for thought at http://www.math.wichita.edu/history/topics/num-sys.html.

As a side note, an odd base would make halves, quarters, eighths, etc., impossible to represent exactly, and would also make trouble for tenths and hundredths.


The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is for binary and decimal formats only: http://en.wikipedia.org/wiki/IEEE_floating_point. Prior formats that were used over the years are listed at http://www.mrob.com/pub/math/floatformats.html.

Binary, Floats, and Modern Computers

Computers are built on transistors, which have a "switched on" state, and a "switched off" state. This corresponds to high and low voltage. Pretty much all digital integrated circuits work in this binary fashion.

Ignoring the fact that transistors just simply work this way, using a different base (e.g. base 3) would require these circuits to operate at an intermediate voltage state (or several) as well as 0V and their highest operating voltage. This is more complicated, and can result in problems at high frequencies - how can you tell whether a signal is just transitioning between 2V and 0V, or actually at 1V?

When we get down to the floating point level, we are (as nhahtdh mentioned in their answer) mapping an infinite space of numbers down to a finite storage space. It's an absolute guarantee that we'll lose some precision. One advantage of IEEE floats, though, is that the precision is relative to the magnitude of the value.

Update: You should also check out Tunguska, a ternary computer emulator. It uses base-3 instead of base-2, which makes for some interesting (albeit mind-bending) concepts.


