How to Perform a Bitwise Operation on Floating Point Numbers

How to perform a bitwise operation on floating point numbers

At the language level, there's no such thing as a "bitwise operation on floating-point numbers". Bitwise operations in C/C++ work on the value representation of a number, and the value representation of floating-point numbers is not defined in C/C++. (Unsigned integers are the exception in this regard: their value representation is required to be pure binary, which is why shifts on them have fully defined behavior.) Floating-point numbers don't have bits at the level of the value representation, which is why you can't apply bitwise operations to them.

All you can do is analyze the bit content of the raw memory occupied by the floating-point number. For that you need to either use a union as suggested below or (equivalently, and only in C++) reinterpret the floating-point object as an array of unsigned char objects, as in

float f = 5;
unsigned char *c = reinterpret_cast<unsigned char *>(&f);
// inspect memory from c[0] to c[sizeof f - 1]

And please, don't try to reinterpret a float object as an int object, as other answers suggest. That doesn't make much sense, and it isn't guaranteed to work under compilers that exploit the strict-aliasing rule during optimization. The correct way to inspect memory content in C++ is to reinterpret it as an array of [signed/unsigned] char.
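
Alternatively, you can copy the object representation into a fixed-width unsigned integer with memcpy, which sidesteps the aliasing problem entirely. A minimal sketch, assuming float is 32 bits wide and IEEE-754 encoded:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float f = 5.0f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);      /* copy the object representation */
    printf("0x%08X\n", (unsigned)bits);  /* prints 0x40A00000 under IEEE-754 */
    return 0;
}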

Also note that you technically aren't guaranteed that the floating-point representation on your system is IEEE-754 (although in practice it is, unless you explicitly allow it not to be, and even then the deviations usually only concern -0.0, ±infinity and NaN).
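
If you want the build to fail when those assumptions don't hold, C provides the __STDC_IEC_559__ feature macro (C++ has std::numeric_limits<float>::is_iec559). A small compile-time check, assuming C11:

#include <stdint.h>
#include <assert.h>

#ifndef __STDC_IEC_559__
#error "this implementation does not promise IEC 60559 (IEEE-754) arithmetic"
#endif

/* make sure float has the width the bit-twiddling below assumes */
static_assert(sizeof(float) == sizeof(uint32_t), "float is not 32 bits");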

Bitwise operation on floating point numbers (for graphics)?

By the time the "graphics data" hits the screen, none of it is floating point. Bitwise operations are really defined on bit strings; they only make sense on numbers because of a consistent binary encoding scheme. Trying to do any kind of logical bitwise operation on floats, other than extracting the exponent or the mantissa, is a road to hell.

Basically, you probably don't want to do this. Why do you think you do?

Is it possible to perform bit operations on a float in Java?

No. You cannot bit-shift a float in Java; the shift operators are defined only for integral types.

The operation you describe simply flips bit 31 (counting from zero). You can do that in Java just fine:

int test(int uf) {
    return uf ^ (1 << 31);
}

That's your function. The problem you have isn't flipping that bit, it's knowing that if the int comes out negative, the unsigned quantity it stands for is actually 2^32 (0x100000000) greater than the signed value. So long as you are consistent about treating the int as a sequence of 32 bits, rather than checking its value at inopportune times, you'll do fine.


If you want 32 bits that behave as a proper unsigned number, use a long:

long test(int uf) {
    return (uf & 0x00000000ffffffffL) ^ (1L << 31);
}

Then you can use it properly in comparisons, and simply cast it to int when you need an int. If we're converting the unsigned value to another type anyway, that is a far superior answer to using floats.


Answer 2: Bit operations are certainly worth using in Java, as they are really fast. Quite often in performance-critical code it's best to convert over to them at the critical points, to the point of using only primitives, or even natively compiled code.

Floating point representation (using bitwise operators)

1) The use of the union, why?

The bit operators are available for integral types only. You cannot simply cast the floating-point number to an integer, because that would convert the value, not the representation. But a union overlays the memory of its members. So writing into the floating-point member and then reading the integral member returns an integral view of the floating-point number's bits. To be clear: this is not the integral value of the floating-point number, and using it as an integral number in calculations will give unexpected results. But you can access the bits of the integral number as if they were the bits of the floating-point number.
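
The union from the question isn't shown here, but it presumably looks something like this minimal sketch (type punning through a union is sanctioned in C; in C++, prefer memcpy):

#include <stdio.h>
#include <stdint.h>

union float_bits {
    float f;     /* write this member... */
    uint32_t u;  /* ...then read the same bytes back as an integer */
};

int main(void)
{
    union float_bits fb;
    fb.f = 5.0f;
    printf("0x%08X\n", (unsigned)fb.u);  /* 0x40A00000 under IEEE-754 */
    return 0;
}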

2) MANTISSA_MASK and EXPONENET_MASK, what are they for?

Floating-point numbers are represented by a number of bits specifying the mantissa (the digit string) and by an exponent part giving the "location" of the digits. After the "conversion" of the floating-point number into an integral type, these two parts are mixed together in the integral value. MANTISSA_MASK and EXPONENT_MASK (you have a typo in your Q) mask out those parts. MANTISSA_BITS is the shift amount that moves the exponent into place.
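
The definitions from the question aren't shown, but for 32-bit IEEE-754 single precision they presumably look something like:

#define MANTISSA_BITS 23           /* number of fraction bits */
#define MANTISSA_MASK 0x007FFFFFu  /* the low 23 bits: the mantissa */
#define EXPONENT_MASK 0x000000FFu  /* 8 exponent bits, applied after the shift */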

3) The use of & in here:

It is the bitwise AND operator; it masks out the unwanted bits.

Let's walk through a (completely made-up) example:

From your code, you have 23 bits of mantissa and 8 bits of exponent; the one remaining bit of the 32 is reserved for the sign. Let's take a number:

00010001000010011010011010101010

Having 1 sign bit, 8 exponent bits and 23 mantissa bits you can read it like this

0 00100010 00010011010011010101010
s exponent --------mantissa-------

To get the mantissa you use a mask that only has the mantissa bits set:

0 00000000 11111111111111111111111

When you bit-and it, only bits that are 1 in both operands are 1, every other bit is 0:

0 00100010 00010011010011010101010 A
0 00000000 11111111111111111111111 B
- -------- -----------------------
0 00000000 00010011010011010101010 A&B

The mantissa is isolated from the exponent (and is now a real integer value representing the mantissa).

To get the exponent, you first shift right the whole word so that the exponent starts from bit 0 (right most):

0 00100010 00010011010011010101010 
00000000000000000000000 0 00100010 >> 23 (mantissa bits)

To isolate the exponent from the sign bit, you have to bit-and it again:

00000000000000000000000 0 00100010 A
00000000000000000000000 0 11111111 B
------------------------------------
00000000000000000000000 0 00100010 A&B

Et voilà.
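
Putting the whole walkthrough into code, here is a sketch in C, using the union trick and the assumed mask values from above:

#include <stdio.h>
#include <stdint.h>

#define MANTISSA_BITS 23
#define MANTISSA_MASK 0x007FFFFFu
#define EXPONENT_MASK 0x000000FFu

int main(void)
{
    union { float f; uint32_t u; } fb;
    fb.f = 3.14f;

    uint32_t mantissa = fb.u & MANTISSA_MASK;                    /* isolate the fraction */
    uint32_t exponent = (fb.u >> MANTISSA_BITS) & EXPONENT_MASK; /* biased exponent */
    uint32_t sign     = fb.u >> 31;                              /* top bit */

    printf("sign=%u exponent=%u mantissa=0x%06X\n",
           (unsigned)sign, (unsigned)exponent, (unsigned)mantissa);
    return 0;
}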

Bitwise operation on a floating point usefulness

A lot. For example, when only a floating-point instruction set such as AVX is available (AVX has 256-bit bitwise operations on floating-point registers, while the matching 256-bit integer operations only arrived with AVX2), these bitwise instructions become very handy.

Another application: making constants. You can see a lot of examples in table 13.10 and 13.11 in Agner Fog's optimization guide for x86 platforms. Some examples:

pcmpeqd xmm0, xmm0
psrld xmm0, 30 ; 3 (32-bit)

pcmpeqd xmm0, xmm0 ; -1

pcmpeqw xmm0, xmm0 ; 1.5f
pslld xmm0, 24
psrld xmm0, 2

pcmpeqw xmm0, xmm0 ; -2.0f
pslld xmm0, 30

You can also use bitwise operations to check whether a floating-point value is a power of 2, as sketched below.
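
A sketch of that check in C, assuming 32-bit IEEE-754 single precision: a positive normal float is a power of two exactly when its fraction bits are all zero.

#include <stdint.h>
#include <string.h>

int is_power_of_two(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* bit view of the float */

    uint32_t sign     = bits >> 31;
    uint32_t exponent = (bits >> 23) & 0xFF;
    uint32_t mantissa = bits & 0x7FFFFF;

    /* exclude negatives, zero/denormals (exponent 0) and inf/NaN (exponent 0xFF) */
    return sign == 0 && mantissa == 0 && exponent != 0 && exponent != 0xFF;
}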

Some other applications, as Harold said: taking the absolute value (and the negative absolute value), copying the sign, muxing... I'll demonstrate on a single scalar value for easier understanding:

// Absolute:
abs = x & ~(1U << 31);
// Muxing
v = (x & mask) | (y & ~mask); // v = mask ? x : y; with mask = 0 or -1
// Copy sign
y = (y & ~(1U << 31)) | (x & (1U << 31));
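
On SIMD registers the same tricks go through the floating-point bitwise instructions (andps, andnps, orps). A sketch with SSE intrinsics; the helper names here are made up, and the 256-bit AVX forms are analogous:

#include <immintrin.h>

/* clear the sign bit of every lane: |x| */
static __m128 vec_abs(__m128 x)
{
    const __m128 sign = _mm_set1_ps(-0.0f);  /* 0x80000000 in each lane */
    return _mm_andnot_ps(sign, x);           /* ~sign & x */
}

/* combine the magnitude of mag with the sign of sgn */
static __m128 vec_copysign(__m128 mag, __m128 sgn)
{
    const __m128 sign = _mm_set1_ps(-0.0f);
    return _mm_or_ps(_mm_andnot_ps(sign, mag), _mm_and_ps(sign, sgn));
}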

Preserving the floating point & addition of a bitwise operation in javascript

It doesn't work because the code assumes that floating-point numbers are represented as integers, which they aren't. Floating-point numbers are represented using the IEEE 754 standard, which breaks a number into three parts: a sign bit, a group of bits representing an exponent, and another group representing a number between 1 (inclusive) and 2 (exclusive), the mantissa. The value is calculated as

(sign is set ? -1 : 1) * mantissa * 2^(exponent - bias)

where the bias depends on the precision of the floating-point number. So the algorithm you use for adding two numbers assumes that the bits represent an integer, which is not the case for floating-point numbers. Operations such as bitwise AND and bitwise OR also don't give the results that you'd expect in an integer world.

Some examples: in double precision, the number 2.3 is represented as (in hex) 4002666666666666, while the number 5.3 is represented as 4015333333333333. OR-ing those two numbers gives 4017777777777777, which represents (roughly) 5.866666.
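
A quick C sketch to verify those bit patterns (JavaScript itself would need a Float64Array or DataView to see them):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    double a = 2.3, b = 5.3, r;
    uint64_t ua, ub, ur;

    memcpy(&ua, &a, sizeof ua);  /* 0x4002666666666666 */
    memcpy(&ub, &b, sizeof ub);  /* 0x4015333333333333 */
    ur = ua | ub;                /* 0x4017777777777777 */
    memcpy(&r, &ur, sizeof r);

    printf("0x%016llX -> %f\n", (unsigned long long)ur, r);  /* ~5.866667 */
    return 0;
}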

There are some good pointers on this format; I found the links at http://www.psc.edu/general/software/packages/ieee/ieee.php, http://babbage.cs.qc.edu/IEEE-754/ and http://www.binaryconvert.com/convert_double.html fairly good for understanding it.

Now, if you still want to implement bitwise addition for those numbers, you can. But you'll have to break each number down into its parts, normalize the numbers to the same exponent (otherwise you won't be able to add the mantissas), perform the addition on the mantissas, and finally normalize the result back to the IEEE 754 format. But, as @LukeGT said, you'll likely not get better performance than the JS engine you're running on. Also note that JavaScript's bitwise operators are defined to first convert their operands to 32-bit integers, so applying them directly to floating-point numbers truncates the values and makes the results incorrect as well.

How to Multiply a Floating Point Number using Bitwise Operators Without the Multiplication Operator in C

You need to increment the exponent by 1 to double the floating-point value. That can be done by a ripple-carry adder:

#include <stdio.h>
#include <stdint.h>

int main()
{
    float f = 3.14f ;               // Test value
    uint32_t* x = (uint32_t*)(&f) ; // get "bits view" of test value
                                    // (note: this aliasing trick is what the first
                                    // answer above warns about; memcpy is safer)

    // increment the exponent with a ripple-carry add (XOR = half-adder sum)
    uint32_t mask = 0x00800000 ;    // lowest exponent bit
    *x ^= mask ;
    while( (*x & mask) == 0 &&      // the bit flipped to 0: carry propagates
           mask != 0x80000000 )     // stop before reaching the sign bit
    {
        mask <<= 1 ;
        *x ^= mask ;
    }

    // Show result
    printf( "%f", f ) ;

    return 0;
}

The output of the above code is:

6.280000

The solution does not deal with exponent overflow - that would require mantissa adjustment.


