32-Bit to 16-Bit Floating Point Conversion

32-bit to 16-bit Floating Point Conversion

std::frexp extracts the significand and exponent from normal floats or doubles. You then need to decide what to do with exponents that are too large to fit in a half-precision float (saturate?), adjust accordingly, and put the half-precision number together. This article has C source code showing how to perform the conversion.
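For illustration, here is a minimal, untested sketch of that approach in C, using frexpf from <math.h>. It only handles the normal range, saturates overly large exponents to infinity, flushes results that are too small to zero, and truncates instead of rounding; the name FloatToHalfSketch is made up for this example.

#include <math.h>
#include <stdint.h>

// Sketch only: no rounding, no subnormal results, IEEE-754 float assumed.
uint16_t FloatToHalfSketch(float x)
{
    uint16_t sign = signbit(x) ? 0x8000u : 0u;

    if (isnan(x))
        return sign | 0x7E00u;          // a quiet NaN encoding
    if (isinf(x))
        return sign | 0x7C00u;          // infinity

    float ax = fabsf(x);
    if (ax == 0.0f)
        return sign;                    // signed zero

    int exponent;
    float frac = frexpf(ax, &exponent); // ax = frac * 2^exponent, frac in [0.5, 1)

    int e = exponent + 14;              // biased half exponent for 1.xxx * 2^(exponent - 1)
    if (e >= 31)
        return sign | 0x7C00u;          // too large: saturate to infinity
    if (e <= 0)
        return sign;                    // too small: flush to zero in this sketch

    // frac is in [0.5, 1); scale to 11 bits and drop the hidden leading bit.
    uint16_t mant = (uint16_t)((uint32_t)(frac * 2048.0f) & 0x3FFu);
    return sign | (uint16_t)(e << 10) | mant;
}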

Convert 16-bit floating point to 32-bit floating point

There are no standard routines for this in C.

If the first DSP format is just Q7.8 as discussed in comments, then you can convert it to floating-point with:

#include <stdint.h>
#include <string.h>

int16_t Temp;
memcpy(&Temp, &PlaceWhereDataIs, sizeof Temp);
float Float = Temp * 0x1p-8f;

This simply puts the bits into a signed 16-bit integer, then converts it to float and scales it for eight fractional bits.

0x1p-8f is hexadecimal floating-point notation for a float constant with the value 2^-8. If your compiler does not support that, you can use / 256.f instead of * 0x1p-8f.

If your compiler does not support int16_t, you can use short if it is 16 bits.

If the second is Q8.23, then it can similarly be converted with:

int32_t Temp;
memcpy(&Temp, &PlaceWhereDataIs, sizeof Temp);
float Float = Temp * 0x1p-23f;

However, its fields of 1, 8, and 23 match the field sizes of the common IEEE-754 basic 32-bit binary floating-point format, which makes me suspect it is a floating-point format, not a fixed-point format. In that case, you can get it into a float with:

float Float;
memcpy(&Float, &PlaceWhereDataIs, sizeof Float);

If the first DSP format is actually a floating-point format with 1 sign bit, 7 exponent bits, and 8 significand bits, then some work is needed to convert it. Additionally, you will have to supply details from documentation—custom floating-point formats tend to treat subnormals, infinities, and NaNs differently, as well as have non-standard exponent biases.
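As an illustration only, here is a heavily hedged sketch of decoding such a 1-7-8 layout. The exponent bias of 63, the implicit leading 1, and the absence of subnormals, infinities, and NaNs are all assumptions that must be checked against the DSP's documentation, and the name DecodeCustom16 is made up.

#include <math.h>
#include <stdint.h>

// Assumed layout: 1 sign bit, 7 exponent bits (bias 63 assumed), 8 fraction bits.
float DecodeCustom16(uint16_t x)
{
    uint32_t s = x >> 15;
    uint32_t e = (x >> 8) & 0x7F;
    uint32_t f = x & 0xFF;

    // Assumed: value = (-1)^s * (1 + f/2^8) * 2^(e - 63); special cases ignored.
    float value = ldexpf(1.0f + (float)f / 256.0f, (int)e - 63);
    return s ? -value : value;
}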

If both are fixed-point formats, you can convert the first to the second with:

int16_t Temp;
memcpy(&Temp, &PlaceWhereDataIs, sizeof Temp);
int32_t Result = (int32_t) Temp * (1 << (23 - 8));

Bit shifting a half-float into a float

Here is code demonstrating the 16-bit floating-point to 32-bit floating-point conversion, plus a test program. The test program requires Clang's __fp16 type, but the conversion code does not. Handling of NaN payloads and signaling/non-signaling semantics is not tested.

#include <stdint.h>


// Produce value of bit n. n must be less than 32.
#define Bit(n) ((uint32_t) 1 << (n))

// Create a mask of n bits in the low bits. n must be less than 32.
#define Mask(n) (Bit(n) - 1)


/* Convert an IEEE-754 16-bit binary floating-point encoding to an IEEE-754
   32-bit binary floating-point encoding.

   This code has not been tested.
*/
uint32_t Float16ToFloat32(uint16_t x)
{
    /* Separate the sign encoding (1 bit starting at bit 15), the exponent
       encoding (5 bits starting at bit 10), and the primary significand
       (fraction) encoding (10 bits starting at bit 0).
    */
    uint32_t s = x >> 15;
    uint32_t e = x >> 10 & Mask( 5);
    uint32_t f = x & Mask(10);

    // Left-adjust the significand field.
    f <<= 23 - 10;

    // Switch to handle subnormal numbers, normal numbers, and infinities/NaNs.
    switch (e)
    {
        // Exponent code is subnormal.
        case 0:
            // Zero does not need any changes, but subnormals need normalization.
            if (f != 0)
            {
                /* Set the 32-bit exponent code corresponding to the 16-bit
                   subnormal exponent.
                */
                e = 1 + (127 - 15);

                /* Normalize the significand by shifting until its leading
                   bit moves out of the field. (This code could benefit from
                   a find-first-set instruction or possibly using a conversion
                   from integer to floating-point to do the normalization.)
                */
                while (f < Bit(23))
                {
                    f <<= 1;
                    e -= 1;
                }

                // Remove the leading bit.
                f &= Mask(23);
            }
            break;

        // Exponent code is normal.
        default:
            e += 127 - 15; // Adjust from 16-bit bias to 32-bit bias.
            break;

        // Exponent code indicates infinity or NaN.
        case 31:
            e = 255; // Set 32-bit exponent code for infinity or NaN.
            break;
    }

    // Assemble and return the 32-bit encoding.
    return s << 31 | e << 23 | f;
}


#include <inttypes.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>


int main(void)
{
    // Use unions so we can iterate and manipulate the encodings.
    union { uint16_t enc; __fp16 value; } x;
    union { uint32_t enc; float value; } y;

    // Iterate through all 16-bit encodings.
    for (uint32_t i = 0; i < Bit(16); ++i)
    {
        x.enc = i;
        y.enc = Float16ToFloat32(x.enc);
        if (isnan(x.value) != isnan(y.value) ||
            (!isnan(x.value) && x.value != y.value))
        {
            printf("Failure:\n");
            printf("\tx encoding = 0x%04" PRIx16 ", value = %.99g.\n",
                x.enc, x.value);
            printf("\ty encoding = 0x%08" PRIx32 ", value = %.99g.\n",
                y.enc, y.value);
            exit(EXIT_FAILURE);
        }
    }
}

As chtz points out, we can use 32-bit floating-point arithmetic to handle the scaling adjustment for both normal and subnormal values. To do this, replace the code in Float16ToFloat32 after f <<= 23 - 10; with:

    // For infinities and NaNs, set the 32-bit exponent code.
    if (e == 31)
        return s << 31 | 255 << 23 | f;

    /* For finite values, reassemble with the shifted fields and use a
       floating-point multiply to adjust for the changed exponent bias.
    */
    union { uint32_t enc; float value; } y = { .enc = s << 31 | e << 23 | f };
    y.value *= 0x1p112f;
    return y.enc;

Converting a 32 bit wave form to a 16 bit wave form

Format 0x0003 is in fact ieeeFloat, so you shouldn't get this exception; better check the value it read. You cannot convert the values with bit shifting; you have to convert from float to short. A simple cast gets the job done.
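For example, here is a minimal sketch of that per-sample conversion, assuming the float samples are normalized to [-1.0, 1.0] as is usual for IEEE-float WAV data, so they are scaled before the cast (the function name is made up):

#include <stdint.h>

// Clamp, scale, and cast one normalized float sample to signed 16-bit PCM.
int16_t FloatSampleToPcm16(float sample)
{
    if (sample > 1.0f)  sample = 1.0f;      // clamp so the cast cannot overflow
    if (sample < -1.0f) sample = -1.0f;
    return (int16_t)(sample * 32767.0f);
}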

Converting 32-bit number to 16 bits or less

I should be able to comment on this quite accurately. I used to do DSP processing work where we would "integerize" code, which effectively meant we'd take a signal/audio/video algorithm and replace all the floating-point logic with fixed-point arithmetic (i.e., Q m.n notation, etc.).

On most modern systems, you'll usually get better performance using integer arithmetic, compared to floating point arithmetic, at the expense of more complicated code you have to write.

The chip you are using (Cortex-M3) doesn't have a dedicated hardware FPU: it can only emulate floating-point operations in software, so they are going to be expensive (take a lot of time).

In your case, you could just read the 16-bit value via read_u16(), shift it right by 4 bits, and you're done. If you're working with audio data, you might consider looking into companding algorithms (a-law, u-law), which will give better subjective performance than simply chopping off the 4 LSBs to get a 12-bit number from a 16-bit number.
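A minimal sketch of that, assuming read_u16() is the accessor mentioned in the question:

#include <stdint.h>

extern uint16_t read_u16(void);   // accessor from the question (assumed)

// Discard the 4 least-significant bits to get a 12-bit sample.
uint16_t read_12bit_sample(void)
{
    return read_u16() >> 4;
}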

Yes, a float on that system is 32 bits, and is likely represented in IEEE-754 format. Multiplying a pair of 32-bit values versus a pair of 16-bit values may very well take the same amount of time, depending on the chip in use and the presence of an FPU and ALU. On your chip, multiplying two floats will be horrendously expensive in terms of time. Also, if you multiply two 32-bit integers, they can potentially overflow, so that is one potential reason to go with floating-point logic if you don't want to implement a fixed-point algorithm.
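For instance, one common way to avoid that overflow in fixed-point code is to widen the intermediate product to 64 bits. This sketch assumes a Q16.16 format, which is not from the question:

#include <stdint.h>

// Multiply two Q16.16 fixed-point numbers using a 64-bit intermediate.
int32_t q16_mul(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * b;   // full 64-bit product, no intermediate overflow
    return (int32_t)(product >> 16);    // renormalize; the result must still fit in Q16.16
}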

Convert single precision floating point to half precision floating point

Having gotten a biased exponent of -10, you need to create a denormalized number (with 0 in the exponent field) by shifting the mantissa bits right by 11. That gives you 00000 00000 11000... for the mantissa bits, which you then round up to 00000 00001 -- the smallest possible denorm number.


An IEEE fp number has a 1-bit sign, an n-bit exponent field, and an m-bit mantissa field. For the n-bit exponent field, an all-1s value represents Inf or NaN, and an all-0s value represents a denorm or zero (which one depends on the mantissa bits). So only exponents in the range 1..2^n - 2 are valid for normalized numbers.

So when you calculate your "Normalized and biased" exponent, if it is ≤ 0, you need to generate a denorm (or zero) instead. The value for a normalized number is

(-1)^S * (1.0 + 2^-m * M) * 2^(E - bias)

(where M is the value in the mantissa field treated as an unsigned integer and m is the number of mantissa bits -- some descriptions write this as 1.M). The value for a denorm is

(-1)^S * (0.0 + 2^-m * M) * 2^(1 - bias)

That is, the exponent is the same as for a biased exponent value of 1, but the "hidden bit" (the extra bit added to the top of the mantissa) is treated as 0 instead of 1. So to convert your normalized number with the (biased) exponent of -10 to a denorm, you need to shift the mantissa (including the hidden 1 bit that is normally not stored) by 1 - (-10) bits (that is, 11 bits) to get the mantissa value you want for the denorm. Since this will always shift by at least one bit (for any biased exponent ≤ 0), it will shift a 0 into the hidden bit position, matching the denorm meaning of the mantissa. If the exponent is small enough, it will shift completely out of the mantissa, giving you a 0 mantissa (which is a zero). But in your specific case, even though the mantissa shifts entirely out of the 10 bits representable in the fp16 format, the guard bits are still 1s, so it rounds up to 1.
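A short sketch of just that step, with rounding of the shifted-out bits omitted for brevity (the function name and parameters are made up for illustration):

#include <stdint.h>

// Build the fp16 denormal encoding for a result whose biased exponent is <= 0.
// mantissa10 is the already-computed 10-bit mantissa field; rounding is omitted.
uint16_t MakeHalfDenorm(uint16_t sign, int biasedExp, uint16_t mantissa10)
{
    uint32_t mant = (uint32_t)mantissa10 | (1u << 10);  // restore the hidden 1 bit

    int shift = 1 - biasedExp;                          // e.g. 11 when biasedExp is -10
    mant = (shift < 32) ? (mant >> shift) : 0;          // large shifts give a zero

    return (uint16_t)((sign << 15) | (mant & 0x3FFu));  // exponent field stays 0
}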

Convert int to 16-bit float (half precision floating point) in C++

It's a very straightforward thing; all the info you need is on Wikipedia.

Sample implementation:

#include <stdio.h>

unsigned int2hfloat(int x)
{
    unsigned sign = x < 0;
    unsigned absx = ((unsigned)x ^ -sign) + sign; // safe abs(x)
    unsigned tmp = absx, manbits = 0;
    int exp = 0, truncated = 0;

    // calculate the number of bits needed for the mantissa
    while (tmp)
    {
        tmp >>= 1;
        manbits++;
    }

    // half-precision floats have 11 bits in the mantissa.
    // truncate the excess or insert the lacking 0s until there are 11.
    if (manbits)
    {
        exp = 10; // exp bias because 1.0 is at bit position 10
        while (manbits > 11)
        {
            truncated |= absx & 1;
            absx >>= 1;
            manbits--;
            exp++;
        }
        while (manbits < 11)
        {
            absx <<= 1;
            manbits++;
            exp--;
        }
    }

    if (exp + truncated > 15)
    {
        // absx was too big, force it to +/- infinity
        exp = 31; // special infinity value
        absx = 0;
    }
    else if (manbits)
    {
        // normal case, absx > 0
        exp += 15; // bias the exponent
    }

    return (sign << 15) | ((unsigned)exp << 10) | (absx & ((1u<<10)-1));
}

int main(void)
{
    printf(" 0: 0x%04X\n", int2hfloat(0));
    printf("-1: 0x%04X\n", int2hfloat(-1));
    printf("+1: 0x%04X\n", int2hfloat(+1));
    printf("-2: 0x%04X\n", int2hfloat(-2));
    printf("+2: 0x%04X\n", int2hfloat(+2));
    printf("-3: 0x%04X\n", int2hfloat(-3));
    printf("+3: 0x%04X\n", int2hfloat(+3));
    printf("-2047: 0x%04X\n", int2hfloat(-2047));
    printf("+2047: 0x%04X\n", int2hfloat(+2047));
    printf("-2048: 0x%04X\n", int2hfloat(-2048));
    printf("+2048: 0x%04X\n", int2hfloat(+2048));
    printf("-2049: 0x%04X\n", int2hfloat(-2049)); // first inexact integer
    printf("+2049: 0x%04X\n", int2hfloat(+2049));
    printf("-2050: 0x%04X\n", int2hfloat(-2050));
    printf("+2050: 0x%04X\n", int2hfloat(+2050));
    printf("-32752: 0x%04X\n", int2hfloat(-32752));
    printf("+32752: 0x%04X\n", int2hfloat(+32752));
    printf("-32768: 0x%04X\n", int2hfloat(-32768));
    printf("+32768: 0x%04X\n", int2hfloat(+32768));
    printf("-65504: 0x%04X\n", int2hfloat(-65504)); // legal maximum
    printf("+65504: 0x%04X\n", int2hfloat(+65504));
    printf("-65505: 0x%04X\n", int2hfloat(-65505)); // infinity from here on
    printf("+65505: 0x%04X\n", int2hfloat(+65505));
    printf("-65535: 0x%04X\n", int2hfloat(-65535));
    printf("+65535: 0x%04X\n", int2hfloat(+65535));
    return 0;
}

Output (ideone):

 0: 0x0000
-1: 0xBC00
+1: 0x3C00
-2: 0xC000
+2: 0x4000
-3: 0xC200
+3: 0x4200
-2047: 0xE7FF
+2047: 0x67FF
-2048: 0xE800
+2048: 0x6800
-2049: 0xE800
+2049: 0x6800
-2050: 0xE801
+2050: 0x6801
-32752: 0xF7FF
+32752: 0x77FF
-32768: 0xF800
+32768: 0x7800
-65504: 0xFBFF
+65504: 0x7BFF
-65505: 0xFC00
+65505: 0x7C00
-65535: 0xFC00
+65535: 0x7C00

Float32 to Float16

The exponents in your float32 and float16 representations are probably biased, and biased differently. You need to unbias the exponent you got from the float32 representation to get the actual exponent, and then to bias it for the float16 representation.

Apart from this detail, I do think it's as simple as that, but I still get surprised by floating-point representations from time to time.

EDIT:

  1. While you're at it, check for overflow when adjusting the exponent.

  2. Your algorithm truncates the last bits of the mantissa rather abruptly; that may be acceptable, but you may want to implement, say, round-to-nearest by looking at the bits that are about to be discarded: "0..." -> round down, "100..001..." -> round up, "100..00" -> round to even. See the sketch after this list.
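Here is a rough, untested sketch of both points combined, for inputs whose result is a normal half (NaNs and results that underflow to denormals are not handled); the function name is made up:

#include <stdint.h>

// Rebias the exponent and round the mantissa to nearest, ties to even.
uint16_t RebiasAndRound(uint32_t f32bits)
{
    uint32_t sign = f32bits >> 31;
    int32_t  exp  = (int32_t)((f32bits >> 23) & 0xFF) - 127 + 15;  // unbias, then rebias
    uint32_t mant = f32bits & 0x7FFFFF;                            // 23 stored mantissa bits

    if (exp >= 31)
        return (uint16_t)((sign << 15) | (31u << 10));             // overflow: infinity
    if (exp <= 0)
        return (uint16_t)(sign << 15);                             // underflow: zero here

    uint32_t kept = mant >> 13;                 // 10 bits that survive
    uint32_t rest = mant & 0x1FFF;              // 13 bits being discarded
    if (rest > 0x1000 || (rest == 0x1000 && (kept & 1)))
        kept += 1;                              // round up; a carry bumps the exponent

    return (uint16_t)((sign << 15) | (((uint32_t)exp << 10) + kept));
}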

Convert 32-bit float raster to 16-bit in R

In your example you have real values between 0 and 1. By writing to an integer type, these are all truncated to 0. If you want to write to INT2U (unsigned 16-bit integer) you could first scale the values to be between 0 and 65535.

Example data

library(raster)
b <- brick(system.file("external/rlogo.grd", package="raster"))
image <- clamp(b/255, 0, 0.6)
#class : RasterBrick
#dimensions : 77, 101, 7777, 3 (nrow, ncol, ncell, nlayers)
#resolution : 1, 1 (x, y)
#extent : 0, 101, 0, 77 (xmin, xmax, ymin, ymax)
#crs : +proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs
#source : memory
#names : red, green, blue
#min values : 0, 0, 0
#max values : 0.6, 0.6, 0.6

What you do (truncation)

fname <- paste0(tempfile(), ".tif")      
x <- writeRaster(image, fname, datatype='INT2U', overwrite=TRUE)
x
#min values : 0, 0, 0
#max values : 1, 1, 1

But note the difference when you use rounding

fname <- paste0(tempfile(), ".tif")      
y <- round(image)
z <- writeRaster(y, fname, datatype='INT2U', overwrite=TRUE)

s <- stack(x[[1]], z[[1]])
plot(s)

Now with some scaling

maxv <- 65535
r <- round(image * maxv)
fname <- paste0(tempfile(), ".tif")
s <- writeRaster(r, fname, datatype='INT2U', overwrite=TRUE)

#s
#min values : 0, 0, 0
#max values : 39321, 39321, 39321

With your data you would get max values of

round(maxv * c(0.5347134, 0.3522218, 0.4896736, 0.7308348 ))
#[1] 35042 23083 32091 47895

You can also choose to set the max values of all layers to maxv, to keep more variation (but make the values no longer comparable with other data) --- more relevant if you were using a smaller range such as 0-255.

ss <- round(maxv * image / maxValue(image))
#names : red, green, blue
#min values : 0, 0, 0
#max values : 65535, 65535, 65535

The above works because the lowest values are zero; if you have negative values you would do

ss <- image - minValue(image)
ss <- round(maxv * ss / maxValue(ss))

In other cases you might want to use clamp. Either way, you need to decide how to scale your data.

What I showed is linear scaling. There are other ways. For example, there is also the scale method, which can improve the statistical distribution of the numbers. That may be relevant, but it depends on what your goals are.

Note: you do not say why you do this, and that is OK. But if it is to save hard disk space, you can use compression instead (see ?writeRaster).


