How to Convert Bytes to Half-Floats in Swift

Convert half precision float (bytes) to float in Swift

If you have an array of half-precision data, you can convert all of it to float at once using vImageConvert_Planar16FtoPlanarF, which is provided by Accelerate.framework:

import Accelerate
let n = 2
var input: [UInt16] = [0x3C00, 0xBC00]       // half-float bit patterns of 1.0 and -1.0
var output = [Float](repeating: 0, count: n)
var src = vImage_Buffer(data: &input, height: 1, width: vImagePixelCount(n), rowBytes: 2 * n)
var dst = vImage_Buffer(data: &output, height: 1, width: vImagePixelCount(n), rowBytes: 4 * n)
vImageConvert_Planar16FtoPlanarF(&src, &dst, 0)
// output now contains [1.0, -1.0]

You can also use this method to convert individual values, but it's fairly heavyweight if that's all that you're doing; on the other hand it's extremely efficient if you have large buffers of values to convert.

If you need to convert isolated values, you might put something like the following C function in your bridging header and use it from Swift:

#include <stdint.h>
static inline float loadFromF16(const uint16_t *pointer) { return *(const __fp16 *)pointer; }

This will use hardware conversion instructions when you're compiling for targets that have them (armv7s, arm64, x86_64h), and call a reasonably good software conversion routine when compiling for targets that don't have hardware support.
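Once the function above is visible through the bridging header, calling it from Swift is a one-liner (a minimal sketch):

var bits: UInt16 = 0x3c00       // half-precision bit pattern of 1.0
let value = loadFromF16(&bits)  // 1.0 as Float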

addendum: going the other way

You can convert float to half-precision in pretty much the same way:

static inline void storeAsF16(float value, uint16_t *pointer) { *(__fp16 *)pointer = value; }

Or use the function vImageConvert_PlanarFtoPlanar16F.
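For whole buffers, the reverse conversion mirrors the first example (a sketch reusing the same two-element data):

var floats: [Float] = [1.0, -1.0]
var halves = [UInt16](repeating: 0, count: floats.count)
var src = vImage_Buffer(data: &floats, height: 1, width: vImagePixelCount(floats.count), rowBytes: 4 * floats.count)
var dst = vImage_Buffer(data: &halves, height: 1, width: vImagePixelCount(floats.count), rowBytes: 2 * floats.count)
vImageConvert_PlanarFtoPlanar16F(&src, &dst, 0)
// halves now contains [0x3c00, 0xbc00]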

How to convert bytes to half-floats in Swift?

There is no 16-bit floating point type in Swift, but you can convert
the results to 32-bit floating point numbers (Float).
The thread "32-bit to 16-bit Floating Point Conversion" (see below) contains a lot of information about the half-precision floating-point format and various conversion methods. The crucial hint, however, is in Ian Ollman's answer:

On OS X / iOS, you can use vImageConvert_PlanarFtoPlanar16F and
vImageConvert_Planar16FtoPlanarF. See Accelerate.framework.

Ian did not provide code, however, so here is a possible implementation in Swift:

func areaHistogram(image : UIImage) {

    let BINS = 64  // histogram bin count (assumed value; adjust as needed)
    let bpp = 8    // bytes per RGBAh pixel: 4 components × 2 bytes each

    let inputImage = CIImage(image: image)

    let totalBytes : Int = bpp * BINS
    let bitmap = calloc(1, totalBytes)

    let filter = CIFilter(name: "CIAreaHistogram")!
    filter.setValue(inputImage, forKey: kCIInputImageKey)
    filter.setValue(CIVector(x: 0, y: 0, z: image.size.width, w: image.size.height), forKey: kCIInputExtentKey)
    filter.setValue(BINS, forKey: "inputCount")
    filter.setValue(1, forKey: "inputScale")

    let myEAGLContext = EAGLContext(API: .OpenGLES2)
    let options = [kCIContextWorkingColorSpace : kCFNull]
    let context : CIContext = CIContext(EAGLContext: myEAGLContext, options: options)
    context.render(filter.outputImage!, toBitmap: bitmap, rowBytes: totalBytes,
                   bounds: filter.outputImage!.extent, format: kCIFormatRGBAh,
                   colorSpace: CGColorSpaceCreateDeviceRGB())

    // *** CONVERSION FROM 16-bit TO 32-bit FLOAT ARRAY STARTS HERE ***

    let comps = 4 // Number of components (RGBA)

    // Array for the RGBA values of the histogram:
    var rgbaFloat = [Float](count: comps * BINS, repeatedValue: 0)

    // Source and destination buffer structures for the vImage conversion function:
    var srcBuffer = vImage_Buffer(data: bitmap, height: 1, width: UInt(comps * BINS), rowBytes: bpp * BINS)
    var dstBuffer = vImage_Buffer(data: &rgbaFloat, height: 1, width: UInt(comps * BINS), rowBytes: comps * sizeof(Float) * BINS)

    // Half-precision float to Float conversion of the entire buffer:
    if vImageConvert_Planar16FtoPlanarF(&srcBuffer, &dstBuffer, 0) == kvImageNoError {
        for bin in 0 ..< BINS {
            let R = rgbaFloat[comps * bin + 0]
            let G = rgbaFloat[comps * bin + 1]
            let B = rgbaFloat[comps * bin + 2]
            print("R/G/B = \(R) \(G) \(B)")
        }
    }

    free(bitmap)
}

Remarks:

  • You need to import Accelerate.
  • Note that your code allocates totalBytes * bpp bytes instead
    of the necessary totalBytes.
  • The kCIFormatRGBAh pixel format is not supported on the Simulator (as of Xcode 7), so you have to test the code on a real device.

Update: Swift 5.3 (Xcode 12) introduces a new Float16 type, available in iOS 14; see SE-0277 Float16 on Swift Evolution.

This simplifies the code because a conversion to Float is no longer necessary. I have also removed the use of OpenGL functions which are deprecated as of iOS 12:

func areaHistogram(image: UIImage, bins: Int) -> [Float16] {

    let comps = 4 // Number of components (RGBA)

    let inputImage = CIImage(image: image)
    var rgbaFloat = [Float16](repeating: 0, count: comps * bins)
    let totalBytes = MemoryLayout<Float16>.size * comps * bins

    let filter = CIFilter(name: "CIAreaHistogram")!
    filter.setValue(inputImage, forKey: kCIInputImageKey)
    filter.setValue(CIVector(x: 0, y: 0, z: image.size.width, w: image.size.height), forKey: kCIInputExtentKey)
    filter.setValue(bins, forKey: "inputCount")
    filter.setValue(1, forKey: "inputScale")

    let options: [CIContextOption : Any] = [.workingColorSpace : NSNull()]
    let context = CIContext(options: options)

    rgbaFloat.withUnsafeMutableBytes {
        context.render(filter.outputImage!, toBitmap: $0.baseAddress!, rowBytes: totalBytes,
                       bounds: filter.outputImage!.extent, format: CIFormat.RGBAh,
                       colorSpace: CGColorSpaceCreateDeviceRGB())
    }
    return rgbaFloat
}
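A hypothetical call site then reduces to a single line (myImage is a placeholder):

let histogram = areaHistogram(image: myImage, bins: 64)  // 4 × 64 Float16 values, RGBA per bin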

32-bit to 16-bit Floating Point Conversion

std::frexp extracts the significand and exponent from normal floats or doubles; you then need to decide what to do with exponents that are too large to fit in a half-precision float (saturate them?), adjust accordingly, and assemble the half-precision number. This article has C source code showing how to perform the conversion.
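A minimal Swift sketch of this approach, using the exponent and significand properties (Swift's equivalent of frexp); it handles normal values only, truncates instead of rounding, collapses NaN to infinity, and produces no subnormals. The Java implementation further down is a complete treatment:

func floatToHalfBits(_ x: Float) -> UInt16 {
    let sign: UInt16 = x.sign == .minus ? 0x8000 : 0
    if x.isZero { return sign }
    if !x.isFinite { return sign | 0x7C00 }        // infinity (NaN payload not preserved)
    let mag = abs(x)
    let halfExp = mag.exponent + 15                // rebias: half precision uses a bias of 15
    if halfExp >= 0x1F { return sign | 0x7C00 }    // too large: saturate to +/-infinity
    if halfExp <= 0 { return sign }                // too small: flush to +/-0 (no subnormals)
    // significand is in [1, 2); keep the top 10 fraction bits, truncated
    let mant = UInt16((mag.significand - 1) * 1024)
    return sign | UInt16(halfExp) << 10 | mant
}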

Half-precision floating-point in Java

You can use Float.intBitsToFloat() and Float.floatToIntBits() to convert to and from primitive float values. If you can live with truncated precision (as opposed to rounding), the conversion can be implemented with just a few bit shifts.

I have now put a little more effort into it, and it turned out not quite as simple as I expected at the beginning. This version is now tested and verified in every aspect I could imagine, and I'm very confident that it produces exact results for all possible input values. It supports exact rounding and subnormal conversion in either direction.

// ignores the higher 16 bits
public static float toFloat( int hbits )
{
    int mant = hbits & 0x03ff;            // 10 bits mantissa
    int exp = hbits & 0x7c00;             // 5 bits exponent
    if( exp == 0x7c00 )                   // NaN/Inf
        exp = 0x3fc00;                    // -> NaN/Inf
    else if( exp != 0 )                   // normalized value
    {
        exp += 0x1c000;                   // exp - 15 + 127
        if( mant == 0 && exp > 0x1c400 )  // smooth transition
            return Float.intBitsToFloat( ( hbits & 0x8000 ) << 16
                                            | exp << 13 | 0x3ff );
    }
    else if( mant != 0 )                  // && exp==0 -> subnormal
    {
        exp = 0x1c400;                    // make it normal
        do {
            mant <<= 1;                   // mantissa * 2
            exp -= 0x400;                 // decrease exp by 1
        } while( ( mant & 0x400 ) == 0 ); // while not normal
        mant &= 0x3ff;                    // discard subnormal bit
    }                                     // else +/-0 -> +/-0
    return Float.intBitsToFloat(          // combine all parts
        ( hbits & 0x8000 ) << 16          // sign << ( 31 - 15 )
        | ( exp | mant ) << 13 );         // value << ( 23 - 10 )
}

// returns all higher 16 bits as 0 for all results
public static int fromFloat( float fval )
{
    int fbits = Float.floatToIntBits( fval );
    int sign = fbits >>> 16 & 0x8000;               // sign only
    int val = ( fbits & 0x7fffffff ) + 0x1000;      // rounded value

    if( val >= 0x47800000 )                         // might be or become NaN/Inf
    {                                               // avoid Inf due to rounding
        if( ( fbits & 0x7fffffff ) >= 0x47800000 )
        {                                           // is or must become NaN/Inf
            if( val < 0x7f800000 )                  // was value but too large
                return sign | 0x7c00;               // make it +/-Inf
            return sign | 0x7c00 |                  // remains +/-Inf or NaN
                ( fbits & 0x007fffff ) >>> 13;      // keep NaN (and Inf) bits
        }
        return sign | 0x7bff;                       // unrounded not quite Inf
    }
    if( val >= 0x38800000 )                         // remains normalized value
        return sign | val - 0x38000000 >>> 13;      // exp - 127 + 15
    if( val < 0x33000000 )                          // too small for subnormal
        return sign;                                // becomes +/-0
    val = ( fbits & 0x7fffffff ) >>> 23;            // tmp exp for subnormal calc
    return sign | ( ( fbits & 0x7fffff | 0x800000 ) // add subnormal bit
        + ( 0x800000 >>> val - 102 )                // round depending on cut off
        >>> 126 - val );                            // div by 2^(1-(exp-127+15)) and >> 13 | exp=0
}

I implemented two small extensions compared to the book because the general precision of 16-bit floats is rather low, which can make the inherent anomalies of floating-point formats visually perceivable, whereas in larger floating-point types they usually go unnoticed thanks to the ample precision.

The first is these two lines in the toFloat() function:

if( mant == 0 && exp > 0x1c400 )  // smooth transition
    return Float.intBitsToFloat( ( hbits & 0x8000 ) << 16 | exp << 13 | 0x3ff );

Floating point numbers in the normal range of the type adapt the exponent, and thus the precision, to the magnitude of the value. But this is not a smooth adaptation; it happens in steps: switching to the next higher exponent results in half the precision, and the precision then remains the same for all values of the mantissa until the next jump to the next higher exponent.

The extension code above makes these transitions smoother by returning a value that lies in the geometric center of the covered 32-bit float range for this particular half-float value. Every normal half-float value maps to exactly 8192 32-bit float values, and the returned value is supposed to be exactly in the middle of these. But at the transition of the half-float exponent, the lower 4096 values have twice the precision of the upper 4096 values and thus cover a number space that is only half as large as on the other side. All these 8192 32-bit float values map to the same half-float value, so converting a half float to 32 bit and back results in the same half-float value regardless of which of the 8192 intermediate 32-bit values was chosen.

The extension thus results in something like a smoother half step, by a factor of sqrt(2), at the transition, as shown in the right-hand picture below, while the left-hand picture is supposed to visualize the sharp step by a factor of two without anti-aliasing. You can safely remove these two lines from the code to get the standard behavior.

covered number space on either side of the returned value:
6.0E-8 ####### ##########
4.5E-8 | #
3.0E-8 ######### ########

The second extension is in the fromFloat() function:

{                                               // avoid Inf due to rounding
    if( ( fbits & 0x7fffffff ) >= 0x47800000 )
        ...
    return sign | 0x7bff;                       // unrounded not quite Inf
}

This extension slightly extends the number range of the half-float format by saving some 32-bit values from getting promoted to infinity. The affected values are those that would have been smaller than infinity without rounding and would become infinity only due to the rounding. You can safely remove the lines shown above if you don't want this extension.

I tried to optimize the path for normal values in the fromFloat() function as much as possible, which made it a bit less readable due to the use of precomputed and unshifted constants. I didn't put as much effort into toFloat() since it would not exceed the performance of a lookup table anyway. So if speed really matters, you could use the toFloat() function only to fill a static lookup table with 0x10000 elements and then use that table for the actual conversion. This is about 3 times faster with a current x64 server VM and about 5 times faster with the x86 client VM.

I hereby place the code in the public domain.
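The same lookup-table idea, sketched in Swift for consistency with the rest of this page (assumes the Float16 type from Swift 5.3, mentioned above, so the table replaces a port of toFloat()):

// Build the 65536-entry table once; conversion is then a single array read.
let halfToFloatTable: [Float] = (0 ..< 0x10000).map {
    Float(Float16(bitPattern: UInt16($0)))
}

func halfToFloat(_ bits: UInt16) -> Float {
    return halfToFloatTable[Int(bits)]
}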

4 bytes to a Float in Swift gives an unexpectedly small result

You are on a little-endian platform, so your array is equivalent to the 32-bit integer 0x00000019, which, interpreted as an IEEE single-precision floating-point number, is approximately 3.5 × 10⁻⁴⁴.
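If the four bytes are meant to be the little-endian representation of a Float, reassemble the bit pattern explicitly (a sketch with a hypothetical byte array):

let bytes: [UInt8] = [0x00, 0x00, 0xc0, 0x3f]                  // 0x3fc00000 in little-endian order
let bits = bytes.withUnsafeBytes { $0.load(as: UInt32.self) }
let value = Float(bitPattern: UInt32(littleEndian: bits))      // 1.5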

Converting two bytes to an IEEE-11073 16-bit SFLOAT in C#

Loosely based on the C implementation by Signove on GitHub, I have created this function in C#:

Dictionary<Int32, Single> reservedValues = new Dictionary<Int32, Single> {
    { 0x07FE, Single.PositiveInfinity },
    { 0x07FF, Single.NaN },
    { 0x0800, Single.NaN },
    { 0x0801, Single.NaN },
    { 0x0802, Single.NegativeInfinity }
};

Single Ieee11073ToSingle(Byte[] bytes) {
    var ieee11073 = (UInt16) (bytes[0] + 0x100*bytes[1]);
    var mantissa = ieee11073 & 0x0FFF;
    if (reservedValues.ContainsKey(mantissa))
        return reservedValues[mantissa];
    if (mantissa >= 0x0800)
        mantissa = -(0x1000 - mantissa);
    var exponent = ieee11073 >> 12;
    if (exponent >= 0x08)
        exponent = -(0x10 - exponent);
    var magnitude = Math.Pow(10d, exponent);
    return (Single) (mantissa*magnitude);
}

This function assumes that the bytes are in little-endian order. If not, you will have to swap bytes[0] and bytes[1] in the first line of the function. Or, perhaps even better, remove the first line from the function, change the argument to accept a UInt16 (the IEEE 11073 value), and let the caller decide how to extract that value from the input.

I highly advise you to test this code, because I do not have any test values to verify the correctness of the conversion.
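Since the rest of this page is Swift, here is an equivalent Swift sketch of the same decoding, equally untested (sfloatToFloat is a name chosen for illustration):

import Foundation

func sfloatToFloat(_ raw: UInt16) -> Float {
    var mantissa = Int(raw & 0x0FFF)
    switch mantissa {                              // reserved special values
    case 0x07FE: return .infinity
    case 0x0802: return -.infinity
    case 0x07FF, 0x0800, 0x0801: return .nan
    default: break
    }
    if mantissa >= 0x0800 { mantissa -= 0x1000 }   // 12-bit two's complement
    var exponent = Int(raw >> 12)
    if exponent >= 0x08 { exponent -= 0x10 }       // 4-bit two's complement
    return Float(Double(mantissa) * pow(10.0, Double(exponent)))
}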

Get raw bytes of a float in Swift

Update for Swift 3: As of Swift 3, all floating point types have a bitPattern property, which returns an unsigned integer with the same memory representation, and a corresponding init(bitPattern:) initializer for the opposite conversion.

Example: Float to UInt32:

let x = Float(1.5)
let bytes1 = x.bitPattern
print(String(format: "%#08x", bytes1)) // 0x3fc00000

Example: UInt32 to Float:

let bytes2 = UInt32(0x3fc00000)
let y = Float(bitPattern: bytes2)
print(y) // 1.5

In the same way you can convert between Double and UInt64,
or between CGFloat and UInt.
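For example, the Double round trip looks the same (a minimal sketch):

let d = 1.5
let bits = d.bitPattern             // UInt64, here 0x3ff8000000000000
let d2 = Double(bitPattern: bits)   // 1.5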


Old answer for Swift 1.2 and Swift 2: The Swift floating point types have a _toBitPattern() method:

let x = Float(1.5)
let bytes1 = x._toBitPattern()
print(String(format: "%#08x", bytes1)) // 0x3fc00000

let bytes2: UInt32 = 0b00111111110000000000000000000000
print(String(format: "%#08x", bytes2)) // 0x3fc00000

print(bytes1 == bytes2) // true

This method is part of the FloatingPointType protocol
to which Float, Double and CGFloat conform:

/// A set of common requirements for Swift's floating point types.
protocol FloatingPointType : Strideable {
    typealias _BitsType
    static func _fromBitPattern(bits: _BitsType) -> Self
    func _toBitPattern() -> _BitsType

    // ...
}

(As of Swift 2, these definitions are no longer visible in the API documentation, but they still exist and work as before.)

The actual definition of _BitsType is not visible in the API
documentation, but it is UInt32 for Float, UInt64 for
Double, and Int for CGFloat:

print(Float(1.0)._toBitPattern().dynamicType)
// Swift.UInt32

print(Double(1.0)._toBitPattern().dynamicType)
// Swift.UInt64

print(CGFloat(1.0)._toBitPattern().dynamicType)
// Swift.UInt

_fromBitPattern() can be used for the conversion into the other
direction:

let y = Float._fromBitPattern(0x3fc00000)
print(y) // 1.5

Converting float to hex results in more digits than expected in C#?

EDIT: Oh dear oh dear, I completely missed this, which is the short answer:

Use BitConverter.GetBytes and pass it a float, as shown here.

The long answer:

BitConverter.DoubleToInt64Bits supports only doubles, with no single-precision counterpart, so to get at the bits you'll have to create a C# "union", like so:

[StructLayout(LayoutKind.Explicit)]
class Floater
{
    [FieldOffset(0)]
    public float theFloat;
    [FieldOffset(0)]
    public int theInt;
}

Put your float in theFloat and look at theInt.

How to make use of kCIFormatRGBAh to get half floats on iOS with Core Image?

There are two constraints on using RGBAh with [CIContext render:toBitmap:rowBytes:bounds:format:colorSpace:] on iOS:

  1. the rowBytes must be a multiple of 8 bytes
  2. calling it under simulator is not supported

These constraints come from the behavior of OpenGLES with RGBAh on iOS.

Converting float to double

Platform considerations

This depends on the platform used for the float computation. With the x87 FPU the conversion is free, since the register contents are the same; the only price you may sometimes pay is memory traffic, and in many cases there is no traffic at all, as you can simply use the value without any conversion. x87 is actually a strange beast in this respect: it is hard to properly distinguish between floats and doubles on it, as the instructions and registers used are the same; what differs are the load/store instructions, and the computation precision itself is controlled by status bits. Using mixed float/double computations may produce unexpected results (and there are compiler command-line options to control the exact behaviour and optimization strategies because of this).

When you use SSE (and Visual Studio sometimes uses SSE by default), it may be different, as you may need to transfer the value between registers or do something explicit to perform the conversion.

Memory savings and performance

As a summary, and answering your comment elsewhere: if you want to store the results of floating-point computations into 32-bit storage, the result will be the same speed or faster, because:

  • If you do this on x87, the conversion is free - the only difference will be fstp dword[] will be used instead of fstp qword[].
  • If you do this with SSE enabled, you may even see some performance gain, as some float computations can be done with SSE when the precision of the computation is only float instead of the default double.
  • In all cases, the memory traffic is lower.

