Avoiding Denormal Values in C++

You can test whether a float is denormal using

#include <cmath>

if ( std::fpclassify( flt ) == FP_SUBNORMAL )

(Caveat: I'm not sure that this will execute at full speed in practice.)

In C++03 (and this code has worked for me in practice):

#include <cmath>
#include <limits>

if ( flt != 0 && std::fabs( flt ) < std::numeric_limits<float>::min() ) {
    // it's denormalized
}

To decide where to apply this, you may use a sampling profiler such as Shark, VTune, or Zoom to highlight the instructions slowed by denormal values. Micro-optimization, even more than other optimizations, is totally hopeless without analysis both before and after.

What is this denormal data about? - C++

You ask about C++, but the specifics of floating-point values and encodings are determined by a floating-point specification, notably IEEE 754, and not by C++. IEEE 754 is by far the most widely used floating-point specification, and I will answer using it.

In IEEE 754, binary floating-point values are encoded with three parts: a sign bit s (0 for positive, 1 for negative), a biased exponent e (the represented exponent plus a fixed offset), and a significand field f (the fraction portion). For normal numbers, these represent exactly the number (−1)^s · 2^(e−bias) · 1.f, where 1.f is the binary numeral formed by writing the significand bits after "1.". (For example, if the significand field has the ten bits 0010111011, it represents the significand 1.0010111011₂, which is 1.1826171875, or 1211/1024.)
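
As a concrete illustration (my own sketch, not part of the original answer), the three fields of a 64-bit double can be pulled directly out of its bit pattern:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    double d = 1.1826171875;                 // 1211/1024, the example significand above
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);     // reinterpret the bits (C++20 could use std::bit_cast)

    std::uint64_t sign     = bits >> 63;                  // 1 sign bit
    std::uint64_t exponent = (bits >> 52) & 0x7FF;        // 11-bit biased exponent
    std::uint64_t fraction = bits & ((1ULL << 52) - 1);   // 52 explicit significand bits

    // For 1.1826171875 the actual exponent is 0, so the encoded exponent equals the bias, 1023.
    std::printf("s=%llu  e=%llu (actual %lld)  f=0x%013llx\n",
                (unsigned long long)sign, (unsigned long long)exponent,
                (long long)exponent - 1023, (unsigned long long)fraction);
}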

The bias depends on the floating-point format. For 64-bit IEEE 754 binary, the exponent field has 11 bits, and the bias is 1023. When the actual exponent is 0, the encoded exponent field is 1023. Actual exponents of −2, −1, 0, 1, and 2 have encoded exponents of 1021, 1022, 1023, 1024, and 1025. When somebody speaks of the exponent of a subnormal number being zero they mean the encoded exponent is zero. The actual exponent would be less than −1022. For 64-bit, the normal exponent interval is −1022 to 1023 (encoded values 1 to 2046). When the exponent moves outside this interval, special things happen.

Above this interval, floating-point stops representing finite numbers. An encoded exponent of 2047 (all 1 bits) represents infinity (with the significand field set to zero). Below this range, floating-point changes to subnormal numbers. When the encoded exponent is zero, the significand field represents 0.f instead of 1.f.

There is an important reason for this. If the lowest exponent value were just another normal encoding, then the lower bits of its significand would be too small to represent as floating-point values by themselves. Without that leading "1.", there would be no way to say where the first 1 bit was. For example, suppose you had two numbers, both with the lowest exponent, and with significands 1.0010111011₂ and 1.0000000000₂. When you subtract the significands, the result is 0.0010111011₂. Unfortunately, there is no way to represent this as a normal number. Because you were already at the lowest exponent, you cannot represent the lower exponent that is needed to say where the first 1 is in this result. Since the mathematical result is too small to be represented, a computer would be forced to return the nearest representable number, which would be zero.

This creates the undesirable property in the floating-point system that you can have a != b but a-b == 0. To avoid that, subnormal numbers are used. By using subnormal numbers, we have a special interval where the actual exponent does not decrease, and we can perform arithmetic without creating numbers too small to represent. When the encoded exponent is zero, the actual exponent is the same as when the encoded exponent is one, but the value of the significand changes to 0.f instead of 1.f. When we do this, a != b guarantees that the computed value of a-b is not zero.
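
As a small sketch of that guarantee (my own example, assuming the FPU is not in flush-to-zero mode), the difference of two distinct doubles near DBL_MIN is an exact subnormal rather than zero:

#include <cfloat>
#include <cmath>
#include <cstdio>

int main() {
    double a = DBL_MIN;                    // smallest normal double, 2^-1022
    double b = std::nextafter(a, 1.0);     // next representable double above it
    double diff = b - a;                   // exactly 2^-1074, the smallest subnormal

    std::printf("a != b         : %d\n", a != b);                                 // 1
    std::printf("b - a == 0     : %d\n", diff == 0.0);                            // 0 (gradual underflow)
    std::printf("b - a subnormal: %d\n", std::fpclassify(diff) == FP_SUBNORMAL);  // 1
}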

Here are the combinations of values in the encodings of 64-bit IEEE 754 binary floating-point:

Sign   Exponent (e)   Significand bits (f)      Meaning
----   ------------   -----------------------   ----------------------------
0      0              0                         +zero
0      0              non-zero                  +2^−1022 · 0.f (subnormal)
0      1 to 2046      anything                  +2^(e−1023) · 1.f (normal)
0      2047           0                         +infinity
0      2047           non-zero, high bit off    +, signaling NaN
0      2047           high bit on               +, quiet NaN
1      0              0                         −zero
1      0              non-zero                  −2^−1022 · 0.f (subnormal)
1      1 to 2046      anything                  −2^(e−1023) · 1.f (normal)
1      2047           0                         −infinity
1      2047           non-zero, high bit off    −, signaling NaN
1      2047           high bit on               −, quiet NaN
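
A small sketch (mine, not from the original answer) that classifies a double according to the table above, using only integer operations on the bit pattern:

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <limits>

// Classify a 64-bit IEEE 754 value by its encoded exponent and significand fields.
const char* classify_bits(double d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    std::uint64_t e = (bits >> 52) & 0x7FF;          // 11-bit biased exponent
    std::uint64_t f = bits & ((1ULL << 52) - 1);     // 52-bit significand field

    if (e == 0)     return f == 0 ? "zero"     : "subnormal";
    if (e == 0x7FF) return f == 0 ? "infinity" : "NaN";
    return "normal";
}

int main() {
    std::printf("%s\n", classify_bits(0.0));                                        // zero
    std::printf("%s\n", classify_bits(std::numeric_limits<double>::denorm_min()));  // subnormal
    std::printf("%s\n", classify_bits(1.0));                                        // normal
    std::printf("%s\n", classify_bits(std::numeric_limits<double>::infinity()));    // infinity
}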

Denormalized floating point in Objective-C?

As I said in response to your comment there:

it is more of a CPU than a language issue, so it probably has
relevance for Objective-C on x86. (iPhone's ARMv7 doesn't seem to support
denormalized floats, at least with the default runtime/build settings)

Update

I just tested. On Mac OS X on x86 the slowdown is observed, on iOS on ARMv7 it is not (default build settings).

And, as is to be expected, when running on the iOS simulator (on x86), denormalized floats appear again.

Interestingly, FLT_MIN and DBL_MIN are defined as the smallest non-denormalized numbers (on iOS, Mac OS X, and Linux). Strange things happen when using

DBL_MIN/2.0

in your code; the compiler happily emits a denormalized constant, but as soon as the (ARM) CPU touches it, it is flushed to zero:

double test = DBL_MIN/2.0;
printf("test == 0.0 %d\n",test==0.0);
printf("DBL_MIN/2 == 0.0 %d\n",DBL_MIN/2.0==0.0);

Outputs:

test      == 0.0 1  // computer says YES
DBL_MIN/2 == 0.0 0 // compiler says NO

So a quick runtime check if denormalization is supported can be:

#define SUPPORT_DENORMALIZATION ({volatile double t=DBL_MIN/2.0;t!=0.0;})

("given without even the implied warranty of fitness for any purpose")

This is what ARM has to say on flush to zero mode: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204h/Bcfheche.html

Update<<1

This is how you disable flush to zero mode on ARMv7:

int x;
asm(
    "vmrs %[result],FPSCR \r\n"               // read the floating-point status/control register
    "bic %[result],%[result],#16777216 \r\n"  // clear bit 24 (0x1000000), the flush-to-zero (FZ) bit
    "vmsr FPSCR,%[result]"                    // write it back
    :[result] "=r" (x) : :
);
printf("ARM FPSCR: %08x\n",x);

with the following surprising result.

  • Column 1: a float, divided by 2 for every iteration
  • Column 2: the binary representation of this float
  • Column 3: the time taken to sum this float 1e7 times

You can clearly see that the denormalization comes at zero cost. (For an iPad 2. On iPhone 4, it comes at a small cost of a 10% slowdown.)

0.000000000000000000000000000000000100000004670110: 10111100001101110010000011100000 110 ms
0.000000000000000000000000000000000050000002335055: 10111100001101110010000101100000 110 ms
0.000000000000000000000000000000000025000001167528: 10111100001101110010000001100000 110 ms
0.000000000000000000000000000000000012500000583764: 10111100001101110010000110100000 110 ms
0.000000000000000000000000000000000006250000291882: 10111100001101110010000010100000 111 ms
0.000000000000000000000000000000000003125000145941: 10111100001101110010000100100000 110 ms
0.000000000000000000000000000000000001562500072970: 10111100001101110010000000100000 110 ms
0.000000000000000000000000000000000000781250036485: 10111100001101110010000111000000 110 ms
0.000000000000000000000000000000000000390625018243: 10111100001101110010000011000000 110 ms
0.000000000000000000000000000000000000195312509121: 10111100001101110010000101000000 110 ms
0.000000000000000000000000000000000000097656254561: 10111100001101110010000001000000 110 ms
0.000000000000000000000000000000000000048828127280: 10111100001101110010000110000000 110 ms
0.000000000000000000000000000000000000024414063640: 10111100001101110010000010000000 110 ms
0.000000000000000000000000000000000000012207031820: 10111100001101110010000100000000 111 ms
0.000000000000000000000000000000000000006103515209: 01111000011011100100001000000000 110 ms
0.000000000000000000000000000000000000003051757605: 11110000110111001000010000000000 110 ms
0.000000000000000000000000000000000000001525879503: 00010001101110010000100000000000 110 ms
0.000000000000000000000000000000000000000762939751: 00100011011100100001000000000000 110 ms
0.000000000000000000000000000000000000000381469876: 01000110111001000010000000000000 112 ms
0.000000000000000000000000000000000000000190734938: 10001101110010000100000000000000 110 ms
0.000000000000000000000000000000000000000095366768: 00011011100100001000000000000000 110 ms
0.000000000000000000000000000000000000000047683384: 00110111001000010000000000000000 110 ms
0.000000000000000000000000000000000000000023841692: 01101110010000100000000000000000 111 ms
0.000000000000000000000000000000000000000011920846: 11011100100001000000000000000000 110 ms
0.000000000000000000000000000000000000000005961124: 01111001000010000000000000000000 110 ms
0.000000000000000000000000000000000000000002980562: 11110010000100000000000000000000 110 ms
0.000000000000000000000000000000000000000001490982: 00010100001000000000000000000000 110 ms
0.000000000000000000000000000000000000000000745491: 00101000010000000000000000000000 110 ms
0.000000000000000000000000000000000000000000372745: 01010000100000000000000000000000 110 ms
0.000000000000000000000000000000000000000000186373: 10100001000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000092486: 01000010000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000046243: 10000100000000000000000000000000 111 ms
0.000000000000000000000000000000000000000000022421: 00001000000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000011210: 00010000000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000005605: 00100000000000000000000000000000 111 ms
0.000000000000000000000000000000000000000000002803: 01000000000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000001401: 10000000000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 110 ms

Performance penalty: denormalized numbers versus branch mis-predictions

There's HW support for this for free in many ISAs including x86, see below re: FTZ / DAZ. Most compilers set those flags during startup when you compile with -ffast-math or equivalent.

Also note that your code fails to avoid the penalty (on HW where there is any) in some cases: y * y or z * z can be subnormal for small but normalized y or z. (Good catch, @chtz). The exponent of y*y is twice the exponent of y, more negative or more positive. With 23 explicit mantissa bits in a float, that's about 12 exponent values that are the square roots of subnormal values, and wouldn't underflow all the way to 0.
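
A tiny sketch (mine) of that y * y case with default (non-FTZ) settings: a perfectly normal float can have a subnormal square:

#include <cmath>
#include <cstdio>

int main() {
    float y  = 1.5e-20f;   // normal: well above FLT_MIN (~1.18e-38)
    float yy = y * y;      // ~2.25e-40: below FLT_MIN but above the smallest subnormal, so subnormal

    std::printf("y   is %s\n", std::fpclassify(y)  == FP_SUBNORMAL ? "subnormal" : "normal");
    std::printf("y*y is %s\n", std::fpclassify(yy) == FP_SUBNORMAL ? "subnormal" : "normal");
}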

Squaring a subnormal always gives underflow to 0; subnormal input may be less likely to have a penalty than subnormal output for a multiply, I don't know. Having a subnormal penalty or not can vary by operation within one microarchitecture, like add/sub vs. multiply vs. divide.

Also, any negative y or z gets treated as 0, which is probably a bug unless your inputs are known non-negative.

if results can vary so widely, x86 microarchitectures will be my main use case

Yes, penalties (or lack thereof) vary greatly.

Historically (P6-family) Intel used to always take a very slow microcode assist for subnormal results and subnormal inputs, including for compares. Modern Intel CPUs (Sandybridge-family) handle some but not all FP operations on subnormal operands without needing a microcode assist. (perf event fp_assists.any)

The microcode assist is like an exception and flushes the out-of-order pipeline, and takes over 160 cycles on SnB-family, vs. ~10 to 20 for a branch miss.
And branch misses have "fast recovery" on modern CPUs. True branch-miss penalty depends on surrounding code; e.g. if the branch condition is really late to be ready it can result in discarding a lot of later independent work. But a microcode assist is still probably worse if you expect it to happen frequently.

Note that you can check for a subnormal using integer ops: just check the exponent field for all zero (and the mantissa for non-zero: the all-zero encoding for 0.0 is technically a special case of a subnormal). So you could manually flush to zero with integer SIMD operations like andps / pcmpeqd / andps.
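
A rough sketch of that idea with SSE2 intrinsics (my own, in the spirit of the sequence named above, not tuned or tested for speed):

#include <immintrin.h>

// Manually flush subnormal floats to zero: lanes whose exponent field is all-zero
// (subnormal or zero) are masked to +0.0; everything else passes through unchanged.
// Note this also turns -0.0 into +0.0.
static inline __m128 flush_subnormals_to_zero(__m128 v) {
    const __m128i exp_mask = _mm_set1_epi32(0x7F800000);          // binary32 exponent field
    __m128i bits = _mm_castps_si128(v);
    __m128i exp  = _mm_and_si128(bits, exp_mask);                 // isolate the exponent bits
    __m128i subnormal_or_zero = _mm_cmpeq_epi32(exp, _mm_setzero_si128());
    return _mm_andnot_ps(_mm_castsi128_ps(subnormal_or_zero), v); // clear those lanes
}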

Agner Fog's microarch PDF has some info; he mentions this in general without a fully detailed breakdown for each uarch. I don't think https://uops.info/ tests for normal vs. subnormal unfortunately.

Knight's Landing (KNL) only has subnormal penalties for division, not add / mul. Like GPUs, they took an approach that favoured throughput over latency, with enough pipeline stages in their FPU to handle subnormals in hardware (the equivalent of doing it branchlessly), even though this may mean higher latency for every FP operation.

AMD Bulldozer / Piledriver have a ~175 cycle penalty for results that are "subnormal or underflow", unless FTZ is set. Agner doesn't mention subnormal inputs. Steamroller/Excavator don't have any penalties.

AMD Ryzen (from Agner Fog's microarch pdf)

Floating point operations that give a subnormal result take a few clock cycles extra. The same is the case when a multiplication or division underflows to zero. This is far less than the high penalty on the Bulldozer and Piledriver. There is no penalty when flush-to-zero mode and denormals-are-zero mode are both on.

By contrast, Intel Sandybridge-family (at least Skylake) doesn't have penalties for results that underflow all the way to 0.0.

Intel Silvermont (Atom) from Agner Fog's microarch pdf

Operations that have subnormal numbers as input or output or generate underflow take approximately 160 clock cycles unless the flush-to-zero mode and denormals-are-zero mode are both used.

This would include compares.


I don't know the details for any non-x86 microarchitectures, like ARM Cortex-A76 or any RISC-V core, to pick a couple of random examples that might also be relevant. Mispredict penalties vary wildly as well, across simple in-order pipelines vs. deep out-of-order exec CPUs like modern x86. True mispredict penalty also depends on surrounding code.


And now assume I want to avoid the performance penalty of dealing with denormal numbers and I just want to treat them as 0

Then you should set your FPU to do that for you for free, removing all possibility of penalties from subnormals.

Some / most(?) modern FPUs (including x86 SSE but not legacy x87) let you treat subnormals (aka denormals) as zero for free, so this problem only occurs if you want this behaviour for some functions but not all within the same thread, and with switching too fine-grained to be worth changing the FP control register to FTZ and back.

Or could be relevant if you wanted to write fully portable code that was terrible nowhere, even if it meant ignoring HW support and thus being slower than it could be.

Some x86 CPUs do even rename MXCSR so changing the rounding mode or FTZ/DAZ might not have to drain the out-of-order back-end. It's still not cheap and you'd want to avoid doing it every few FP instructions.

ARM also supports a similar feature: subnormal IEEE 754 floating point numbers support on iOS ARM devices (iPhone 4) - but apparently the default setting for ARM VFP / NEON is to treat subnormals as zero, favouring performance over strict IEEE compliance.

See also flush-to-zero behavior in floating-point arithmetic about cross-platform availability of this.


On x86 the specific mechanism is that you set the DAZ and FTZ bits in the MXCSR register (SSE FP math control register; also has bits for FP rounding mode, FP exception masks, and sticky FP masked-exception status bits). https://software.intel.com/en-us/articles/x87-and-sse-floating-point-assists-in-ia-32-flush-to-zero-ftz-and-denormals-are-zero-daz shows the layout and also discusses some performance effects on older Intel CPUs. Lots of good background / introduction.

Compiling with -ffast-math will link in some extra startup code that sets FTZ/DAZ before calling main. IIRC, threads inherit the MXCSR settings from the main thread on most OSes.

  • DAZ = Denormals Are Zero: treats input subnormals as zero. This affects compares (whether or not they would have experienced a slowdown), making it impossible to even tell the difference between 0 and a subnormal other than using integer operations on the bit pattern.
  • FTZ = Flush To Zero: subnormal outputs from calculations are simply underflowed to zero, i.e. gradual underflow is disabled. (Note that multiplying two small normal numbers can underflow. I think add/sub of normal numbers whose mantissas cancel out except for the low few bits could produce a subnormal as well.)

Usually you simply set both or neither. If you're processing input data from another thread or process, or compile-time constants, you could still have subnormal inputs even if all results you produce are normalized or 0.
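
A minimal sketch of setting both bits by hand (assuming x86 with SSE; the _MM_SET_FLUSH_ZERO_MODE / _MM_SET_DENORMALS_ZERO_MODE macros wrap the same bits):

#include <xmmintrin.h>   // _mm_getcsr / _mm_setcsr

// Set FTZ (bit 15) and DAZ (bit 6) in MXCSR for the current thread.
// DAZ is not available on the very earliest SSE-only CPUs.
void set_ftz_daz() {
    _mm_setcsr(_mm_getcsr() | 0x8040);   // 0x8000 = FTZ, 0x0040 = DAZ
}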


Specific random questions:

float x = 0f; // Will x be just 0 or maybe some number like 1e-40;

This is a syntax error. Presumably you mean 0.f or 0.0f.

0.0f is exactly representable (with the bit-pattern 0x00000000) as an IEEE binary32 float, so that's definitely what you will get on any platform that uses IEEE FP. You won't randomly get subnormals that you didn't write.

float z = x / 1; // Will this "no-op" (x == 0) cause z be something like 1e-40 and thus denormal?

No, IEEE754 doesn't allow 0.0 / 1.0 to give anything other than 0.0.

Again, subnormals don't appear out of thin air. Rounding "error" only happens when the exact result can't be represented as a float or double. The max allowed error for the IEEE "basic" operations (* / + - and sqrt) is 0.5 ulp, i.e. the exact result must be correctly rounded to the nearest representable FP value, right down to the last digit of the mantissa.

 bool yzero = y < 1e-37; // Have comparisons any performance penalty when y is denormal or they don't?

Maybe, maybe not. No penalty on recent AMD or Intel, but is slow on Core 2 for example.

Note that 1e-37 has type double and will cause promotion of y to double. You might hope that this would actually avoid subnormal penalties vs. using 1e-37f. Subnormal float->int has no penalty on Core 2, but unfortunately cvtss2sd does still have the large penalty on Core 2. (GCC/clang don't optimize away the conversion even with -ffast-math, although I think they could, because 1e-37 is exactly representable as a float, and every subnormal float can be exactly represented as a normalized double. So the promotion to double is always exact and can't change the result.)

On Intel Skylake, comparing two subnormals with vcmplt_oqpd doesn't result in any slowdown, and not with ucomisd into integer FLAGS either. But on Core 2, both are slow.

Comparison, if done like subtraction, does have to shift the inputs to line up their binary place values, and the implied leading digit of the significand is a 0 instead of a 1, so subnormals are a special case. Hardware might therefore choose not to handle that on the fast path and instead take a microcode assist; older x86 hardware might handle this more slowly.

It could be done differently if you built a special compare ALU separate from the normal add/sub unit. Float bit-patterns can be compared as sign/magnitude integers (with a special case for NaN) because the IEEE exponent bias is chosen to make that work. (i.e. nextafter is just integer ++ or -- on the bit pattern). But this apparently isn't what hardware does.
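
A sketch (mine) of that bit-pattern trick, valid for positive finite floats:

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// For positive finite floats, incrementing the bit pattern as an integer gives the
// next representable float, because the exponent bias keeps the encodings ordered.
// (Negative numbers, infinities, and NaN need extra care.)
float next_up_positive(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    ++bits;
    std::memcpy(&x, &bits, sizeof bits);
    return x;
}

int main() {
    float x = 1.0f;
    std::printf("%d\n", next_up_positive(x) == std::nextafter(x, 2.0f));   // 1
}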


FP conversion to integer is fast even on Core 2, though. cvt[t]ps2dq or the pd equivalent convert packed float/double to int32 with truncation or the current rounding mode. So for example this recent proposed LLVM optimization is safe on Skylake and Core 2, according to my testing.

Also on Skylake, squaring a subnormal (producing a 0) has no penalty. But it does have a huge penalty on Conroe (P6-family).

But multiplying normal numbers to produce a subnormal result has a penalty even on Skylake (~150x slower).

Why does changing 0.1f to 0 slow down performance by 10x?

Welcome to the world of denormalized floating-point! They can wreak havoc on performance!!!

Denormal (or subnormal) numbers are kind of a hack to get some extra values very close to zero out of the floating point representation. Operations on denormalized floating-point can be tens to hundreds of times slower than on normalized floating-point. This is because many processors can't handle them directly and must trap and resolve them using microcode.

If you print out the numbers after 10,000 iterations, you will see that they have converged to different values depending on whether 0 or 0.1 is used.

Here's the test code compiled on x64:

#include <cstdlib>    // system
#include <iostream>
#include <omp.h>      // omp_get_wtime (compile with OpenMP enabled)

using namespace std;

int main() {

    double start = omp_get_wtime();

    const float x[16] = {1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4,2.5,2.6};
    const float z[16] = {1.123,1.234,1.345,156.467,1.578,1.689,1.790,1.812,1.923,2.034,2.145,2.256,2.367,2.478,2.589,2.690};
    float y[16];
    for (int i = 0; i < 16; i++)
    {
        y[i] = x[i];
    }
    for (int j = 0; j < 9000000; j++)
    {
        for (int i = 0; i < 16; i++)
        {
            y[i] *= x[i];
            y[i] /= z[i];
#ifdef FLOATING
            y[i] = y[i] + 0.1f;
            y[i] = y[i] - 0.1f;
#else
            y[i] = y[i] + 0;
            y[i] = y[i] - 0;
#endif

            if (j > 10000)
                cout << y[i] << " ";
        }
        if (j > 10000)
            cout << endl;
    }

    double end = omp_get_wtime();
    cout << end - start << endl;

    system("pause");
    return 0;
}

Output:

#define FLOATING
1.78814e-007 1.3411e-007 1.04308e-007 0 7.45058e-008 6.70552e-008 6.70552e-008 5.58794e-007 3.05474e-007 2.16067e-007 1.71363e-007 1.49012e-007 1.2666e-007 1.11759e-007 1.04308e-007 1.04308e-007
1.78814e-007 1.3411e-007 1.04308e-007 0 7.45058e-008 6.70552e-008 6.70552e-008 5.58794e-007 3.05474e-007 2.16067e-007 1.71363e-007 1.49012e-007 1.2666e-007 1.11759e-007 1.04308e-007 1.04308e-007

//#define FLOATING
6.30584e-044 3.92364e-044 3.08286e-044 0 1.82169e-044 1.54143e-044 2.10195e-044 2.46842e-029 7.56701e-044 4.06377e-044 3.92364e-044 3.22299e-044 3.08286e-044 2.66247e-044 2.66247e-044 2.24208e-044
6.30584e-044 3.92364e-044 3.08286e-044 0 1.82169e-044 1.54143e-044 2.10195e-044 2.45208e-029 7.56701e-044 4.06377e-044 3.92364e-044 3.22299e-044 3.08286e-044 2.66247e-044 2.66247e-044 2.24208e-044

Note how in the second run the numbers are very close to zero.

Denormalized numbers are generally rare and thus most processors don't try to handle them efficiently.


To demonstrate that this has everything to do with denormalized numbers, if we flush denormals to zero by adding this to the start of the code:

_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

Then the version with 0 is no longer 10x slower and actually becomes faster. (This requires the code to be compiled with SSE enabled; the macro comes from <xmmintrin.h>.)

This means that rather than using these weird lower precision almost-zero values, we just round to zero instead.

Timings: Core i7 920 @ 3.5 GHz:

//  Don't flush denormals to zero.
0.1f: 0.564067
0 : 26.7669

// Flush denormals to zero.
0.1f: 0.587117
0 : 0.341406

In the end, this really has nothing to do with whether it's an integer or floating-point. The 0 or 0.1f is converted/stored into a register outside of both loops. So that has no effect on performance.

Flushing denormalised numbers to zero

You're looking for a platform-defined way to set FTZ and/or DAZ in the MXCSR register (on x86 with SSE or x86-64); see https://stackoverflow.com/a/2487733/567292

Usually this is called something like _controlfp; Microsoft documentation is at http://msdn.microsoft.com/en-us/library/e9b52ceh.aspx

You can also use the _MM_SET_FLUSH_ZERO_MODE macro: http://msdn.microsoft.com/en-us/library/a8b5ts9s(v=vs.71).aspx - this is probably the most cross-platform portable method.
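
A short sketch of the macro approach (assuming x86 with the SSE/SSE3 intrinsic headers available):

#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

int main() {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);           // FTZ: subnormal results become 0
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);   // DAZ: subnormal inputs treated as 0
    // ... numeric code in this thread now runs with denormals flushed ...
    return 0;
}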

Create denormalized (subnormal) floating point values in C++

You could use std::numeric_limits<T>::denorm_min() and its negation, -std::numeric_limits<T>::denorm_min(). It is just incidental that the produced denormalized values have a special characteristic (being the smallest in magnitude). If you don't want that, multiply them by some reasonably small integer value.
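
A sketch of that suggestion:

#include <cmath>
#include <cstdio>
#include <limits>

int main() {
    float smallest = std::numeric_limits<float>::denorm_min();   // ~1.4e-45, smallest positive subnormal
    float another  = smallest * 37;                               // still well below FLT_MIN, so still subnormal

    std::printf("%d %d\n",
                std::fpclassify(smallest) == FP_SUBNORMAL,
                std::fpclassify(another)  == FP_SUBNORMAL);       // 1 1 (unless FTZ/DAZ is enabled)
}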


