Optimizations for pow() with const non-integer exponent?

In the IEEE 754 hacking vein, here is another solution which is faster and less "magical." It achieves an error margin of about 0.08% in about a dozen clock cycles (for the case of p = 2.4, on an Intel Merom CPU).

Floating point numbers were originally invented as an approximation to logarithms, so you can use a float's integer bit pattern as an approximation of log2(x). This is somewhat-portably achievable by applying the convert-from-integer instruction to the bits of a floating-point value, to obtain another floating-point value.

To complete the pow computation, you can multiply by a constant factor and convert the logarithm back with the convert-to-integer instruction. On SSE, the relevant instructions are cvtdq2ps and cvtps2dq.
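As a scalar illustration of that identity (a minimal sketch, assuming the usual IEEE 754 single-precision layout; not part of the SSE routine below):

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    float x = 5.0f;
    std::int32_t bits;
    std::memcpy( &bits, &x, sizeof bits );  // read the float's bit pattern
    // For positive x, bits ~= ( log2(x) + 127 ) * 2^23.
    double approx = bits / double( 1 << 23 ) - 127;
    std::printf( "approx log2(%g) = %g, true = %g\n", x, approx, std::log2( x ) );
}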

It's not quite so simple, though. The exponent field in IEEE 754 is stored with a bias: a stored value of 127 represents an exponent of zero. This bias must be removed before you multiply the logarithm, and re-added before you exponentiate. Furthermore, bias adjustment by subtraction won't work on zero. Fortunately, both adjustments can be achieved by multiplying by a constant factor beforehand.

x^p
= exp2( p * log2( x ) )
= exp2( p * ( log2( x ) + 127 - 127 ) - 127 + 127 )
= cvtps2dq( p * ( log2( x ) + 127 - 127 + 127 / p ) )
= cvtps2dq( p * ( log2( x ) + 127 - log2( exp2( 127 - 127 / p ) ) ) )
= cvtps2dq( p * ( log2( x * exp2( 127 / p - 127 ) ) + 127 ) )
= cvtps2dq( p * ( cvtdq2ps( x * exp2( 127 / p - 127 ) ) ) )

Here cvtdq2ps( x ) stands for "reinterpret the bits of x as an integer and convert to float," which yields approximately log2( x ) + 127, and cvtps2dq( y ) for the reverse, which yields approximately exp2( y - 127 ); the 2^23 scaling of the bit pattern cancels between the two conversions and is omitted throughout.

exp2( 127 / p - 127 ) is the constant factor. This function is rather specialized: it won't work with small fractional exponents, because the constant factor grows exponentially with the inverse of the exponent and will overflow. It won't work with negative exponents. Large exponents lead to high error, because the mantissa bits are mingled with the exponent bits by the multiplication.

But, it's just 4 fast instructions long. Pre-multiply, convert from "integer" (to logarithm), power-multiply, convert to "integer" (from logarithm). Conversions are very fast on this implementation of SSE. We can also squeeze an extra constant coefficient into the first multiplication.

template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
__m128 fastpow( __m128 arg ) {
    __m128 ret = arg;
    // std::printf( "arg = %,vg\n", ret );
    // Apply a constant pre-correction factor.
    ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. )
        * pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
    // std::printf( "scaled = %,vg\n", ret );
    // Reinterpret arg as integer to obtain logarithm.
    asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) );
    // std::printf( "log = %,vg\n", ret );
    // Multiply logarithm by power.
    ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
    // std::printf( "powered = %,vg\n", ret );
    // Convert back to "integer" to exponentiate.
    asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) );
    // std::printf( "result = %,vg\n", ret );
    return ret;
}

A few trials with exponent = 2.4 show this consistently overestimates by about 5%. (The routine is always guaranteed to overestimate.) You could simply multiply by 0.95, but a few more instructions will get us about 4 decimal digits of accuracy, which should be enough for graphics.

The key is to match the overestimate with an underestimate, and take the average.

  • Compute x^0.8: four instructions, error ~ +3%.
  • Compute x^-0.4: one rsqrtps. (This is quite accurate enough, but does sacrifice the ability to work with zero.)
  • Compute x^0.4: one mulps.
  • Compute x^-0.2: one rsqrtps.
  • Compute x^2: one mulps.
  • Compute x^3: one mulps.
  • x^2.4 = x^2 * x^0.4: one mulps. This is the overestimate.
  • x^2.4 = x^3 * x^-0.4 * x^-0.2: two mulps. This is the underestimate.
  • Average the above: one addps, one mulps.

Instruction tally: fourteen, including two conversions with latency = 5 and two reciprocal square root estimates with throughput = 4.

To properly take the average, we want to weight the estimates by their expected errors. The underestimate raises the error to a power of 0.6 vs 0.4, so we expect it to be 1.5x as erroneous. Weighting doesn't add any instructions; it can be done in the pre-factor. Calling the coefficient a: a^0.5 = 1.5 a^-0.75, and a = 1.38316186.
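Spelling that condition out: the overestimate carries a factor a^0.5 and the underestimate a factor a^-0.75 (both inherited from the coefficient fed to fastpow), so requiring the overestimate's weight to be 1.5x the underestimate's gives

a^0.5 = 1.5 * a^-0.75
a^1.25 = 1.5
a = 1.5^(1/1.25) = 1.5^0.8 = 1.38316186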

The final error is about 0.015%, or two orders of magnitude better than the initial fastpow result. The runtime is about a dozen cycles for a busy loop with volatile source and destination variables… and although that loop overlaps iterations, real-world usage will also see instruction-level parallelism. Considering SIMD, that's a throughput of one scalar result per 3 cycles!

int main() {
    __m128 const x0 = _mm_set_ps( 0.01, 1, 5, 1234.567 );
    std::printf( "Input: %,vg\n", x0 );

    // Approx 5% accuracy from one call. Always an overestimate.
    __m128 x1 = fastpow< 24, 10, 1, 1 >( x0 );
    std::printf( "Direct x^2.4: %,vg\n", x1 );

    // Lower exponents provide lower initial error, but too low causes overflow.
    __m128 xf = fastpow< 8, 10, int( 1.38316186 * 1e9 ), int( 1e9 ) >( x0 );
    std::printf( "1.38 x^0.8: %,vg\n", xf );

    // Imprecise 4-cycle sqrt is still far better than fastpow, good enough.
    __m128 xfm4 = _mm_rsqrt_ps( xf );
    __m128 xf4 = _mm_mul_ps( xf, xfm4 );

    // Precisely calculate x^2 and x^3.
    __m128 x2 = _mm_mul_ps( x0, x0 );
    __m128 x3 = _mm_mul_ps( x2, x0 );

    // Overestimate of x^2 * x^0.4.
    x2 = _mm_mul_ps( x2, xf4 );

    // Get x^-0.2 from x^0.4. Combine with x^-0.4 into x^-0.6 and x^2.4.
    __m128 xfm2 = _mm_rsqrt_ps( xf4 );
    x3 = _mm_mul_ps( x3, xfm4 );
    x3 = _mm_mul_ps( x3, xfm2 );

    std::printf( "x^2 * x^0.4: %,vg\n", x2 );
    std::printf( "x^3 / x^0.6: %,vg\n", x3 );
    x2 = _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1 / 1.960131704207789 ) );
    // Final accuracy about 0.015%, 200x better than x^0.8 calculation.
    std::printf( "average = %,vg\n", x2 );
}

Well… sorry I wasn't able to post this sooner. And extending it to x^(1/2.4) is left as an exercise ;v).


Update with stats

I implemented a little test harness and two x^(5/12) cases corresponding to the above.

#include <cstdio>
#include <xmmintrin.h>
#include <emmintrin.h> // SSE2, for the integer ops in mm_succ_ps
#include <cmath>
#include <cfloat>
#include <algorithm>
using namespace std;

template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
__m128 fastpow( __m128 arg ) {
    __m128 ret = arg;
    // std::printf( "arg = %,vg\n", ret );
    // Apply a constant pre-correction factor.
    ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. )
        * pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
    // std::printf( "scaled = %,vg\n", ret );
    // Reinterpret arg as integer to obtain logarithm.
    asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) );
    // std::printf( "log = %,vg\n", ret );
    // Multiply logarithm by power.
    ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
    // std::printf( "powered = %,vg\n", ret );
    // Convert back to "integer" to exponentiate.
    asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) );
    // std::printf( "result = %,vg\n", ret );
    return ret;
}

__m128 pow125_4( __m128 arg ) {
    // Lower exponents provide lower initial error, but too low causes overflow.
    __m128 xf = fastpow< 4, 5, int( 1.38316186 * 1e9 ), int( 1e9 ) >( arg );

    // Imprecise 4-cycle sqrt is still far better than fastpow, good enough.
    __m128 xfm4 = _mm_rsqrt_ps( xf );
    __m128 xf4 = _mm_mul_ps( xf, xfm4 );

    // Precisely calculate x^2 and x^3.
    __m128 x2 = _mm_mul_ps( arg, arg );
    __m128 x3 = _mm_mul_ps( x2, arg );

    // Overestimate of x^2 * x^0.4.
    x2 = _mm_mul_ps( x2, xf4 );

    // Get x^-0.2 from x^0.4, and square it for x^-0.4. Combine into x^-0.6.
    __m128 xfm2 = _mm_rsqrt_ps( xf4 );
    x3 = _mm_mul_ps( x3, xfm4 );
    x3 = _mm_mul_ps( x3, xfm2 );

    return _mm_mul_ps( _mm_add_ps( x2, x3 ),
        _mm_set1_ps( 1 / 1.960131704207789 * 0.9999 ) );
}

__m128 pow512_2( __m128 arg ) {
    // 5/12 is too small, so compute the sqrt of 10/12 instead.
    __m128 x = fastpow< 5, 6, int( 0.992245 * 1e9 ), int( 1e9 ) >( arg );
    return _mm_mul_ps( _mm_rsqrt_ps( x ), x );
}

__m128 pow512_4( __m128 arg ) {
    // 5/12 is too small, so compute the 4th root of 20/12 instead.
    // 20/12 = 5/3 = 1 + 2/3 = 2 - 1/3. 2/3 is a suitable argument for fastpow.
    // weighting coefficient: a^-1/2 = 2 a; a = 2^-2/3
    __m128 xf = fastpow< 2, 3, int( 0.629960524947437 * 1e9 ), int( 1e9 ) >( arg );
    __m128 xover = _mm_mul_ps( arg, xf );

    __m128 xfm1 = _mm_rsqrt_ps( xf );
    __m128 x2 = _mm_mul_ps( arg, arg );
    __m128 xunder = _mm_mul_ps( x2, xfm1 );

    // sqrt2 * over + 2 * sqrt2 * under
    __m128 xavg = _mm_mul_ps( _mm_set1_ps( 1/( 3 * 0.629960524947437 ) * 0.999852 ),
                              _mm_add_ps( xover, xunder ) );

    xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) );
    xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) );
    return xavg;
}

/* Advance each lane's bit pattern by 4, i.e. sample every 4th representable float. */
__m128 mm_succ_ps( __m128 arg ) {
    return (__m128) _mm_add_epi32( (__m128i) arg, _mm_set1_epi32( 4 ) );
}

void test_pow( double p, __m128 (*f)( __m128 ) ) {
    __m128 arg;

    // Scan upward for the start of the domain: the first argument that
    // produces a finite, nonzero result.
    for ( arg = _mm_set1_ps( FLT_MIN / FLT_EPSILON );
          ! isfinite( _mm_cvtss_f32( f( arg ) ) );
          arg = mm_succ_ps( arg ) ) ;

    for ( ; _mm_cvtss_f32( f( arg ) ) == 0;
          arg = mm_succ_ps( arg ) ) ;

    std::printf( "Domain from %g\n", _mm_cvtss_f32( arg ) );

    int n;
    int const bucket_size = 1 << 25;
    do {
        float max_error = 0;
        double total_error = 0, cum_error = 0;
        for ( n = 0; n != bucket_size; ++ n ) {
            float result = _mm_cvtss_f32( f( arg ) );

            if ( ! isfinite( result ) ) break;

            float actual = ::powf( _mm_cvtss_f32( arg ), p );

            float error = ( result - actual ) / actual;
            cum_error += error;
            error = std::abs( error );
            max_error = std::max( max_error, error );
            total_error += error;

            arg = mm_succ_ps( arg );
        }

        std::printf( "error max = %8g\t" "avg = %8g\t" "|avg| = %8g\t" "to %8g\n",
            max_error, cum_error / n, total_error / n, _mm_cvtss_f32( arg ) );
    } while ( n == bucket_size );
}

int main() {
    std::printf( "4 insn x^12/5:\n" );
    test_pow( 12./5, & fastpow< 12, 5, 1059, 1000 > );
    std::printf( "14 insn x^12/5:\n" );
    test_pow( 12./5, & pow125_4 );
    std::printf( "6 insn x^5/12:\n" );
    test_pow( 5./12, & pow512_2 );
    std::printf( "14 insn x^5/12:\n" );
    test_pow( 5./12, & pow512_4 );
}

Output:

4 insn x^12/5:
Domain from 1.36909e-23
error max = inf avg = inf |avg| = inf to 8.97249e-19
error max = 2267.14 avg = 139.175 |avg| = 139.193 to 5.88021e-14
error max = 0.123606 avg = -0.000102963 |avg| = 0.0371122 to 3.85365e-09
error max = 0.123607 avg = -0.000108978 |avg| = 0.0368548 to 0.000252553
error max = 0.12361 avg = 7.28909e-05 |avg| = 0.037507 to 16.5513
error max = 0.123612 avg = -0.000258619 |avg| = 0.0365618 to 1.08471e+06
error max = 0.123611 avg = 8.70966e-05 |avg| = 0.0374369 to 7.10874e+10
error max = 0.12361 avg = -0.000103047 |avg| = 0.0371122 to 4.65878e+15
error max = 0.123609 avg = nan |avg| = nan to 1.16469e+16
14 insn x^12/5:
Domain from 1.42795e-19
error max = inf avg = nan |avg| = nan to 9.35823e-15
error max = 0.000936462 avg = 2.0202e-05 |avg| = 0.000133764 to 6.13301e-10
error max = 0.000792752 avg = 1.45717e-05 |avg| = 0.000129936 to 4.01933e-05
error max = 0.000791785 avg = 7.0132e-06 |avg| = 0.000129923 to 2.63411
error max = 0.000787589 avg = 1.20745e-05 |avg| = 0.000129347 to 172629
error max = 0.000786553 avg = 1.62351e-05 |avg| = 0.000132397 to 1.13134e+10
error max = 0.000785586 avg = 8.25205e-06 |avg| = 0.00013037 to 6.98147e+12
6 insn x^5/12:
Domain from 9.86076e-32
error max = 0.0284339 avg = 0.000441158 |avg| = 0.00967327 to 6.46235e-27
error max = 0.0284342 avg = -5.79938e-06 |avg| = 0.00897913 to 4.23516e-22
error max = 0.0284341 avg = -0.000140706 |avg| = 0.00897084 to 2.77556e-17
error max = 0.028434 avg = 0.000440504 |avg| = 0.00967325 to 1.81899e-12
error max = 0.0284339 avg = -6.11153e-06 |avg| = 0.00897915 to 1.19209e-07
error max = 0.0284298 avg = -0.000140597 |avg| = 0.00897084 to 0.0078125
error max = 0.0284371 avg = 0.000439748 |avg| = 0.00967319 to 512
error max = 0.028437 avg = -7.74294e-06 |avg| = 0.00897924 to 3.35544e+07
error max = 0.0284369 avg = -0.000142036 |avg| = 0.00897089 to 2.19902e+12
error max = 0.0284368 avg = 0.000439183 |avg| = 0.0096732 to 1.44115e+17
error max = 0.0284367 avg = -7.41244e-06 |avg| = 0.00897923 to 9.44473e+21
error max = 0.0284366 avg = -0.000141706 |avg| = 0.00897088 to 6.1897e+26
error max = 0.485129 avg = -0.0401671 |avg| = 0.048422 to 4.05648e+31
error max = 0.994932 avg = -0.891494 |avg| = 0.891494 to 2.65846e+36
error max = 0.999329 avg = nan |avg| = nan to -0
14 insn x^5/12:
Domain from 2.64698e-23
error max = 0.13556 avg = 0.00125936 |avg| = 0.00354677 to 1.73472e-18
error max = 0.000564988 avg = 2.51458e-06 |avg| = 0.000113709 to 1.13687e-13
error max = 0.000565065 avg = -1.49258e-06 |avg| = 0.000112553 to 7.45058e-09
error max = 0.000565143 avg = 1.5293e-06 |avg| = 0.000112864 to 0.000488281
error max = 0.000565298 avg = 2.76457e-06 |avg| = 0.000113713 to 32
error max = 0.000565453 avg = -1.61276e-06 |avg| = 0.000112561 to 2.09715e+06
error max = 0.000565531 avg = 1.42628e-06 |avg| = 0.000112866 to 1.37439e+11
error max = 0.000565686 avg = 2.71505e-06 |avg| = 0.000113715 to 9.0072e+15
error max = 0.000565763 avg = -1.56586e-06 |avg| = 0.000112415 to 1.84467e+19

I suspect the accuracy of the more accurate 5/12 routine is limited by the rsqrt operation.
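If so, a standard remedy (not used in the code above) is one Newton-Raphson step on each rsqrtps estimate, which roughly doubles the number of correct bits at the cost of three extra multiplies and a subtract:

#include <xmmintrin.h>

/* One Newton-Raphson refinement of the hardware reciprocal square root
   estimate: r' = r * (1.5 - 0.5 * x * r * r). Takes the ~12-bit estimate
   from rsqrtps to roughly 23 bits. A sketch, not part of the original post. */
static inline __m128 rsqrt_nr( __m128 x ) {
    __m128 r = _mm_rsqrt_ps( x );
    __m128 halfx = _mm_mul_ps( _mm_set1_ps( 0.5f ), x );
    __m128 rr = _mm_mul_ps( r, r );
    return _mm_mul_ps( r, _mm_sub_ps( _mm_set1_ps( 1.5f ), _mm_mul_ps( halfx, rr ) ) );
}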

Optimizing pow algorithm

You might take a look at: Optimized Approximative pow() in C / C++

It also includes a benchmark.

GLSL: pow vs multiplication for integer exponent

While this can definitely be hardware/vendor/compiler dependent, advanced mathematical functions like pow() tend to be considerably more expensive than basic operations.

The best approach is of course to try both and benchmark. But if there is a simple replacement for an advanced mathematical function, I don't think you can go very wrong by using it.

If you write pow(x, 3.0), the best you can probably hope for is that the compiler will recognize the special case, and expand it. But why take the risk, if the replacement is just as short and easy to read? C/C++ compilers don't always replace pow(x, 2.0) by a simple multiplication, so I wouldn't necessarily count on all GLSL compilers to do that.
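As a C++ analogue of the same trade-off (illustrative only; the function names are made up):

#include <cmath>

// Whether the library call compiles down to the plain multiplies depends on
// the compiler and its flags; writing the product directly leaves no doubt.
double cube_pow( double x ) { return std::pow( x, 3.0 ); }
double cube_mul( double x ) { return x * x * x; }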

Calculating in an optimized way pow(x,y) for 0 <= x <= 1 and 1 <= y <= 2 in C++

Replace pow(x,y) by

exp(y*log(x))

This is also sometimes how the C library implements it, but it gives slightly distorted results for many integer inputs. Thus the pow(x,y) implementation usually has an overhead to catch trivial powers with exponents 0 and 1 and integer powers, and does some other transformations to get the most precise result. Cutting out this overhead may already be a sufficient speed-up.
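A minimal sketch of that stripped-down version (assuming x > 0, since log(x) is undefined otherwise; the function name is made up):

#include <cmath>

// pow(x,y) as exp(y*log(x)), with none of the library's special-case
// handling; expect results that differ from pow() in the last bits.
double pow_nochecks( double x, double y ) {
    return std::exp( y * std::log( x ) );
}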

The most efficient way to implement an integer-based power function pow(int, int)

Exponentiation by squaring.

int ipow(int base, int exp)
{
    int result = 1;
    for (;;)
    {
        if (exp & 1)
            result *= base;
        exp >>= 1;
        if (!exp)
            break;
        base *= base;
    }

    return result;
}

This is the standard method for doing modular exponentiation for huge numbers in asymmetric cryptography.
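The modular variant differs only in reducing after every multiply; a sketch for moduli below 2^32 (the name and types are my choice, not from the answer):

#include <cstdint>

// Exponentiation by squaring, reducing mod m after each multiply. With
// m < 2^32, the 64-bit intermediate products cannot overflow.
uint32_t ipow_mod( uint32_t base, uint32_t exp, uint32_t m ) {
    uint64_t result = 1 % m;
    uint64_t b = base % m;
    while ( exp ) {
        if ( exp & 1 )
            result = result * b % m;
        exp >>= 1;
        b = b * b % m;
    }
    return (uint32_t)result;
}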

Eigen: coefficient-wise pow with small integer exponent slow

With C++17 if constexpr this could be possible, but otherwise it's not. So currently, a.pow(x) is equivalent to calling std::pow(a[i],x) for each i.
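A sketch of what such a compile-time expansion could look like with C++17 (hypothetical; this is not Eigen's actual implementation):

// The exponent is a template parameter, so every branch resolves at compile
// time and the call unrolls into a handful of multiplies.
template< int N >
double pow_fixed( double a ) {
    if constexpr ( N == 0 )
        return 1.0;
    else if constexpr ( N < 0 )
        return 1.0 / pow_fixed< -N >( a );
    else if constexpr ( N % 2 == 0 ) {
        double h = pow_fixed< N / 2 >( a );
        return h * h;
    }
    else
        return a * pow_fixed< N - 1 >( a );
}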

Is there any way to compute a negative base to the power of a non-integer exponent in Java?

How would you define that power? The standard method is to take the complex logarithm (equivalently, the polar decomposition) of the base and plug the power into that:

pow(-3, 0.6) = exp( 0.6 * (log(3) + i*pi) )
= pow(3,0.6) * (cos(0.6*pi) + i*sin(0.6*pi))

Now you could also choose other branches of the logarithm to represent Ln(-3), since it is also true that -1 = exp(-i*pi) = exp(i*3*pi) = exp(-5*i*pi) = exp(i*7*pi) = …. Only one among those variants will give a real result for the power. But how is the computer to know that you want exactly that variant? And what do you do for pow(-3, 0.61)?
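For comparison, C++'s std::complex computes exactly this principal branch (a sketch; Java itself has no built-in complex type):

#include <complex>
#include <cstdio>

int main() {
    // Principal value of (-3)^0.6 via the complex logarithm. The imaginary
    // part is nonzero, which is why a real-valued pow() has nothing to return.
    std::complex<double> z = std::pow( std::complex<double>( -3.0, 0.0 ), 0.6 );
    std::printf( "(-3)^0.6 = %g + %g i\n", z.real(), z.imag() );
}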

Efficient implementation of fixed-point power (pow) function with argument constraints

The problem as stated by the OP is underspecified: what accuracy is needed, and what performance is required? As a is known to be non-negative, a suitable starting point is to compute a**b as exp2 (b * log2 (a)), with fixed-point implementations for those functions. Based on my experience with fixed-point arithmetic in embedded systems, an s15.16 fixed-point format is often a good compromise between representable range and accuracy when generic floating-point computation is replaced with fixed-point computation. So I will adopt that here.

The principal choices for implementing exp2() and log2() in fixed-point are bitwise computation, quadratic interpolation in a table, and polynomial approximation. The latter two benefit from a hardware multiplier that can produce a double-width product, and table-based approaches work best with a cache of several kilobytes. The OP's specification of a microcontroller suggests that resources for both code and data may be limited, and that this is fairly low-end hardware. So the bit-wise approach may be a good starting point, as it requires only a tiny table shared between the two functions. The code size is small as well, in particular when the compiler is directed to optimize for code size (-Os with most compilers).

Bit-wise computations for these functions are also known as pseudo-division and pseudo-multiplication, and are closely related to the way Henry Briggs (1561 – 1630) computed his tables of logarithms.

For the computation of the logarithm, we choose, by trial and error, from among a sequence of particular factors, those that, when multiplied with the function argument, normalize it to unity. For every factor chosen, we add up the corresponding logarithm value stored in a table. The sum then corresponds to the logarithm of the function argument. For the integer portion, we want to choose 2**i, i = 1, 2, 4, 8, and for the fractional part we choose 1+2**(-i), i = 1, 2, 3, ..., 16.

The algorithm for the exponential is closely related and quite literally the inverse of the logarithm algorithm. From among the sequence of tabulated logarithms for our chosen factors, we pick those that, when subtracted from the function argument, ultimately reduce it to zero. Starting with unity, we multiply with all the factors whose logarithms we have subtracted in the process. The resulting product corresponds to the result of the exponentiation.

Logarithms of values greater than unity can be computed by applying the algorithm to the reciprocal of the function argument (with negation of the result as appropriate); likewise, exponentiation with negative function arguments requires us to negate the function argument and compute the reciprocal at the end. In this aspect the two algorithms are, unsurprisingly, symmetric as well.

The following ISO-C99 code is a straightforward implementation of these two algorithms. Note that this code uses fixed-point multiplication with rounding; the incremental cost is minimal, and it does enhance the accuracy of the overall pow() computation.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>

/* s15.16 division without rounding */
int32_t fxdiv_s15p16 (int32_t x, int32_t y)
{
    return ((int64_t)x * 65536) / y;
}

/* s15.16 multiplication with rounding */
int32_t fxmul_s15p16 (int32_t x, int32_t y)
{
    int32_t r;
    int64_t t = (int64_t)x * (int64_t)y;
    r = (int32_t)(uint32_t)(((uint64_t)t + (1 << 15)) >> 16);
    return r;
}

/* log2(2**8), ..., log2(2**1), log2(1+2**(-1)), ..., log2(1+2**(-16)) */
const uint32_t tab [20] = {0x80000, 0x40000, 0x20000, 0x10000,
                           0x095c1, 0x0526a, 0x02b80, 0x01663,
                           0x00b5d, 0x005b9, 0x002e0, 0x00170,
                           0x000b8, 0x0005c, 0x0002e, 0x00017,
                           0x0000b, 0x00006, 0x00003, 0x00001};
const int32_t one_s15p16 = 1 * (1 << 16);
const int32_t neg_fifteen_s15p16 = (-15) * (1 << 16);

int32_t fxlog2_s15p16 (int32_t a)
{
    uint32_t x, y;
    int32_t t, r;

    x = (a > one_s15p16) ? fxdiv_s15p16 (one_s15p16, a) : a;
    y = 0;
    /* process integer bits */
    if ((t = x << 8) < one_s15p16) { x = t; y += tab [0]; }
    if ((t = x << 4) < one_s15p16) { x = t; y += tab [1]; }
    if ((t = x << 2) < one_s15p16) { x = t; y += tab [2]; }
    if ((t = x << 1) < one_s15p16) { x = t; y += tab [3]; }
    /* process fraction bits */
    for (int shift = 1; shift <= 16; shift++) {
        if ((t = x + (x >> shift)) < one_s15p16) { x = t; y += tab [3 + shift]; }
    }
    r = (a > one_s15p16) ? y : (0 - y);
    return r;
}

int32_t fxexp2_s15p16 (int32_t a)
{
    uint32_t x, y;
    int32_t t, r;

    if (a <= neg_fifteen_s15p16) return 0; // underflow

    x = (a < 0) ? (-a) : (a);
    y = one_s15p16;
    /* process integer bits */
    if ((t = x - tab [0]) >= 0) { x = t; y = y << 8; }
    if ((t = x - tab [1]) >= 0) { x = t; y = y << 4; }
    if ((t = x - tab [2]) >= 0) { x = t; y = y << 2; }
    if ((t = x - tab [3]) >= 0) { x = t; y = y << 1; }
    /* process fractional bits */
    for (int shift = 1; shift <= 16; shift++) {
        if ((t = x - tab [3 + shift]) >= 0) { x = t; y = y + (y >> shift); }
    }
    r = (a < 0) ? fxdiv_s15p16 (one_s15p16, y) : y;
    return r;
}

/* compute a**b for a >= 0 */
int32_t fxpow_s15p16 (int32_t a, int32_t b)
{
    return fxexp2_s15p16 (fxmul_s15p16 (b, fxlog2_s15p16 (a)));
}

double s15p16_to_double (int32_t a)
{
    return a / 65536.0;
}

int32_t double_to_s15p16 (double a)
{
    return ((int32_t)(a * 65536.0 + ((a < 0) ? (-0.5) : 0.5)));
}

int main (void)
{
    int32_t a = double_to_s15p16 (0.125);
    do {
        int32_t b = double_to_s15p16 (-5);
        do {
            double fa = s15p16_to_double (a);
            double fb = s15p16_to_double (b);
            double reff = pow (fa, fb);
            int32_t res = fxpow_s15p16 (a, b);
            int32_t ref = double_to_s15p16 (reff);
            double fres = s15p16_to_double (res);
            double fref = s15p16_to_double (ref);
            printf ("a=%08x (%15.8e) b=%08x (% 15.8e) res=%08x (% 15.8e) ref=%08x (%15.8e)\n",
                    a, fa, b, fb, res, fres, ref, fref);
            b += double_to_s15p16 (0.5);
        } while (b <= double_to_s15p16 (6));
        printf ("\n");
        a += double_to_s15p16 (0.125);
    } while (a <= double_to_s15p16 (1.0));
    return EXIT_SUCCESS;
}

For processors with a reasonable amount of cache and a fast integer multiplier, we can use quadratic interpolation in tables for very accurate and fast computation of log2() and exp2(). To do so, we make use of a pseudo-floating-point representation where the argument x = 2**i * (1 + f), with f in [0, 1). The fraction f is normalized using the full word size of 32 bits. For the logarithm, we tabulate log2(1 + f), which falls into [0, 1). For the exponential, we tabulate exp2(f) - 1, which also falls into [0, 1). Obviously we need to add 1 to the result of the table interpolation.

For a quadratic interpolation we need a total of three table entries through which we fit a parabola, whose coefficients a and b can be computed on the fly. Calling the difference between the function argument and the closest node dx, we can then add a*dx**2 + b*dx to the corresponding table entry to interpolate. The need for three consecutive table entries to fit the parabola also means we need to tabulate two additional values beyond our primary interpolation interval. The inaccuracy due to the use of fixed-point arithmetic for intermediate computation can be partially compensated by adding a rounding constant at the end, which needs to be determined experimentally.

/* for i = 0 to 65: log2 (1 + i/64) * 2**31 */
uint32_t logTab [66] =
{
0x00000000, 0x02dcf2d1, 0x05aeb4dd, 0x08759c50,
0x0b31fb7d, 0x0de42120, 0x108c588d, 0x132ae9e2,
0x15c01a3a, 0x184c2bd0, 0x1acf5e2e, 0x1d49ee4c,
0x1fbc16b9, 0x22260fb6, 0x24880f56, 0x26e2499d,
0x2934f098, 0x2b803474, 0x2dc4439b, 0x30014ac6,
0x32377512, 0x3466ec15, 0x368fd7ee, 0x38b25f5a,
0x3acea7c0, 0x3ce4d544, 0x3ef50ad2, 0x40ff6a2e,
0x43041403, 0x450327eb, 0x46fcc47a, 0x48f10751,
0x4ae00d1d, 0x4cc9f1ab, 0x4eaecfeb, 0x508ec1fa,
0x5269e12f, 0x5440461c, 0x5612089a, 0x57df3fd0,
0x59a80239, 0x5b6c65aa, 0x5d2c7f59, 0x5ee863e5,
0x60a02757, 0x6253dd2c, 0x64039858, 0x65af6b4b,
0x675767f5, 0x68fb9fce, 0x6a9c23d6, 0x6c39049b,
0x6dd2523d, 0x6f681c73, 0x70fa728c, 0x72896373,
0x7414fdb5, 0x759d4f81, 0x772266ad, 0x78a450b8,
0x7a231ace, 0x7b9ed1c7, 0x7d17822f, 0x7e8d3846,
0x80000000, 0x816fe50b
};

/* for i = 0 to 129: exp2 ((i/128) - 1) * 2**31 */
uint32_t expTab [130] =
{
0x00000000, 0x00b1ed50, 0x0164d1f4, 0x0218af43,
0x02cd8699, 0x0383594f, 0x043a28c4, 0x04f1f656,
0x05aac368, 0x0664915c, 0x071f6197, 0x07db3580,
0x08980e81, 0x0955ee03, 0x0a14d575, 0x0ad4c645,
0x0b95c1e4, 0x0c57c9c4, 0x0d1adf5b, 0x0ddf0420,
0x0ea4398b, 0x0f6a8118, 0x1031dc43, 0x10fa4c8c,
0x11c3d374, 0x128e727e, 0x135a2b2f, 0x1426ff10,
0x14f4efa9, 0x15c3fe87, 0x16942d37, 0x17657d4a,
0x1837f052, 0x190b87e2, 0x19e04593, 0x1ab62afd,
0x1b8d39ba, 0x1c657368, 0x1d3ed9a7, 0x1e196e19,
0x1ef53261, 0x1fd22825, 0x20b05110, 0x218faecb,
0x22704303, 0x23520f69, 0x243515ae, 0x25195787,
0x25fed6aa, 0x26e594d0, 0x27cd93b5, 0x28b6d516,
0x29a15ab5, 0x2a8d2653, 0x2b7a39b6, 0x2c6896a5,
0x2d583eea, 0x2e493453, 0x2f3b78ad, 0x302f0dcc,
0x3123f582, 0x321a31a6, 0x3311c413, 0x340aaea2,
0x3504f334, 0x360093a8, 0x36fd91e3, 0x37fbefcb,
0x38fbaf47, 0x39fcd245, 0x3aff5ab2, 0x3c034a7f,
0x3d08a39f, 0x3e0f680a, 0x3f1799b6, 0x40213aa2,
0x412c4cca, 0x4238d231, 0x4346ccda, 0x44563ecc,
0x45672a11, 0x467990b6, 0x478d74c9, 0x48a2d85d,
0x49b9bd86, 0x4ad2265e, 0x4bec14ff, 0x4d078b86,
0x4e248c15, 0x4f4318cf, 0x506333db, 0x5184df62,
0x52a81d92, 0x53ccf09a, 0x54f35aac, 0x561b5dff,
0x5744fccb, 0x5870394c, 0x599d15c2, 0x5acb946f,
0x5bfbb798, 0x5d2d8185, 0x5e60f482, 0x5f9612df,
0x60ccdeec, 0x62055b00, 0x633f8973, 0x647b6ca0,
0x65b906e7, 0x66f85aab, 0x68396a50, 0x697c3840,
0x6ac0c6e8, 0x6c0718b6, 0x6d4f301f, 0x6e990f98,
0x6fe4b99c, 0x713230a8, 0x7281773c, 0x73d28fde,
0x75257d15, 0x767a416c, 0x77d0df73, 0x792959bb,
0x7a83b2db, 0x7bdfed6d, 0x7d3e0c0d, 0x7e9e115c,
0x80000000, 0x8163daa0,
};

uint8_t clz_tab[32] = {
31, 22, 30, 21, 18, 10, 29, 2, 20, 17, 15, 13, 9, 6, 28, 1,
23, 19, 11, 3, 16, 14, 7, 24, 12, 4, 8, 25, 5, 26, 27, 0};

/* count leading zeros; this is a machine instruction on many architectures */
int32_t clz (uint32_t a)
{
    a |= a >> 16;
    a |= a >> 8;
    a |= a >> 4;
    a |= a >> 2;
    a |= a >> 1;
    return clz_tab [0x07c4acdd * a >> 27];
}

int32_t fxlog2_s15p16 (int32_t arg)
{
    int32_t lz, f1, f2, dx, a, b, approx;
    uint32_t t, idx, x = arg;
    lz = clz (x);
    /* shift off integer bits and normalize fraction 0 <= f <= 1 */
    t = x << (lz + 1);
    /* index table of values log2 (1 + f) using 6 msbs of fraction */
    idx = (unsigned)t >> (32 - 6);
    /* difference between argument and smallest sampling point */
    dx = t - (idx << (32 - 6));
    /* fit parabola through closest three sampling points; find coeffs a, b */
    f1 = (logTab[idx+1] - logTab[idx]);
    f2 = (logTab[idx+2] - logTab[idx]);
    a = f2 - (f1 << 1);
    b = (f1 << 1) - a;
    /* find function value for argument by computing ((a*dx+b)*dx) */
    approx = (int32_t)((((int64_t)a)*dx) >> (32 - 6)) + b;
    approx = (int32_t)((((int64_t)approx)*dx) >> (32 - 6 + 1));
    /* compute fraction result; add experimentally determined rounding constant */
    approx = logTab[idx] + approx + 0x410d;
    /* combine integer and fractional parts of result */
    approx = ((15 - lz) << 16) + (((uint32_t)approx) >> 15);
    return approx;
}

int32_t fxexp2_s15p16 (int32_t x)

