Fast Fixed Point Pow, Log, Exp and Sqrt

Fast fixed point pow, log, exp and sqrt

A very simple solution is to use a decent table-driven approximation. You don't actually need a lot of data if you reduce your inputs correctly. exp(a)==exp(a/2)*exp(a/2), which means you really only need to calculate exp(x) for 1 < x < 2. Over that range, a runga-kutta approximation would give reasonable results with ~16 entries IIRC.

Similarly, sqrt(a) == 2 * sqrt(a/4) == sqrt(4*a) / 2 which means you need only table entries for 1 < a < 4. Log(a) is a bit harder: log(a) == 1 + log(a/e). This is a rather slow iteration, but log(1024) is only 6.9 so you won't have many iterations.

You'd use a similar "integer-first" algorithm for pow: pow(x,y)==pow(x, floor(y)) * pow(x, frac(y)). This works because pow(double, int) is trivial (divide and conquer).

[edit] For the integral component of log(a), it may be useful to store a table 1, e, e^2, e^3, e^4, e^5, e^6, e^7 so you can reduce log(a) == n + log(a/e^n) by a simple hardcoded binary search of a in that table. The improvement from 7 to 3 steps isn't so big, but it means you only have to divide once by e^n instead of n times by e.

[edit 2]
And for that last log(a/e^n) term, you can use log(a/e^n) = log((a/e^n)^8)/8 - each iteration produces 3 more bits ~~by table lookup~~. That keeps your code and table size small. This is typically code for embedded systems, and they don't have large caches.

[edit 3]
That's still not to smart on my side. log(a) = log(2) + log(a/2). You can just store the fixed-point value log2=0.6931471805599, count the number of leading zeroes, shift a into the range used for your lookup table, and multiply that shift (integer) by the fixed-point constant log2. Can be as low as 3 instructions.

Using e for the reduction step just gives you a "nice" log(e)=1.0 constant but that's false optimization. 0.6931471805599 is just as good a constant as 1.0; both are 32 bits constants in 10.22 fixed point. Using 2 as the constant for range reduction allows you to use a bit shift for a division.

[edit 5]
And since you're storing it in Q10.22, you can better store log(65536)=11.09035488. (16 x log(2)). The "x16" means that we've got 4 more bits of precision available.

You still get the trick from edit 2, log(a/2^n) = log((a/2^n)^8)/8. Basically, this gets you a result (a + b/8 + c/64 + d/512) * 0.6931471805599 - with b,c,d in the range [0,7]. a.bcd really is an octal number. Not a surprise since we used 8 as the power. (The trick works equally well with power 2, 4 or 16.)

[edit 4]
Still had an open end. pow(x, frac(y) is just pow(sqrt(x), 2 * frac(y)) and we have a decent 1/sqrt(x). That gives us the far more efficient approach. Say frac(y)=0.101 binary, i.e. 1/2 plus 1/8. Then that means x^0.101 is (x^1/2 * x^1/8). But x^1/2 is just sqrt(x) and x^1/8 is (sqrt(sqrt(sqrt(x))). Saving one more operation, Newton-Raphson NR(x) gives us 1/sqrt(x) so we calculate 1.0/(NR(x)*NR((NR(NR(x))). We only invert the end result, don't use the sqrt function directly.

Fixed point power function

"If i try to represent the value 2.8928 with a Q format of, lets say Q12. The value changes to 11849. Now 10^11849 results in overflow.".

Mixed-type math is pretty hard, and it looks like you should avoid it. What you want is pow(Q12(10.0), Q12(2.8928)) or possibly an optimized pow10(Q12(2.8928)). For the first, see my previous answer. The latter can be optimized by a hardcoded table of powers. pow10(2.8928) is of course pow10(2) * pow10(.5) * pow10(.25) * pow10(.125) * ... - each 1 in the binary representation of 2.8928 corresponds to a single table entry. You may want to calculate the intermediate results in Q19.44 and drop the lowest 32 bits when you return..

Edit: Precision

Storing all the values of pow10(2^-n) up to n=12 has the slight problem that the result is close to 1, namely 1.000562312. If you'd store that as a Q12, you lose precision in rounding. Instead, it may be wise to store the value of pow10(2^-12) as a Q24, the value of pow10(2^-121) as a Q23 etc. Now evaluate Q12 pow10(Q12 exp) starting at the LSB of exp, not the MSB. You need to repeatedly shift the intermediate results as you move up to pow10(0.5) but half of the time you can merge that with the >>12 that's inherent to Q12 multiplication.

Optimizing Fixed-Point Sqrt

Reinventing the wheel, really, and not in a good way. The correct solution is to calculate 1/sqrt(x) using NR, and then multiply once to get x/sqrt(x) - just check for x==0 up front.

The reason why this is so much better is that the NR step for y=1/sqrt(x) is just y = (3-x*y*y)*y/2. That's all straightforward multiplication.

Log2 approximation in fixed-point

By simply counting the leading zero bits in a fixed-point number x, one can determine log2(x) to the closest strictly smaller integer. On many processor architectures, there is a "count leading zeros" machine instruction or intrinsic. Where this is not available, a fairly efficient implementation of clz() can be constructed in a variety of ways, one of which is included in the code below.

To compute the fractional part of the logarithm, the two main obvious contenders are interpolation in a table and minimax polynomial approximation. In this specific case, quadratic interpolation in a fairly small table seems to be the more attractive option. x = 2ⁱ * (1+f), with 0 ≤ f < 1. We determine i as described above and use the leading bits of f to index into the table. A parabola is fit through this and two following table entries, computing the parameters of the parabola on the fly. The result is rounded, and a heuristic adjustment is applied to partially compensate for the truncating nature of fixed-point arithmetic. Finally, the integer portion is added, yielding the final result.

It should be noted that the computation involves right shifts of signed integers which may be negative. We need those right shifts to map to arithmetic right shifts at machine code level, something which is not guaranteed by the ISO-C standard. However, in practice most compilers do what is desired. In this case I used the Intel compiler on an x64 platform running Windows.

With a 66-entry table of 32-bit words, the maximum absolute error can be reduced to 8.18251e-6, so full s15.16 accuracy is achieved.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>

#define FRAC_BITS_OUT (16)
#define INT_BITS_OUT  (15)
#define FRAC_BITS_IN  (31)
#define INT_BITS_IN   ( 0)

/* count leading zeros: intrinsic or machine instruction on many architectures */
int32_t clz (uint32_t x)
{
    uint32_t n, y;

    n = 31 + (!x);
    if ((y = (x & 0xffff0000U))) { n -= 16;  x = y; }
    if ((y = (x & 0xff00ff00U))) { n -=  8;  x = y; }
    if ((y = (x & 0xf0f0f0f0U))) { n -=  4;  x = y; }
    if ((y = (x & 0xccccccccU))) { n -=  2;  x = y; }
    if ((    (x & 0xFast Fixed Point Pow, Log, Exp and SqrtU))) { n -=  1;         }
    return n;
}

#define LOG2_TBL_SIZE (6)
#define TBL_SIZE      ((1 << LOG2_TBL_SIZE) + 2)

/* for i = [0,65]: log2(1 + i/64) * (1 << 31) */
const uint32_t log2Tab [TBL_SIZE] =
{
    0x00000000, 0x02dcf2d1, 0x05aeb4dd, 0x08759c50, 
    0x0b31fb7d, 0x0de42120, 0x108c588d, 0x132ae9e2, 
    0x15c01a3a, 0x184c2bd0, 0x1acf5e2e, 0x1d49ee4c, 
    0x1fbc16b9, 0x22260fb6, 0x24880f56, 0x26e2499d, 
    0x2934f098, 0x2b803474, 0x2dc4439b, 0x30014ac6, 
    0x32377512, 0x3466ec15, 0x368fd7ee, 0x38b25f5a, 
    0x3acea7c0, 0x3ce4d544, 0x3ef50ad2, 0x40ff6a2e, 
    0x43041403, 0x450327eb, 0x46fcc47a, 0x48f10751, 
    0x4ae00d1d, 0x4cc9f1ab, 0x4eaecfeb, 0x508ec1fa, 
    0x5269e12f, 0x5440461c, 0x5612089a, 0x57df3fd0, 
    0x59a80239, 0x5b6c65aa, 0x5d2c7f59, 0x5ee863e5, 
    0x60a02757, 0x6253dd2c, 0x64039858, 0x65af6b4b, 
    0x675767f5, 0x68fb9fce, 0x6a9c23d6, 0x6c39049b, 
    0x6dd2523d, 0x6f681c73, 0x70fa728c, 0x72896373, 
    0x7414fdb5, 0x759d4f81, 0x772266ad, 0x78a450b8, 
    0x7a231ace, 0x7b9ed1c7, 0x7d17822f, 0x7e8d3846, 
    0x80000000, 0x816fe50b
};

#define RND_SHIFT     (31 - FRAC_BITS_OUT)
#define RND_CONST     ((1 << RND_SHIFT) / 2)
#define RND_ADJUST    (0x10d) /* established heuristically */

/* 
   compute log2(x) in s15.16 format, where x is in s0.31 format
   maximum absolute error 8.18251e-6 @ 0x20352845 (0.251622232)
*/   
int32_t fixed_log2 (int32_t x)
{
    int32_t f1, f2, dx, a, b, approx, lz, i, idx;
    uint32_t t;

    /* x = 2**i * (1 + f), 0 <= f < 1. Find i */
    lz = clz (x);
    i = INT_BITS_IN - lz;
    /* normalize f */
    t = (uint32_t)x << (lz + 1);
    /* index table of log2 values using LOG2_TBL_SIZE msbs of fraction */
    idx = t >> (32 - LOG2_TBL_SIZE);
    /* difference between argument and smallest sampling point */
    dx = t - (idx << (32 - LOG2_TBL_SIZE));
    /* fit parabola through closest three sampling points; find coeffs a, b */
    f1 = (log2Tab[idx+1] - log2Tab[idx]);
    f2 = (log2Tab[idx+2] - log2Tab[idx]);
    a = f2 - (f1 << 1);
    b = (f1 << 1) - a;
    /* find function value for argument by computing ((a*dx+b)*dx) */
    approx = (int32_t)((((int64_t)a)*dx) >> (32 - LOG2_TBL_SIZE)) + b;
    approx = (int32_t)((((int64_t)approx)*dx) >> (32 - LOG2_TBL_SIZE + 1));
    approx = log2Tab[idx] + approx;
    /* round fractional part of result */
    approx = (((uint32_t)approx) + RND_CONST + RND_ADJUST) >> RND_SHIFT;
    /* combine integer and fractional parts of result */
    return (i << FRAC_BITS_OUT) + approx;
}

/* convert from s15.16 fixed point to double-precision floating point */
double fixed_to_float_s15_16 (int32_t a)
{
    return a / 65536.0;
}

/* convert from s0.31 fixed point to double-precision floating point */
double fixed_to_float_s0_31 (int32_t a)
{
    return a / (65536.0 * 32768.0);
}

int main (void)
{
    double a, res, ref, err, maxerr = 0.0;
    int32_t x, start, end;

    start = 0x00000001;
    end =   0x7fffffff;
    printf ("testing fixed_log2 with inputs in [%17.10e, %17.10e)\n",  
            fixed_to_float_s0_31 (start), fixed_to_float_s0_31 (end));

    for (x = start; x < end; x++) {
        a = fixed_to_float_s0_31 (x);
        ref = log2 (a);
        res = fixed_to_float_s15_16 (fixed_log2 (x));
        err = fabs (res - ref);
        if (err > maxerr) {
            maxerr = err;
        }
    }

    printf ("max. err = %g\n", maxerr);
    return EXIT_SUCCESS;
}

For completeness, I am showing the minimax polynomial approximation below. The coefficients for such approximations can be generated by several tools such as Maple, Mathematica, Sollya or with homebrew code using the Remez algorithm, which is what I used here. The code below shows the original floating-point coefficients, the dynamic scaling used to maximize accuracy in intermediate computation, and the heuristic adjustments applied to mitigate the impact of non-rounding fixed-point arithmetic.

A typical approach for computation of log2(x) is to use x = 2ⁱ * (1+f) and use approximation of log2(1+f) for (1+f) in [√½, √2], which means that we use a polynomial p(f) on the primary approximation interval [√½-1, √2-1].

The intermediate computation scales up operands as far as feasible for improved accuracy under the restriction that we want to use a 32-bit mulhi operation as its basic building block, as this is a native instruction on many 32-bit architectures, accessible either via inline machine code or as an intrinsic. As in the table-based code, there are right shifts of signed data which may be negative, and such right shifts must map to arithmetic right shifts, something that ISO-C doesn't guarantee but most C compilers do.

I managed to get the maximum absolute error for this variant down to 1.11288e-5, so almost full s15.16 accuracy but slightly worse than for the table-based variant. I suspect I should have added one additional term to the polynomial.

/* on 32-bit architectures, there is often an instruction/intrinsic for this */
int32_t mulhi (int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> 32);
}

#define RND_SHIFT  (25 - FRAC_BITS_OUT)
#define RND_CONST  ((1 << RND_SHIFT) / 2)
#define RND_ADJUST (-2) /* established heuristically */

/* 
    compute log2(x) in s15.16 format, where x is in s0.31 format
    maximum absolute error 1.11288e-5 @ 0x5a82689f (0.707104757)
*/   
int32_t fixed_log2 (int32_t x)
{
    int32_t lz, i, f, p, approx;
    uint32_t t;
    /* x = 2**i * (1 + f), 0 <= f < 1. Find i */
    lz = clz (x);
    i = INT_BITS_IN - lz;
    /* force (1+f) into range [sqrt(0.5), sqrt(2)] */
    t = (uint32_t)x << lz;    
    if (t > (uint32_t)(1.414213562 * (1U << 31))) {
        i++;
        t = t >> 1;
    }
    /* compute log2(1+f) for f in [-0.2929, 0.4142] */
    f = t - (1U << 31);
    p =              + (int32_t)(-0.206191055 * (1U << 31) -  1);
    p = mulhi (p, f) + (int32_t)( 0.318199910 * (1U << 30) - 18);
    p = mulhi (p, f) + (int32_t)(-0.366491705 * (1U << 29) + 22);
    p = mulhi (p, f) + (int32_t)( 0.479811855 * (1U << 28) -  2);
    p = mulhi (p, f) + (int32_t)(-0.721206390 * (1U << 27) + 37);
    p = mulhi (p, f) + (int32_t)( 0.442701618 * (1U << 26) + 35);
    p = mulhi (p, f) + (f >> (31 - 25));
    /* round fractional part of the result */
    approx = (p + RND_CONST + RND_ADJUST) >> RND_SHIFT;
    /* combine integer and fractional parts of result */
    return (i << FRAC_BITS_OUT) + approx;
}

Something really weird with C++ exp() function

With this code

#include <iostream>
#include <cmath>
#include <cstdio>
using std::cout;

int mySqrt(int x) {
    // For x = 2147395600
    cout << exp(0.5*log(x)) << "  ";     // It prints 46340
    return exp(0.5*log(x));              // But returns 46349
}

int main(void) {
    std::cout << mySqrt(2147395600) << "\n";
    printf("%.30f\n", exp(0.5*log(2147395600)));
    return 0;
}

I got output:

46340  46339
46339.999999999978172127157449722290

It seems the value is rounded when passed to cout while truncated when converted to int.

Fast Fixed Point Pow, Log, Exp and Sqrt