Efficient implementation of log2(__m256d) in AVX2

Finally, here is my best result, which on a Ryzen 1800X @ 3.6GHz gives about 0.8 billion logarithms per second (200 million vectors of 4 logarithms each) in a single thread, and is accurate to within a few of the last mantissa bits. Spoiler: see the end for how to increase performance to 0.87 billion logarithms per second.

Special cases:
Negative numbers, negative infinity and NaNs with the sign bit set are treated as if they were very close to 0 (they produce garbage: large negative "logarithm" values). Positive infinity and NaNs with the sign bit clear produce a logarithm around 1024. If you don't like how the special cases are treated, one option is to add code that checks for them and does whatever suits you better. This will make the computation slower.
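
If you do want explicit handling, a minimal sketch of a wrapper is below (my addition, not part of the measured code). It assumes the Log2TblPlus() defined in the listing that follows plus <cmath> for std::nan(); adjust the policy to taste (here zero also maps to NaN rather than to negative infinity).

__m256d Log2Checked(__m256d x) {
  // Lanes that are not strictly positive (negative, +/-0.0, NaN) get patched to NaN.
  const __m256d validMask = _mm256_cmp_pd(x, _mm256_setzero_pd(), _CMP_GT_OQ);
  const __m256d nanVec = _mm256_set1_pd(std::nan(""));
  const __m256d r = Log2TblPlus(x);              // defined below
  return _mm256_blendv_pd(nanVec, r, validMask); // keep r only where x > 0
}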

#include <immintrin.h>
#include <cstdint>
#include <cstring>
#include <cmath>

namespace {
  // The limit is 19 because we process only the high 32 bits of the doubles, and out of
  // the 20 mantissa bits there, 1 bit is used for rounding.
  constexpr uint8_t cnLog2TblBits = 10; // 1024 numbers times 8 bytes = 8KB.
  constexpr uint16_t cZeroExp = 1023;
  const __m256i gDoubleNotExp = _mm256_set1_epi64x(~(0x7ffULL << 52));
  const __m256d gDoubleExp0 = _mm256_castsi256_pd(_mm256_set1_epi64x(1023ULL << 52));
  const __m256i cAvxExp2YMask = _mm256_set1_epi64x(
    ~((1ULL << (52 - cnLog2TblBits)) - 1));
  const __m256d cPlusBit = _mm256_castsi256_pd(_mm256_set1_epi64x(
    1ULL << (52 - cnLog2TblBits - 1)));
  const __m256d gCommMul1 = _mm256_set1_pd(2.0 / 0.693147180559945309417); // 2.0/ln(2)
  const __m256i gHigh32Permute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);
  const __m128i cSseMantTblMask = _mm_set1_epi32((1 << cnLog2TblBits) - 1);
  const __m128i gExpNorm0 = _mm_set1_epi32(1023);
  // plus |cnLog2TblBits|th highest mantissa bit
  double gPlusLog2Table[1 << cnLog2TblBits];
} // anonymous namespace

void InitLog2Table() {
  for (uint32_t i = 0; i < (1 << cnLog2TblBits); i++) {
    const uint64_t iZp = (uint64_t(cZeroExp) << 52)
      | (uint64_t(i) << (52 - cnLog2TblBits)) | (1ULL << (52 - cnLog2TblBits - 1));
    double zp;
    std::memcpy(&zp, &iZp, sizeof(zp)); // bit-cast; avoids the strict-aliasing UB of a reinterpret_cast
    const double l2zp = std::log2(zp);
    gPlusLog2Table[i] = l2zp;
  }
}

__m256d __vectorcall Log2TblPlus(__m256d x) {
  const __m256d zClearExp = _mm256_and_pd(_mm256_castsi256_pd(gDoubleNotExp), x);
  const __m256d z = _mm256_or_pd(zClearExp, gDoubleExp0);

  const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(
    _mm256_castpd_si256(x), gHigh32Permute));
  // This requires that x is non-negative, because the sign bit is not cleared before
  // computing the exponent.
  const __m128i exps32 = _mm_srai_epi32(high32, 20);
  const __m128i normExps = _mm_sub_epi32(exps32, gExpNorm0);

  // Compute y as approximately equal to log2(z)
  const __m128i indexes = _mm_and_si128(cSseMantTblMask,
    _mm_srai_epi32(high32, 20 - cnLog2TblBits));
  const __m256d y = _mm256_i32gather_pd(gPlusLog2Table, indexes,
    /*number of bytes per item*/ 8);
  // Compute A as z/exp2(y)
  const __m256d exp2_Y = _mm256_or_pd(
    cPlusBit, _mm256_and_pd(z, _mm256_castsi256_pd(cAvxExp2YMask)));

  // Calculate t=(A-1)/(A+1). Both numerator and denominator would be divided by exp2_Y
  const __m256d tNum = _mm256_sub_pd(z, exp2_Y);
  const __m256d tDen = _mm256_add_pd(z, exp2_Y);

  // Compute the first polynomial term from "More efficient series" of
  // https://en.wikipedia.org/wiki/Logarithm#Power_series
  const __m256d t = _mm256_div_pd(tNum, tDen);

  const __m256d log2_z = _mm256_fmadd_pd(t, gCommMul1, y);

  // Leading integer part for the logarithm
  const __m256d leading = _mm256_cvtepi32_pd(normExps);

  const __m256d log2_x = _mm256_add_pd(log2_z, leading);
  return log2_x;
}
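
For completeness, a minimal usage sketch (my addition, not part of the original answer): InitLog2Table() must be called once before the first Log2TblPlus() call; the comparison against std::log2 is just a sanity check.

#include <cstdio>

int main() {
  InitLog2Table();                         // fill gPlusLog2Table once
  const double in[4] = {0.125, 1.0, 3.5, 1e300};
  const __m256d x = _mm256_loadu_pd(in);
  double out[4];
  _mm256_storeu_pd(out, Log2TblPlus(x));
  for (int i = 0; i < 4; i++)
    std::printf("log2(%g): table=%.17g std=%.17g\n", in[i], out[i], std::log2(in[i]));
  return 0;
}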

It uses a combination of a lookup-table approach and a 1st-degree polynomial, mostly as described on Wikipedia (the link is in the code comments). I can afford to allocate 8KB of L1 cache here (half of the 16KB of L1 cache available per logical core), because logarithm computation is really the bottleneck for me and there is not much else that needs L1 cache.

However, if you need more L1 cache for other purposes, you can decrease the amount of cache used by the logarithm algorithm by reducing cnLog2TblBits to e.g. 5, at the expense of decreasing the accuracy of the logarithm computation.

Or, to keep the accuracy high, you can instead increase the number of polynomial terms by adding:

namespace {
  // ...
  const __m256d gCoeff1 = _mm256_set1_pd(1.0 / 3);
  const __m256d gCoeff2 = _mm256_set1_pd(1.0 / 5);
  const __m256d gCoeff3 = _mm256_set1_pd(1.0 / 7);
  const __m256d gCoeff4 = _mm256_set1_pd(1.0 / 9);
  const __m256d gCoeff5 = _mm256_set1_pd(1.0 / 11);
}

And then changing the tail of Log2TblPlus() after the line const __m256d t = _mm256_div_pd(tNum, tDen); to:

  const __m256d t2 = _mm256_mul_pd(t, t); // t**2
  const __m256d t3 = _mm256_mul_pd(t, t2); // t**3
  const __m256d terms01 = _mm256_fmadd_pd(gCoeff1, t3, t);
  const __m256d t5 = _mm256_mul_pd(t3, t2); // t**5
  const __m256d terms012 = _mm256_fmadd_pd(gCoeff2, t5, terms01);
  const __m256d t7 = _mm256_mul_pd(t5, t2); // t**7
  const __m256d terms0123 = _mm256_fmadd_pd(gCoeff3, t7, terms012);
  const __m256d t9 = _mm256_mul_pd(t7, t2); // t**9
  const __m256d terms01234 = _mm256_fmadd_pd(gCoeff4, t9, terms0123);
  const __m256d t11 = _mm256_mul_pd(t9, t2); // t**11
  const __m256d terms012345 = _mm256_fmadd_pd(gCoeff5, t11, terms01234);

  const __m256d log2_z = _mm256_fmadd_pd(terms012345, gCommMul1, y);

Then the // Leading integer part for the logarithm comment and the rest follow unchanged.

Normally you don't need that many terms, even for a table with only a few bits; I just provided the coefficients and computations for reference. It's likely that with cnLog2TblBits==5 you won't need anything beyond terms012. But I haven't made such measurements; you need to experiment to find what suits your needs.

Obviously, the fewer polynomial terms you compute, the faster the computation.


EDIT: the question "In what situation would the AVX2 gather instructions be faster than individually loading the data?" suggests that you may get a performance improvement if

const __m256d y = _mm256_i32gather_pd(gPlusLog2Table, indexes,
/*number of bytes per item*/ 8);

is replaced by

const __m256d y = _mm256_set_pd(gPlusLog2Table[indexes.m128i_u32[3]],
                                gPlusLog2Table[indexes.m128i_u32[2]],
                                gPlusLog2Table[indexes.m128i_u32[1]],
                                gPlusLog2Table[indexes.m128i_u32[0]]);

For my implementation it saves about 1.5 cycles, reducing the total cycle count to compute 4 logarithms from 18 to 16.5, so the performance rises to 0.87 billion logarithms per second. I'm leaving the current implementation as is because it's more idiomatic and should be faster once CPUs start doing gather operations right (with coalescing, like GPUs do).
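
Note that the .m128i_u32 member used above is MSVC-specific. With GCC/Clang a portable way to do the same thing is to store the indexes to a small array first; a sketch (same indexes and gPlusLog2Table as above, the idx name is mine):

alignas(16) uint32_t idx[4];
_mm_store_si128(reinterpret_cast<__m128i*>(idx), indexes);
const __m256d y = _mm256_set_pd(gPlusLog2Table[idx[3]], gPlusLog2Table[idx[2]],
                                gPlusLog2Table[idx[1]], gPlusLog2Table[idx[0]]);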

EDIT2: on a Ryzen CPU (but not on Intel) you can get a little more speedup (about 0.5 cycle) by replacing

const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(
_mm256_castpd_si256(x), gHigh32Permute));

with

const __m128 hiLane = _mm_castpd_ps(_mm256_extractf128_pd(x, 1));
const __m128 loLane = _mm_castpd_ps(_mm256_castpd256_pd128(x));
const __m128i high32 = _mm_castps_si128(_mm_shuffle_ps(loLane, hiLane,
                                                       _MM_SHUFFLE(3, 1, 3, 1)));

AVX log intrinsics (_mm256_log_ps) missing in g++-4.8?

As indicated in the comments on your question, that intrinsic doesn't map to a single AVX instruction; it is an Intel SVML (Short Vector Math Library) function that Intel's compiler exposes through the intrinsics interface. The implementation uses many underlying instructions, as a logarithm isn't a trivial operation.

If you'd like to use a non-Intel compiler but want a fast logarithm implementation, you might check out this open-source implementation of sin(), cos(), exp(), and log() functions using AVX. They are based on an earlier SSE2 version of the same functions.

simd: round up (ceil) the log2 of an input, while clamping negative logs to zero?

since the range will be limited (such as [0, 16] in the extreme case)

Oh, so this doesn't need to work for numbers greater than INT_MAX, up to UINT_MAX? That's much easier than the problem as stated at the top of the question. Yeah, just _mm_ceil_ps, use a signed conversion to epi32 (int32_t), and use _mm_min_epi32 for the upper limit and probably _mm_max_epi32 for the lower limit. (Only one instruction instead of shift/and.)

Or possibly use _mm_sub_ps to range-shift to -16..0 and _mm_cvttps_epi32 to truncate (towards zero, which is upwards for these negative values), then integer-subtract from zero. _mm_ceil_ps costs 2 uops on most CPUs, so that's about break-even, although it trades an FP operation for an integer one. But it requires more setup.

Integer min/max are cheaper (lower latency, and better throughput) than FP, so prefer to clamp after the conversion. Out-of-range floats convert to INT_MIN (high bit set, other bits zero; what Intel calls the "integer indefinite" value), so they will clamp to 0.
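
Putting those pieces together, a sketch of the ceil + convert + clamp idea (my code, SSE4.1; log2f_vec is assumed to already hold the FP log2 results, and [0, 16] is the clamp range from the question):

__m128i ceil_clamp_log2(__m128 log2f_vec) {
    __m128 up = _mm_ceil_ps(log2f_vec);           // round up to integer-valued floats
    __m128i i = _mm_cvtps_epi32(up);              // values are exact, so the rounding mode doesn't matter
    i = _mm_max_epi32(i, _mm_setzero_si128());    // negative logs (and INT_MIN from NaN/out-of-range) -> 0
    i = _mm_min_epi32(i, _mm_set1_epi32(16));     // clamp to the stated upper limit
    return i;
}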


If you have a lot of this to do in a loop that doesn't do other FP computation, change the MXCSR rounding mode for this loop to round towards +Inf. Use _mm_cvtps_epi32 (which uses the current FP rounding mode, like lrint / (int)nearbyint) instead of ceil + cvtt (truncation).
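
A sketch of that rounding-mode variant (my code; log2f_vec is again a placeholder for the per-iteration FP log2 vector):

unsigned old_mode = _MM_GET_ROUNDING_MODE();
_MM_SET_ROUNDING_MODE(_MM_ROUND_UP);              // round towards +Inf for this loop
// ... inside the loop:
__m128i i = _mm_cvtps_epi32(log2f_vec);           // uses the current rounding mode, i.e. rounds up
i = _mm_min_epi32(_mm_max_epi32(i, _mm_setzero_si128()), _mm_set1_epi32(16));
// ... after the loop:
_MM_SET_ROUNDING_MODE(old_mode);                  // restore round-to-nearest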



This use-case: ceil(log2(float))

You could just pull that out of the FP bit pattern directly, and round up based on a non-zero mantissa. Binary floating point already contains a power-of-2 exponent field, so you just need to extract that with a bit of massaging.

Like _mm_and_ps / _mm_cmpeq_epi32 / _mm_add_epi32 to add the -1 compare result for FP values with a zero mantissa, so you treat powers of 2 differently from anything higher.
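
My reading of that idea as a sketch (not from the original answer; assumes positive, normal inputs, SSE4.1 for the max):

__m128i ceil_log2_cmp(__m128 v) {
    __m128i bits = _mm_castps_si128(v);
    __m128i mant = _mm_and_si128(bits, _mm_set1_epi32(0x007fffff));   // mantissa field
    __m128i exp  = _mm_srli_epi32(bits, 23);                          // biased exponent (sign bit assumed 0)
    exp = _mm_sub_epi32(exp, _mm_set1_epi32(127 - 1));                // unbias and pre-round up by 1
    __m128i pow2 = _mm_cmpeq_epi32(mant, _mm_setzero_si128());        // -1 where the mantissa is all-zero
    exp = _mm_add_epi32(exp, pow2);                                   // exact powers of 2: undo the +1
    return _mm_max_epi32(exp, _mm_setzero_si128());                   // clamp negative results to 0
}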

Should be faster than computing an FP log base e with a fractional part, even if it's only a quick approximation. Values smaller than 1.0, whose unbiased exponent is negative, may need some extra handling.

Also, since you want all four indices, probably faster to just store to an array of 4 uint32_t values and access it, instead of using movd + 3x pextrd.

An even better way to round up to the next exponent for floats with a non-zero mantissa is to simply do an integer add of 0x007fffff to the bit-pattern. (23 set bits: https://en.wikipedia.org/wiki/Single-precision_floating-point_format).

// We round up the exponent separately from unbiasing.
// Might be possible to do better.
__m128i ceil_log2_not_fully_optimized(__m128 v)
{
    // round up to the next power of 2 (exponent value) by carry-out from mantissa field into exponent
    __m128i floatbits = _mm_add_epi32(_mm_castps_si128(v), _mm_set1_epi32(0x007fffff));

    __m128i exp = _mm_srai_epi32(floatbits, 23);    // arithmetic shift so negative numbers stay negative
    exp = _mm_sub_epi32(exp, _mm_set1_epi32(127));  // undo the bias
    exp = _mm_max_epi32(exp, _mm_setzero_si128());  // clamp negative numbers to zero
    return exp;
}

If the exponent field was already all-ones, that means +Inf with an all-zero mantissa, else NaN. So it's only possible for carry propagation from the first add to flip the sign bit if the input was already NaN. +Inf gets treated as one exponent higher than FLT_MAX. 0.0 and 0.01 should both come out to 0, if I got this right.

According to GCC on Godbolt, I think so (https://godbolt.org/z/9G9orWj16): GCC doesn't fully constant-propagate through it, so we can actually see the input to pmaxsd, and see that 0.0 and 0.01 come out to max(0, -127) and max(0, -3) = 0 each. And 3.0 and 4.0 both come out to max(0, 2) = 2.


We can maybe even combine that +0x7ff... idea with adding a negative number to the exponent field to undo the bias.

Or, to get the carry-out into the sign bit correct, subtract from it instead, with a 1 in the mantissa field so that an all-zero mantissa will propagate a borrow and subtract one more from the exponent field? But small exponents below the bias could still carry/borrow out and flip the sign bit. That might be OK, though, if we're going to clamp such values to zero anyway, even when they come out as small positive numbers instead of negative ones.

I haven't worked out the details of this yet; if we need to handle the original input being zero, this might be a problem. If we can assume the original sign bit was cleared but log(x) might be negative (i.e. the exponent field is below the bias), this should work just fine: carry-out of the exponent field into the sign bit is exactly what we want in that case, so srai keeps it negative and max chooses the 0.

    // round up to the next power of 2 (exponent value) while undoing the bias
const uint32_t f32_unbias = ((-127)<<23) + 0x007fffffU;
???
profit

Efficient (on Ryzen) way to extract the odd elements of a __m256 into a __m128?

On Intel, your code would be optimal. One 1-uop instruction is the best you will get. (Except you might want to use vpermps to avoid any risk of an int/FP bypass delay, if your input vector was created by a pd instruction rather than a load or something. Using the result of an FP shuffle as an input to integer instructions is usually fine on Intel, but I'm less sure about feeding the result of an FP instruction to an integer shuffle.)

Although if tuning for Intel, you might try changing the surrounding code so you can shuffle into the bottom 64-bits of each 128b lane, to avoid using a lane-crossing shuffle. (Then you could just use vshufps ymm, or if tuning for KNL, vpermilps since 2-input vshufps is slower.)

With AVX512, there's _mm256_cvtepi64_epi32 (vpmovqd) which packs elements across lanes, with truncation.
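
For reference, a sketch of using it for this odd-element extraction (my code; requires AVX-512VL + AVX-512F):

__m128 odd_avx512(__m256 v) {
    __m256i q = _mm256_srli_epi64(_mm256_castps_si256(v), 32);   // move each odd float to the low half of its qword
    return _mm_castsi128_ps(_mm256_cvtepi64_epi32(q));           // vpmovqd packs the low dwords across lanes
}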


On Ryzen, lane-crossing shuffles are slow. Agner Fog doesn't have numbers for vpermd, but he lists vpermps (which probably uses the same hardware internally) at 3 uops, 5c latency, one per 4c throughput.

vextractf128 xmm, ymm, 1 is very efficient on Ryzen (1c latency, 0.33c throughput), not surprising since it tracks 256b registers as two 128b halves already. shufps is also efficient (1c latency, 0.5c throughput), and will let you shuffle the two 128b registers into the result you want.

This also saves you 2 registers for the 2 vpermps shuffle masks you don't need anymore.

So I'd suggest:

__m256d x = /* computed here */;

// Tuned for Ryzen. Sub-optimal on Intel
__m128 hi = _mm_castpd_ps(_mm256_extractf128_pd(x, 1));
__m128 lo = _mm_castpd_ps(_mm256_castpd256_pd128(x));
__m128 odd = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3,1,3,1));
__m128 even = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2,0,2,0));

On Intel, using 3 shuffles instead of 2 gives you 2/3rds of the optimal throughput, with 1c extra latency for the first result.

Fastest Implementation of Exponential Function Using AVX

The exp function from avx_mathfun uses range reduction in combination with a Chebyshev approximation-like polynomial to compute 8 exp-s in parallel with AVX instructions. Use the right compiler settings to make sure that addps and mulps are fused to FMA instructions, where possible.

It is quite straightforward to adapt the original exp code from avx_mathfun to portable (across different compilers) C / AVX2 intrinsics code. The original code uses gcc-style alignment attributes and ingenious macros. The modified code, which uses the standard _mm256_set1_ps() instead, is shown below the small test code and its output. The modified code requires AVX2.

The following code is used for a simple test:

int main(){
    int i;
    float xv[8];
    float yv[8];
    __m256 x = _mm256_setr_ps(1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f);
    __m256 y = exp256_ps(x);
    _mm256_storeu_ps(xv, x);   /* unaligned store: xv/yv are not guaranteed to be 32-byte aligned */
    _mm256_storeu_ps(yv, y);

    for (i = 0; i < 8; i++){
        printf("i = %i, x = %e, y = %e \n", i, xv[i], yv[i]);
    }
    return 0;
}

The output seems to be ok:

i = 0, x = 1.000000e+00, y = 2.718282e+00 
i = 1, x = 2.000000e+00, y = 7.389056e+00
i = 2, x = 3.000000e+00, y = 2.008554e+01
i = 3, x = 4.000000e+00, y = 5.459815e+01
i = 4, x = 5.000000e+00, y = 1.484132e+02
i = 5, x = 6.000000e+00, y = 4.034288e+02
i = 6, x = 7.000000e+00, y = 1.096633e+03
i = 7, x = 8.000000e+00, y = 2.980958e+03

The modified code (AVX2) is:

#include <stdio.h>
#include <immintrin.h>
/* gcc -O3 -m64 -Wall -mavx2 -march=broadwell expc.c */

__m256 exp256_ps(__m256 x) {
/* Modified code. The original code is here: https://github.com/reyoung/avx_mathfun

AVX implementation of exp
Based on "sse_mathfun.h", by Julien Pommier
http://gruntthepeon.free.fr/ssemath/
Copyright (C) 2012 Giovanni Garberoglio
Interdisciplinary Laboratory for Computational Science (LISC)
Fondazione Bruno Kessler and University of Trento
via Sommarive, 18
I-38123 Trento (Italy)
This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
(this is the zlib license)
*/
/*
To increase compatibility across different compilers, the original code is
converted to plain AVX2 intrinsics code without ingenious macros,
gcc-style alignment attributes, etc. The modified code requires AVX2.
*/
    __m256 exp_hi = _mm256_set1_ps(88.3762626647949f);
    __m256 exp_lo = _mm256_set1_ps(-88.3762626647949f);

    __m256 cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341);
    __m256 cephes_exp_C1 = _mm256_set1_ps(0.693359375);
    __m256 cephes_exp_C2 = _mm256_set1_ps(-2.12194440e-4);

    __m256 cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
    __m256 cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
    __m256 cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
    __m256 cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
    __m256 cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
    __m256 cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
    __m256 tmp = _mm256_setzero_ps(), fx;
    __m256i imm0;
    __m256 one = _mm256_set1_ps(1.0f);

    x = _mm256_min_ps(x, exp_hi);
    x = _mm256_max_ps(x, exp_lo);

    /* express exp(x) as exp(g + n*log(2)) */
    fx = _mm256_mul_ps(x, cephes_LOG2EF);
    fx = _mm256_add_ps(fx, _mm256_set1_ps(0.5f));
    tmp = _mm256_floor_ps(fx);
    __m256 mask = _mm256_cmp_ps(tmp, fx, _CMP_GT_OS);
    mask = _mm256_and_ps(mask, one);
    fx = _mm256_sub_ps(tmp, mask);
    tmp = _mm256_mul_ps(fx, cephes_exp_C1);
    __m256 z = _mm256_mul_ps(fx, cephes_exp_C2);
    x = _mm256_sub_ps(x, tmp);
    x = _mm256_sub_ps(x, z);
    z = _mm256_mul_ps(x, x);

    __m256 y = cephes_exp_p0;
    y = _mm256_mul_ps(y, x);
    y = _mm256_add_ps(y, cephes_exp_p1);
    y = _mm256_mul_ps(y, x);
    y = _mm256_add_ps(y, cephes_exp_p2);
    y = _mm256_mul_ps(y, x);
    y = _mm256_add_ps(y, cephes_exp_p3);
    y = _mm256_mul_ps(y, x);
    y = _mm256_add_ps(y, cephes_exp_p4);
    y = _mm256_mul_ps(y, x);
    y = _mm256_add_ps(y, cephes_exp_p5);
    y = _mm256_mul_ps(y, z);
    y = _mm256_add_ps(y, x);
    y = _mm256_add_ps(y, one);

    /* build 2^n */
    imm0 = _mm256_cvttps_epi32(fx);
    imm0 = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
    imm0 = _mm256_slli_epi32(imm0, 23);
    __m256 pow2n = _mm256_castsi256_ps(imm0);
    y = _mm256_mul_ps(y, pow2n);
    return y;
}

int main(){
    int i;
    float xv[8];
    float yv[8];
    __m256 x = _mm256_setr_ps(1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f);
    __m256 y = exp256_ps(x);
    _mm256_storeu_ps(xv, x);   /* unaligned store: xv/yv are not guaranteed to be 32-byte aligned */
    _mm256_storeu_ps(yv, y);

    for (i = 0; i < 8; i++){
        printf("i = %i, x = %e, y = %e \n", i, xv[i], yv[i]);
    }
    return 0;
}
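
As an aside on the earlier remark about addps/mulps fusing into FMAs: the polynomial section can also be written with explicit FMA intrinsics, so it does not depend on the compiler's contraction settings. A sketch (my addition; same cephes_exp_* variables, x, z and one as in the listing above, FMA3 required):

__m256 y = cephes_exp_p0;
y = _mm256_fmadd_ps(y, x, cephes_exp_p1);   /* y = y*x + p1 */
y = _mm256_fmadd_ps(y, x, cephes_exp_p2);
y = _mm256_fmadd_ps(y, x, cephes_exp_p3);
y = _mm256_fmadd_ps(y, x, cephes_exp_p4);
y = _mm256_fmadd_ps(y, x, cephes_exp_p5);
y = _mm256_fmadd_ps(y, z, x);               /* y = y*z + x */
y = _mm256_add_ps(y, one);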



As @Peter Cordes points out, it should be possible to replace the _mm256_floor_ps(fx + 0.5f) with _mm256_round_ps(fx). Moreover, the mask = _mm256_cmp_ps(tmp, fx, _CMP_GT_OS); line and the two lines after it seem to be redundant. Further optimizations are possible by combining cephes_exp_C1 and cephes_exp_C2 into inv_LOG2EF. This leads to the following code, which has not been tested thoroughly!

#include <stdio.h>
#include <immintrin.h>
#include <math.h>
/* gcc -O3 -m64 -Wall -mavx2 -march=broadwell expc.c -lm */

__m256 exp256_ps(__m256 x) {
/* Modified code from this source: https://github.com/reyoung/avx_mathfun

AVX implementation of exp
Based on "sse_mathfun.h", by Julien Pommier
http://gruntthepeon.free.fr/ssemath/
Copyright (C) 2012 Giovanni Garberoglio
Interdisciplinary Laboratory for Computational Science (LISC)
Fondazione Bruno Kessler and University of Trento
via Sommarive, 18
I-38123 Trento (Italy)
This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
(this is the zlib license)

*/
/*
To increase compatibility across different compilers, the original code is
converted to plain AVX2 intrinsics code without ingenious macros,
gcc-style alignment attributes, etc.
Moreover, the part "express exp(x) as exp(g + n*log(2))" has been significantly simplified.
This modified code is not thoroughly tested!
*/


    __m256 exp_hi = _mm256_set1_ps(88.3762626647949f);
    __m256 exp_lo = _mm256_set1_ps(-88.3762626647949f);

    __m256 cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341f);
    __m256 inv_LOG2EF = _mm256_set1_ps(0.693147180559945f);

    __m256 cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
    __m256 cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
    __m256 cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
    __m256 cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
    __m256 cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
    __m256 cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
    __m256 fx;
    __m256i imm0;
    __m256 one = _mm256_set1_ps(1.0f);

    x = _mm256_min_ps(x, exp_hi);
    x = _mm256_max_ps(x, exp_lo);

    /* express exp(x) as exp(g + n*log(2)) */
    fx = _mm256_mul_ps(x, cephes_LOG2EF);
    fx = _mm256_round_ps(fx, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    __m256 z = _mm256_mul_ps(fx, inv_LOG2EF);
    x = _mm256_sub_ps(x, z);
    z = _mm256_mul_ps(x, x);

    __m256 y = cephes_exp_p0;
    y = _mm256_mul_ps(y, x);
    y = _mm256_add_ps(y, cephes_exp_p1);
    y = _mm256_mul_ps(y, x);
    y = _mm256_add_ps(y, cephes_exp_p2);
    y = _mm256_mul_ps(y, x);
    y = _mm256_add_ps(y, cephes_exp_p3);
    y = _mm256_mul_ps(y, x);
    y = _mm256_add_ps(y, cephes_exp_p4);
    y = _mm256_mul_ps(y, x);
    y = _mm256_add_ps(y, cephes_exp_p5);
    y = _mm256_mul_ps(y, z);
    y = _mm256_add_ps(y, x);
    y = _mm256_add_ps(y, one);

    /* build 2^n */
    imm0 = _mm256_cvttps_epi32(fx);
    imm0 = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
    imm0 = _mm256_slli_epi32(imm0, 23);
    __m256 pow2n = _mm256_castsi256_ps(imm0);
    y = _mm256_mul_ps(y, pow2n);
    return y;
}

int main(){
    int i;
    float xv[8];
    float yv[8];
    __m256 x = _mm256_setr_ps(11.0f, -12.0f, 13.0f, -14.0f, 15.0f, -16.0f, 17.0f, -18.0f);
    __m256 y = exp256_ps(x);
    _mm256_storeu_ps(xv, x);   /* unaligned store: xv/yv are not guaranteed to be 32-byte aligned */
    _mm256_storeu_ps(yv, y);

    /* compare exp256_ps with the double precision exp from math.h,
       print the relative error */
    printf("i      x                     y = exp256_ps(x)      double precision exp        relative error\n\n");
    for (i = 0; i < 8; i++){
        printf("i = %i x =%16.9e y =%16.9e exp_dbl =%16.9e rel_err =%16.9e\n",
               i, xv[i], yv[i], exp((double)(xv[i])),
               ((double)(yv[i]) - exp((double)(xv[i]))) / exp((double)(xv[i])));
    }
    return 0;
}

The next table gives an impression of the accuracy at a few points, comparing exp256_ps with the double precision exp from math.h. The relative error is in the last column.

i      x                     y = exp256_ps(x)      double precision exp        relative error

i = 0 x = 1.000000000e+00 y = 2.718281746e+00 exp_dbl = 2.718281828e+00 rel_err =-3.036785947e-08
i = 1 x =-2.000000000e+00 y = 1.353352815e-01 exp_dbl = 1.353352832e-01 rel_err =-1.289636419e-08
i = 2 x = 3.000000000e+00 y = 2.008553696e+01 exp_dbl = 2.008553692e+01 rel_err = 1.672817689e-09
i = 3 x =-4.000000000e+00 y = 1.831563935e-02 exp_dbl = 1.831563889e-02 rel_err = 2.501162103e-08
i = 4 x = 5.000000000e+00 y = 1.484131622e+02 exp_dbl = 1.484131591e+02 rel_err = 2.108215155e-08
i = 5 x =-6.000000000e+00 y = 2.478752285e-03 exp_dbl = 2.478752177e-03 rel_err = 4.380257261e-08
i = 6 x = 7.000000000e+00 y = 1.096633179e+03 exp_dbl = 1.096633158e+03 rel_err = 1.849522682e-08
i = 7 x =-8.000000000e+00 y = 3.354626242e-04 exp_dbl = 3.354626279e-04 rel_err =-1.101575118e-08

How to conditionally negate an AVX2 int16_t vector based on another vector of 0 or 1 elements?

There is a quick way to conditionally negate, using _mm256_sign_epi16. The mask is not in the right form, but it can be transformed into the right form by adding 0x7FFF to every element: 0 + 0x7FFF = 0x7FFF is positive, so the element is kept, while 1 + 0x7FFF wraps to 0x8000, which is negative, so the element is negated. So:

__m256i masks = _mm256_add_epi16(beta, _mm256_set1_epi16(0x7FFF));
__m256i res = _mm256_add_epi16(a, _mm256_sign_epi16(b, masks));
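
The same conditional negation can also be expressed without psign, which is sometimes handy (the pattern works with plain SSE2, or with AVX-512 where psign has no EVEX form). A sketch (mine): turn the 0/1 mask into 0/-1 and use the identity (b XOR m) - m, which is b when m == 0 and -b when m == -1.

__m256i m = _mm256_sub_epi16(_mm256_setzero_si256(), beta);   // 0 -> 0, 1 -> 0xFFFF (-1)
__m256i negated = _mm256_sub_epi16(_mm256_xor_si256(b, m), m);
__m256i res2 = _mm256_add_epi16(a, negated);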

AVX2 what is the most efficient way to pack left based on a mask?

AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64bit builds.)

We can use AVX2 vpermps (_mm256_permutevar8x32_ps) (or the integer equivalent, vpermd) to do a lane-crossing variable-shuffle.

We can generate masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we need.

Beware that pdep/pext are very slow on AMD CPUs before Zen 3, like 6 uops / 18 cycle latency and throughput on Ryzen Zen 1 and Zen 2. This implementation will perform horribly on those AMD CPUs. For AMD, you might be best with 128-bit vectors using a pshufb or vpermilps LUT, or some of the AVX2 variable-shift suggestions discussed in comments. Especially if your mask input is a vector mask (not an already packed bitmask from memory).

AMD before Zen2 only has 128-bit vector execution units anyway, and 256-bit lane-crossing shuffles are slow. So 128-bit vectors are very attractive for this on Zen 1. But Zen 2 has 256-bit load/store and execution units. (And still slow microcoded pext/pdep.)


For integer vectors with 32-bit or wider elements: either 1) use _mm256_movemask_ps(_mm256_castsi256_ps(compare_mask)).

Or 2) use _mm256_movemask_epi8 and then change the first PDEP constant from 0x0101010101010101 to 0x0F0F0F0F0F0F0F0F to scatter blocks of 4 contiguous bits. Change the multiply by 0xFFU into expanded_mask |= expanded_mask<<4; or expanded_mask *= 0x11; (Not tested). Either way, use the shuffle mask with VPERMD instead of VPERMPS.

For 64-bit integer or double elements, everything still Just Works; The compare-mask just happens to always have pairs of 32-bit elements that are the same, so the resulting shuffle puts both halves of each 64-bit element in the right place. (So you still use VPERMPS or VPERMD, because VPERMPD and VPERMQ are only available with immediate control operands.)

For 16-bit elements, you might be able to adapt this with 128-bit vectors.

For 8-bit elements, see Efficient sse shuffle mask generation for left-packing byte elements for a different trick, storing the result in multiple possibly-overlapping chunks.



The algorithm:

Start with a constant of packed 3-bit indices, with each position holding its own index, i.e. [ 7 6 5 4 3 2 1 0 ] where each element is 3 bits wide: 0b111'110'101'...'010'001'000.
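
(The step-by-step description is cut off in this excerpt. For reference, here is a sketch of the closely related byte-index formulation of the same pdep/pext + vpermps idea; my code, 64-bit builds only, names are mine.)

#include <immintrin.h>
#include <stdint.h>

__m256 compress256_sketch(__m256 src, unsigned mask /* from _mm256_movemask_ps */) {
    uint64_t expanded = _pdep_u64(mask, 0x0101010101010101ULL);  // each mask bit -> low bit of one byte
    expanded *= 0xFF;                                            // replicate each bit to fill its byte
    const uint64_t identity = 0x0706050403020100ULL;             // byte i holds index i
    uint64_t wanted = _pext_u64(identity, expanded);             // pack the wanted indices contiguously
    __m128i bytevec = _mm_cvtsi64_si128((long long)wanted);
    __m256i shufmask = _mm256_cvtepu8_epi32(bytevec);            // zero-extend bytes to dword indices
    return _mm256_permutevar8x32_ps(src, shufmask);              // left-pack the selected elements
}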


