How to Speed Up This Histogram of Lut Lookups

How to speed up this histogram of LUT lookups?

This is essentially a histogram problem. You're mapping values 16-bit values to 5-bit values with a 32k-entry lookup table, but after that it's just histogramming the LUT results. Like ++counts[ b[a[j]] ];, where counts is bins[i]. See below for more about histograms.

First of all, you can use the smallest possible data-types to increase the density of your LUT (and of the original data). On x86, a zero or sign-extending load of 8-bit or 16-bit data into a register is almost exactly the same cost as a regular 32-bit int load (assuming both hit in cache), and an 8-bit or 16-bit store is also just as cheap as a 32-bit store.

Since your data size exceeds L1 d-cache size (32kiB for all recent Intel designs), and you access it in a scattered pattern, you have a lot to gain from shrinking your cache footprint. (For more x86 perf info, see the x86 tag wiki, especially Agner Fog's stuff).

Since a has less than 65536 entries in each plane, your bin counts will never overflow a 16-bit counter, so bins can be uint16_t as well.

Your copy() makes no sense. Why are you copying into b[32768] instead of having your inner loop use a pointer to the current LUT? You use it read-only. The only reason you'd still want to copy is to copy from int to uin8_t if you can't change the code that produces different LUTs to produce int8_t or uint8_t in the first place.

This version takes advantage of those ideas and a few histogram tricks, and compiles to asm that looks good (Godbolt compiler explorer: gcc6.2 -O3 -march=haswell (AVX2)):

// untested
//#include <algorithm>
#include <stdint.h>

const int PLANES = 1000;
void use_bins(uint16_t bins[PLANES][32]);   // pass the result to an extern function so it doesn't optimize away

                       // 65536 or higher triggers the static_assert
alignas(64) static uint16_t a[PLANES][1000];  // static/global, I guess?

void lut_and_histogram(uint8_t __restrict__ lut[32768])
{

    alignas(16) uint16_t bins[PLANES][32];  // don't zero the whole thing up front: that would evict more data from cache than necessary
                                            // Better would be zeroing the relevant plane of each bin right before using.
                                            // you pay the rep stosq startup overhead more times, though.

    for (int i = 0; i < PLANES;i++) {
        alignas(16) uint16_t tmpbins[4][32] = {0};
        constexpr int a_elems = sizeof(a[0])/sizeof(uint16_t);
        static_assert(a_elems > 1, "someone changed a[] into a* and forgot to update this code");
        static_assert(a_elems <= UINT16_MAX, "bins could overflow");
        const uint16_t *ai = a[i];
        for (int j = 0 ; j<a_elems ; j+=4) { //hashing a[i] to 32 bins.
            // Unrolling to separate bin arrays reduces serial dependencies
            // to avoid bottlenecks when the same bin is used repeatedly.
            // This has to be balanced against using too much L1 cache for the bins.

            // TODO: load a vector of data from ai[j] and unpack it with pextrw.
            // even just loading a uint64_t and unpacking it to 4 uint16_t would help.
            tmpbins[0][ lut[ai[j+0]] ]++;
            tmpbins[1][ lut[ai[j+1]] ]++;
            tmpbins[2][ lut[ai[j+2]] ]++;
            tmpbins[3][ lut[ai[j+3]] ]++;
            static_assert(a_elems % 4 == 0, "unroll factor doesn't divide a element count");
        }

        // TODO: do multiple a[i] in parallel instead of slicing up a single run.
        for (int k = 0 ; k<32 ; k++) {
            // gcc does auto-vectorize this with a short fully-unrolled VMOVDQA / VPADDW x3
            bins[i][k] = tmpbins[0][k] + tmpbins[1][k] +
                         tmpbins[2][k] + tmpbins[3][k];
        }
    }
  
    // do something with bins.  An extern function stops it from optimizing away.
    use_bins(bins);
}

The inner-loop asm looks like this:

.L2:
    movzx   ecx, WORD PTR [rdx]
        add     rdx, 8                    # pointer increment over ai[]
    movzx   ecx, BYTE PTR [rsi+rcx]
    add     WORD PTR [rbp-64272+rcx*2], 1     # memory-destination increment of a histogram element
    movzx   ecx, WORD PTR [rdx-6]
    movzx   ecx, BYTE PTR [rsi+rcx]
    add     WORD PTR [rbp-64208+rcx*2], 1
    ... repeated twice more

With those 32-bit offsets from rbp (instead of 8-bit offsets from rsp, or using another register :/) the code density isn't wonderful. Still, the average instruction length isn't so long that it's likely to bottleneck on instruction decode on any modern CPU.

A variation on multiple bins:

Since you need to do multiple histograms anyway, just do 4 to 8 of them in parallel instead of slicing the bins for a single histogram. The unroll factor doesn't even have to be a power of 2.

That eliminates the need for the bins[i][k] = sum(tmpbins[0..3][k]) loop over k at the end.

Zero bins[i..i+unroll_factor][0..31] right before use, instead of zeroing the whole thing outside the loop. That way all the bins will be hot in L1 cache when you start, and this work can overlap with the more load-heavy work of the inner loop.

Hardware prefetchers can keep track of multiple sequential streams, so don't worry about having a lot more cache misses in loading from a. (Also use vector loads for this, and slice them up after loading).

Other questions with useful answers about histograms:

Methods to vectorise histogram in SIMD? suggests the multiple-bin-arrays and sum at the end trick.
Optimizing SIMD histogram calculation x86 asm loading a vector of a values and extracting to integer registers with pextrb. (In your code, you'd use pextrw / _mm_extract_epi16). With all the load/store uops happening, doing a vector load and using ALU ops to unpack makes sense. With good L1 hit rates, memory uop throughput may be the bottleneck, not memory / cache latency.
How to optimize histogram statistics with neon intrinsics? some of the same ideas: multiple copies of the bins array. It also has an ARM-specific suggestion for doing address calculations in a SIMD vector (ARM can get two scalars from a vector in a single instruction), and laying out the multiple-bins array the opposite way.

AVX2 Gather instructions for the LUT

If you're going to run this on Intel Skylake, you could maybe even do the LUT lookups with AVX2 gather instructions. (On Broadwell, it's probably a break-even, and on Haswell it would lose; they support vpgatherdd (_mm_i32gather_epi32), but don't have as efficient an implementation. Hopefully Skylake avoids hitting the same cache line multiple times when there is overlap between elements).

And yes, you can still gather from an array of uint16_t (with scale factor = 2), even though the smallest gather granularity is 32-bit elements. It means you get garbage in the high half of each 32-bit vector element instead of 0, but that shouldn't matter. Cache-line splits aren't ideal, since we're probably bottlenecked on cache throughput.

Garbage in the high half of gathered elements doesn't matter because you're extracting only the useful 16 bits anyway with pextrw. (And doing the histogram part of the process with scalar code).

You could potentially use another gather to load from the histogram bins, as long as each element comes from a separate slice/plane of histogram bins. Otherwise, if two elements come from the same bin, it would only be incremented by one when you manually scatter the incremented vector back into the histogram (with scalar stores). This kind of conflict detection for scatter stores is why AVX512CD exists. AVX512 does have scatter instructions, as well as gather (already added in AVX2).

AVX512

See page 50 of Kirill Yukhin's slides from 2014 for an example loop that retries until there are no conflicts; but it doesn't show how get_conflict_free_subset() is implemented in terms of __m512i _mm512_conflict_epi32 (__m512i a) (vpconflictd) (which returns a bitmap in each element of all the preceding elements it conflicts with). As @Mysticial points out, a simple implementation is less simple than it would be if the conflict-detect instruction simply produced a mask-register result, instead of another vector.

I searched for but didn't find an Intel-published tutorial/guide on using AVX512CD, but presumably they think using _mm512_lzcnt_epi32 (vplzcntd) on the result of vpconflictd is useful for some cases, because it's also part of AVX512CD.

Maybe you're "supposed" to do something more clever than just skipping all elements that have any conflicts? Maybe to detect a case where a scalar fallback would be better, e.g. all 16 dword elements have the same index? vpbroadcastmw2d broadcasts a mask register to all 32-bit elements of the result, so that lets you line up a mask-register value with the bitmaps in each element from vpconflictd. (And there are already compare, bitwise, and other operations between elements from AVX512F).

Kirill's slides list VPTESTNM{D,Q} (from AVX512F) along with the conflict-detection instructions. It generates a mask from DEST[j] = (SRC1[i+31:i] BITWISE AND SRC2[i+31:i] == 0)? 1 : 0. i.e. AND elements together, and set the mask result for that element to 1 if they don't intersect.

Possibly also relevant: http://colfaxresearch.com/knl-avx512/ says "For a practical illustration, we construct and optimize a micro-kernel for particle binning particles", with some code for AVX2 (I think). But it's behind a free registration which I haven't done. Based on the diagram, I think they're doing the actual scatter part as scalar, after some vectorized stuff to produce data they want to scatter. The first link says the 2nd link is "for previous instruction sets".

Avoid gather/scatter conflict detection by replicating the count array

When the number of buckets is small compared to the size of the array, it becomes viable to replicate the count arrays and unroll to minimize store-forwarding latency bottlenecks with repeated elements. But for a gather/scatter strategy, it also avoids the possibility of conflicts, solving the correctness problem, if we use a different array for each vector element.

How can we do that when a gather / scatter instruction only takes one array base? Make all the count arrays contiguous, and offset each index vector with one extra SIMD add instruction, fully replacing conflict detection and branching.

If the number of buckets isn't a multiple of 16, you might want to round up the array geometry so each subset of counts starts at an aligned offset. Or not, if cache locality is more important than avoiding unaligned loads in the reduction at the end.

    const size_t nb = 32;  // number of buckets
    const int VEC_WIDTH = 16;  // sizeof(__m512i) / sizeof(uint32_t)
    alignas(__m512i) uint32_t counts[nb * VEC_WIDTH] = {0};

// then in your histo loop

    __m512i idx = ...;  // in this case from LUT lookups
    idx = _mm512_add_epi32(idx, _mm512_setr_epi32(
       0*nb, 1*nb,  2*nb,  3*nb,   4*nb,  5*nb,  6*nb,  7*nb,
       8*nb, 9*nb, 10*nb, 11*nb,  12*nb, 13*nb, 14*nb, 15*nb));
       // note these are C array indexes, not byte offsets
    __m512i vc = _mm512_i32gather_epi32(idx, counts, sizeof(counts[0]));
    vc = _mm512_add_epi32(vc, _mm512_set1_epi32(1));
    _mm512_i32scatter_epi32(counts, idx, vc, sizeof(counts[0]));

https://godbolt.org/z/8Kesx7sEK shows that the above code actually compiles. (Inside a loop, the vector-constant setup could get hoisted, but not setting mask registers to all-one before each gather or scatter, or preparing a zeroed merge destination.)

Then after the main histogram loop, reduce down to one count array:

// Optionally with size_t nb as an arg
// also optionally use restrict if you never reduce in-place, into the bottom of the input.
void reduce_counts(int *output, const int *counts)
{
    for (int i = 0 ; i < nb - (VEC_WIDTH-1) ; i+=VEC_WIDTH) {
        __m512i v = _mm512_load_si512(&counts[i]);  // aligned load, full cache line
        // optional: unroll this and accumulate two vectors in parallel for better spatial locality and more ILP
        for (int offset = nb; offset < nb*VEC_WIDTH ; offset+=nb) {
            __m512i tmp = _mm512_loadu_si512(&counts[i + offset]);
            v = _mm512_add_epi32(v, tmp);
        }
        _mm512_storeu_si512(&output[i], v);
    }

   // if  nb isn't a multiple of the vector width, do some cleanup here
   // possibly using a masked store to write into a final odd-sized destination
}

Obviously this is bad with too many buckets; you end up having to zero way more memory, and loop over a lot of it at the end. Using 256-bit instead of 512-bit gathers helps, you only need half as many arrays, but efficiency of gather/scatter instructions improves with wider vectors. e.g. one vpgatherdd per 5 cycles for 256-bit on Cascade Lake, one per 9.25 for 512-bit. (And both are 4 uops for the front-end)

Or on Ice Lake, one vpscatterdd ymm per 7 cycles, one zmm per 11 cycles. (vs. 14 for 2x ymm). https://uops.info/

In your bins[1000][32] case, you could actually use the later elements of bins[i+0..15] as extra count arrays, if you zero first, at least for the first 1000-15 outer loop iterations. That would avoid touching extra memory: zeroing for the next outer loop would start at the previous counts[32], effectively.

(This would be playing a bit fast and loose with C 2D vs. 1D arrays, but all the actual accesses past the end of the [32] C array type would be via memset (i.e. unsigned char*) or via _mm* intrinsics which are also allowed to alias anything)

Tiny histograms (like 4 buckets) can use count[0] += (arr[i] == 0) and so on, which you can vectorize with SIMD packed compares - Micro Optimization of a 4-bucket histogram of a large array or list This is interesting when the number of buckets is less than or equal to the number of elements in a SIMD vector.

Optimizing SIMD histogram calculation

Like Jester I'm surprised that your SIMD code had any significant improvement. Did you compile the C code with optimization turned on?

The one additional suggestion I can make is to unroll your Packetloop loop. This is a fairly simple optimization and reduces the number of instructions per "iteration" to just two:

pextrb  ebx, xmm0, 0
inc dword [ebx * 4 + Hist]
pextrb  ebx, xmm0, 1
inc dword [ebx * 4 + Hist]
pextrb  ebx, xmm0, 2
inc dword [ebx * 4 + Hist]
...
pextrb  ebx, xmm0, 15
inc dword [ebx * 4 + Hist]

If you're using NASM you can use the %rep directive to save some typing:

%assign pixel 0
%rep 16
    pextrb  rbx, xmm0, pixel
    inc dword [rbx * 4 + Hist]
%assign pixel pixel + 1
%endrep

How to optimize histogram statistics with neon intrinsics?

You can't vectorise the stores directly, but you can pipeline them, and you can vectorise the address calculation on 32-bit platforms (and to a lesser extent on 64-bit platforms).

The first thing you'll want to do, which doesn't actually require NEON to benefit, is to unroll the histogram array so that you can have more data in flight at once:

#define NUM (7*1024*1024)
uint8 src_data[NUM];
uint32 histogram_result[256][4] = {{0}};
for (int i = 0; i < NUM; i += 4)
{
    uint32_t *p0 = &histogram_result[src_data[i + 0]][0];
    uint32_t *p1 = &histogram_result[src_data[i + 1]][1];
    uint32_t *p2 = &histogram_result[src_data[i + 2]][2];
    uint32_t *p3 = &histogram_result[src_data[i + 3]][3];
    uint32_t c0 = *p0;
    uint32_t c1 = *p1;
    uint32_t c2 = *p2;
    uint32_t c3 = *p3;
    *p0 = c0 + 1;
    *p1 = c1 + 1;
    *p2 = c2 + 1;
    *p3 = c3 + 1;
}

for (int i = 0; i < 256; i++)
{
    packed_result[i] = histogram_result[i][0]
                     + histogram_result[i][1]
                     + histogram_result[i][2]
                     + histogram_result[i][3];
}

Note that p0 to p3 can never point to the same address, so reordering their reads and writes is just fine.

From that you can vectorise the calculation of p0 to p3 with intrinsics, and you can vectorise the finalisation loop.

Test it as-is first (because I didn't!). Then you can experiment with structuring the array as result[4][256] instead of result[256][4], or using a smaller or larger unroll factor.

Applying some NEON intrinsics to this:

uint32 histogram_result[256 * 4] = {0};
static const uint16_t offsets[] = { 0x000, 0x001, 0x002, 0x003,
                                    0x000, 0x001, 0x002, 0x003 };
uint16x8_t voffs = vld1q_u16(offsets);
for (int i = 0; i < NUM; i += 8) {
    uint8x8_t p = vld1_u8(&src_data[i]);
    uint16x8_t p16 = vshll_n_u8(p, 16);
    p16 = vaddq_u16(p16, voffs);
    uint32_t c0 = histogram_result[vget_lane_u16(p16, 0)];
    uint32_t c1 = histogram_result[vget_lane_u16(p16, 1)];
    uint32_t c2 = histogram_result[vget_lane_u16(p16, 2)];
    uint32_t c3 = histogram_result[vget_lane_u16(p16, 3)];
    histogram_result[vget_lane_u16(p16, 0)] = c0 + 1;
    c0 = histogram_result[vget_lane_u16(p16, 4)];
    histogram_result[vget_lane_u16(p16, 1)] = c1 + 1;
    c1 = histogram_result[vget_lane_u16(p16, 5)];
    histogram_result[vget_lane_u16(p16, 2)] = c2 + 1;
    c2 = histogram_result[vget_lane_u16(p16, 6)];
    histogram_result[vget_lane_u16(p16, 3)] = c3 + 1;
    c3 = histogram_result[vget_lane_u16(p16, 7)];
    histogram_result[vget_lane_u16(p16, 4)] = c0 + 1;
    histogram_result[vget_lane_u16(p16, 5)] = c1 + 1;
    histogram_result[vget_lane_u16(p16, 6)] = c2 + 1;
    histogram_result[vget_lane_u16(p16, 7)] = c3 + 1;
}

With the histogram array unrolled x8 rather than x4 you might want to use eight scalar accumulators instead of four, but you have to remember that that implies eight count registers and eight address registers, which is more registers than 32-bit ARM has (since you can't use SP and PC).

Unfortunately, with address calculation in the hands of NEON intrinsics, I think the compiler can't safely reason on how it might be able to re-order reads and writes, so you have to reorder them explicitly and hope that you're doing it the best possible way.

Micro Optimization of a 4-bucket histogram of a large array or list

This should be possible at about 8 elements (1 AVX2 vector) per 2.5 clock cycles or so (per core) on a modern x86-64 like Skylake or Zen 2, using AVX2. Or per 2 clocks with unrolling. Or on your Piledriver CPU, maybe 1x 16-byte vector of indexes per 3 clocks with AVX1 _mm_cmpeq_epi32.

The general strategy works with 2 to 8 buckets. And for byte, 16-bit, or 32-bit elements. (So byte elements gives you 32 elements histogrammed per 2 clock cycles best case, with a bit of outer-loop overhead to collect byte counters before they overflow.)

Update: or mapping an int to 1UL << (array[i]*8) to increment one of 4 bytes of a counter with SIMD / SWAR addition, we can go close to 1 clock per vector of 8 int on SKL, or per 2 clocks on Zen2. (This is even more specific to 4 or fewer buckets, and int input, and doesn't scale down to SSE2. It needs variable-shifts or at least AVX1 variable-shuffles.) Using byte elements with the first strategy is probably still better in terms of elements per cycle.

As @JonasH points out, you could have different cores working on different parts of the input array. A single core can come close to saturating memory bandwidth on typical desktops, but many-core Xeons have lower per-core memory bandwidth and higher aggregate, and need more cores to saturate L3 or DRAM bandwidth. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?

A loop that runs for days at a time.

On a single input list that's very very slow to iterate so it still doesn't overflow int counters? Or repeated calls with different large Lists (like your ~900k test array)?

I believe I want to avoid increasing an index for a list or array as it seem to consume a lot of time?

That's probably because you were benchmarking with optimization disabled. Don't do that, it's not meaningful at all; different code is slowed down different amounts by disabling optimization. More explicit steps and tmp vars can often make slower debug-mode code because there are more things that have to be there to look at with a debugger. But they can just optimize into a normal pointer-increment loop when you compile with normal optimization.

Iterating through an array can compile efficiently into asm.

The slow part is the dependency chain through memory for incrementing a variable index of the array. For example on a Skylake CPU, memory-destination add with the same address repeatedly bottlenecks at about one increment per 6 clock cycles because the next add has to wait to load the value stored by the previous one. (Store-forwarding from the store buffer means it doesn't have to wait for it to commit to cache first, but it's still much slower than add into a register.) See also Agner Fog's optimization guides: https://agner.org/optimize/

With counts only distributed across 4 buckets, you'll have a lot of cases where instructions are waiting to reload the data stored by another recent instruction, so you can't even achieve the nearly 1 element per clock cycle you might if counts were well distributed over more counters that still were all hot in L1d cache.

One good solution to this problem is unrolling the loop with multiple arrays of counters. Methods to vectorise histogram in SIMD?. Like instead of int[] indexes = { 0, 0, 0, 0 }; you can make it a 2D array of four counters each. You'd have to manually unroll the loop in the source to iterate over the input array, and handle the last 0..3 left over elements after the unrolled part.

This is a good technique for small to medium arrays of counts, but becomes bad if replicating the counters starts to lead to cache misses.

Use narrow integers to save cache footprint / mem bandwidth.

Another thing you can/should do is use as narrow a type as possible for your arrays of 0..3 values: each number can fit in a byte so using 8-bit integers would save you a factor of 4 cache footprint / memory bandwidth.

x86 can efficiently load/store bytes into to/from full registers. With SSE4.1 you also have SIMD pmovzxbd to make it more efficient to auto-vectorize when you have a byte_array[i] used with an int_array[i] in a loop.

(When I say x86 I mean including x86-64, as opposed to ARM or PowerPC. Of course you don't actually want to compile 32-bit code, what Microsoft calls "x86")

With very small numbers of buckets, like 4

This looks like a job for SIMD compares. With x86 SSE2 the number of int elements per 16-byte vector of data is equal to your number of histogram bins.

You already had a SIMD sort of idea with trying to treat a number as four separate byte elements. See https://en.wikipedia.org/wiki/SIMD#Software

But 00_01_10_11 is just source-level syntax for human-readable separators in numbers, and double is a floating-point type whose internal representation isn't the same as for integers. And you definitely don't want to use strings; SIMD lets you do stuff like operating on 4 elements of an integer array at once.

The best way I can see to approach this is to separately count matches for each of the 4 values, rather than map elements to counters. We want to process multiple elements in parallel but mapping them to counters can have collisions when there are repeated values in one vector of elements. You'd need to increment that counter twice.

The scalar equivalent of this is:

int counts[4] = {0,0,0,0};
for () {
    counts[0] += (arr[i] == 0);
    counts[1] += (arr[i] == 1);
    counts[2] += (arr[i] == 2);  // count matches
  //counts[3] += (arr[i] == 3);  // we assume any that aren't 0..2 are this
}
counts[3] = size - counts[0] - counts[1] - counts[2];
// calculate count 3 from other counts

which (in C++) GCC -O3 will actually auto-vectorize exactly the way I did manually below: https://godbolt.org/z/UJfzuH. Clang even unrolls it when auto-vectorizing, so it should be better than my hand-vectorized version for int inputs. Still not as good as the alternate vpermilps strategy for that case, though.

(And you do still need to manually vectorize if you want byte elements with efficient narrow sums, only widening in an outer loop.)

With byte elements, see How to count character occurrences using SIMD. The element size is too narrow for a counter; it would overflow after 256 counts. So you have to widen either in the inner loop, or use nested loops to do some accumulating before widening.

I don't know C#, so I could write the code in x86 assembly or in C++ with intrinsics. Perhaps C++ intrinsics is more useful for you. C# has some kind of vector extensions that should make it possible to port this.

This is C++ for x86-64, using AVX2 SIMD intrinsics. See https://stackoverflow.com/tags/sse/info for some info.

// Manually vectorized for AVX2, for int element size
// Going nearly 4x as fast should be possible for byte element size

#include <immintrin.h>

void count_elements_avx2(const std::vector<int> &input,  unsigned output_counts[4])
{
    __m256i  counts[4] = { _mm256_setzero_si256() };  // 4 vectors of zeroed counters
                  // each vector holds counts for one bucket, to be hsummed at the end

    size_t size = input.size();
    for(size_t i = 0 ; i<size ; i+=8) {  // 8x 32-bit elements per vector
        __m256i v = _mm256_loadu_si256((const __m256i*)&input[i]);  // unaligned load of 8 ints
        for (int val = 0 ; val < 3; val++) {
           // C++ compilers will unroll this with 3 vector constants and no memory access
            __m256i match = _mm256_cmpeq_epi32(v, _mm256_set1_epi32(val));  // 0 or all-ones aka -1
            counts[val] = _mm256_sub_epi32(counts[val], match);   // x -= -1 or 0 conditional increment
        }
    }

    // transpose and sum 4 vectors of 8 elements down to 1 vector of 4 elements
    __m128i summed_counts = hsum_xpose(counts);   // helper function defined in Godbolt link
    _mm_storeu_si128((__m128i*)output_counts, summed_counts);

    output_counts[3] = size - output_counts[0]
                       - output_counts[1] - output_counts[2];

    // TODO: handle the last size%8 input elements; scalar would be easy
}

This compiles nicely with clang (on the Godbolt compiler explorer). Presumably you can write C# that compiles to similar machine code. If not, consider calling native code from a C++ compiler (or hand-written in asm if you can't get truly optimal code from the compiler). If your real use-case runs as many iterations as your benchmark, that could amortize the extra overhead if the input array doesn't have to get copied.

 # from an earlier version of the C++, doing all 4 compares in the inner loop
 # clang -O3 -march=skylake
.LBB0_2:                                     # do {
    vmovdqu ymm7, ymmword ptr [rcx + 4*rdx]    # v = load arr[i + 0..7]
    vpcmpeqd        ymm8, ymm7, ymm3           # compare v == 0
    vpsubd  ymm4, ymm4, ymm8                   # total0 -= cmp_result
    vpcmpeqd        ymm8, ymm7, ymm5
    vpsubd  ymm2, ymm2, ymm8
    vpcmpeqd        ymm7, ymm7, ymm6           # compare v == 2
    vpsubd  ymm1, ymm1, ymm7                   # total2 -= cmp_result
    add     rdx, 8                             # i += 8
    cmp     rdx, rax
    jb      .LBB0_2                          # }while(i < size)

Estimated best-case Skylake performance: ~2.5 cycles per vector (8 int or 32 int8_t)

Or 2 with unrolling.

Without AVX2, using only SSE2, you'd have some extra movdqa instructions and only be doing 4 elements per vector. This would still be a win vs. scalar histogram in memory, though. Even 1 element / clock is nice, and should be doable with SSE2 that can run on any x86-64 CPU.

Assuming no cache misses of course, with hardware prefetch into L1d staying ahead of the loop. This might only happen with the data already hot in L2 cache at least. I'm also assuming no stalls from memory alignment; ideally your data is aligned by 32 bytes. If it usually isn't, possibly worth processing the first unaligned part and then using aligned loads, if the array is large enough.

For byte elements, the inner-most loop will look similar (with vpcmpeqb and vpsubb but run only at most 255 (not 256) iterations before hsumming to 64-bit counters, to avoid overflow. So throughput per vector will be the same, but with 4x as many elements per vector.

See https://agner.org/optimize/ and https://uops.info/ for performance analysis details. e.g. vpcmpeqd on uops.info

The inner loop is only 9 fused-domain uops for Haswell/Skylake, so best case front-end bottleneck of about 1 iteration per 2.25 cycles (the pipeline is 4 uops wide). Small-loop effects get in the way somewhat: Is performance reduced when executing loops whose uop count is not a multiple of processor width? - Skylake has its loop buffer disabled by a microcode update for an erratum, but even before that a 9 uop loop ended up issuing slightly worse than one iter per 2.25 cycles on average, let's say 2.5 cycles.

Skylake runs vpsubd on ports 0,1, or 5, and runs vpcmpeqd on ports 0 or 1. So the back-end bottleneck on ports 0,1,5 is 6 vector ALU uops for 3 ports, or 1 iteration per 2 cycles. So the front-end bottleneck dominates. (Ice Lake's wider front-end may let it bottleneck on the back-end even without unrolling; same back-end throughputs there unless you use AVX512...)

If clang had indexed from the end of the array and counted the index up towards zero it could have saved a uop for a total of 8 uops = one iter per 2 cycles in the front-end, matching the back-end bottleneck.

How to Speed Up This Histogram of Lut Lookups