How to implement atoi using SIMD?

The algorithm and its implementation are now finished. They are complete and (moderately) tested (updated for lower constant memory usage and for tolerating a leading plus character).

The properties of this code are as follows:

  • Works for int and uint,
    from MIN_INT=-2147483648 to MAX_INT=2147483647 and
    from MIN_UINT=0 to MAX_UINT=4294967295
  • A leading '-' char indicates a negative number (as expected); a leading '+' char is ignored
  • Leading zeros (with or without sign char) are ignored
  • Overflow is ignored - bigger numbers just wrap around
  • Zero-length strings result in the value 0 (= -0)
  • Invalid characters are recognized and the conversion ends at the first invalid char
  • At least 16 bytes after the last leading zero must be accessible; possible security implications of reading past the end of the string (EOS) are left to the caller
  • Only SSE4.2 is needed

About this implementation:

  • This code sample can be assembled with the GNU Assembler (as) using .intel_syntax noprefix at the beginning
  • Data footprint for constants is 64 bytes (4 × 128-bit XMM values), equaling one cache line.
  • Code footprint is 46 instructions with 51 micro-ops and 64 cycles of latency
  • One loop for removal of leading zeros, otherwise no jumps except for error handling, so...
  • Time complexity is O(1)

The approach of the algorithm:

- Pointer to number string is expected in ESI
- Check if first char is '-', then indicate if negative number in EDX (**A**)
- Check for leading zeros and EOS (**B**)
- Check string for valid digits and get strlen() of valid chars (**C**)
- Reverse the string so that the place values are fixed:
10^0 is always at BYTE 15
10^1 is always at BYTE 14
10^2 is always at BYTE 13
10^3 is always at BYTE 12
10^4 is always at BYTE 11
...
and mask out all following chars (**D**)
- Subtract saturated '0' from each of the 16 possible chars (**1**)
- Take the 16 consecutive byte values and split them into WORDs
in two XMM registers (**2**)
P O N M L K J I | H G F E D C B A ->
H G F E | D C B A (XMM0)
P O N M | L K J I (XMM1)
- Multiply each WORD by its place value modulo 10000 (1, 10, 100, 1000)
(factors smaller than MAX_WORD, 4 factors per QWORD/half XMM)
(**2**) so we can horizontally combine twice before another multiply.
The PMADDWD instruction can do this and the next step:
- Horizontally add adjacent WORDs to DWORDs (**3**)
H*1000+G*100 F*10+E*1 | D*1000+C*100 B*10+A*1 (XMM0)
P*1000+O*100 N*10+M*1 | L*1000+K*100 J*10+I*1 (XMM1)
- Horizontally add adjacent DWORDs from XMM0 and XMM1 to XMM0 (**4**)
xmmDst[31-0] = xmm0[63-32] + xmm0[31-0]
xmmDst[63-32] = xmm0[127-96] + xmm0[95-64]
xmmDst[95-64] = xmm1[63-32] + xmm1[31-0]
xmmDst[127-96] = xmm1[127-96] + xmm1[95-64]
- Values in XMM0 are multiplied with the factors (**5**)
P*1000+O*100+N*10+M*1 (DWORD factor 1000000000000 = too big for DWORD, but possibly useful for QWORD number strings)
L*1000+K*100+J*10+I*1 (DWORD factor 100000000)
H*1000+G*100+F*10+E*1 (DWORD factor 10000)
D*1000+C*100+B*10+A*1 (DWORD factor 1)
- The last step is adding these four DWORDs together with 2*PHADDD emulated by 2*(PSHUFD+PADDD)
- xmm0[31-0] = xmm0[63-32] + xmm0[31-0] (**6**)
xmm0[63-32] = xmm0[127-96] + xmm0[95-64]
(the upper QWORD contains the same and is ignored)
- xmm0[31-0] = xmm0[63-32] + xmm0[31-0] (**7**)
- If the number is negative (indicated in EDX by 000...0=pos or 111...1=neg), negate it (**8**)
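
For readers who prefer intrinsics over assembly, here is a minimal sketch of the purely numeric steps (**1**)-(**7**) above. It is not part of the original code: the helper name combine_digits_sse is hypothetical, and it assumes the 16 input bytes have already been reversed and zero-masked as in step (**D**), with sign handling and leading-zero removal done elsewhere.

#include <immintrin.h>
#include <cstdint>

// Hypothetical helper: combines 16 reversed, zero-masked ASCII digit bytes
// into a 32-bit value. Requires SSE4.1 for _mm_mullo_epi32.
static inline uint32_t combine_digits_sse(__m128i digits)
{
    const __m128i zero    = _mm_setzero_si128();
    const __m128i ascii0  = _mm_set1_epi8('0');
    const __m128i factor1 = _mm_set_epi16(1000, 100, 10, 1, 1000, 100, 10, 1);
    const __m128i factor2 = _mm_set_epi32(0, 100000000, 10000, 1);

    __m128i v  = _mm_subs_epu8(digits, ascii0);                 // (**1**) chars -> digit values
    __m128i lo = _mm_unpacklo_epi8(v, zero);                    // (**2**) bytes -> words
    __m128i hi = _mm_unpackhi_epi8(v, zero);
    lo = _mm_madd_epi16(lo, factor1);                           // (**3**) multiply by 1,10,100,1000 and pair-add
    hi = _mm_madd_epi16(hi, factor1);
    __m128i sums = _mm_hadd_epi32(lo, hi);                      // (**4**) four 4-digit groups
    sums = _mm_mullo_epi32(sums, factor2);                      // (**5**) scale groups by 1, 10^4, 10^8 (top group * 0)
    sums = _mm_add_epi32(sums, _mm_shuffle_epi32(sums, 0xEE));  // (**6**)
    sums = _mm_add_epi32(sums, _mm_shuffle_epi32(sums, 0x55));  // (**7**) total of all four groups in lane 0
    return (uint32_t)_mm_cvtsi128_si32(sums);
}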

And the sample implementation in GNU Assembler with intel syntax:

.intel_syntax noprefix
.data
.align 64
ddqDigitRange: .byte '0','9',0,0,0,0,0,0,0,0,0,0,0,0,0,0
ddqShuffleMask:.byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0
ddqFactor1: .word 1,10,100,1000, 1,10,100,1000
ddqFactor2: .long 1,10000,100000000,0
.text
_start:
mov esi, lpInputNumberString
/* (**A**) indicate negative number in EDX */
mov eax, -1
xor ecx, ecx
xor edx, edx
mov bl, byte ptr [esi]
cmp bl, '-'
cmove edx, eax
cmp bl, '+'
cmove ecx, eax
sub esi, edx
sub esi, ecx
/* (**B**)remove leading zeros */
xor eax,eax /* return value ZERO */
remove_leading_zeros:
inc esi
cmp byte ptr [esi-1], '0' /* skip leading zeros */
je remove_leading_zeros
cmp byte ptr [esi-1], 0 /* catch empty string/number */
je FINISH
dec esi
/* check for valid digit-chars and invert from front to back */
pxor xmm2, xmm2
movdqa xmm0, xmmword ptr [ddqDigitRange]
movdqu xmm1, xmmword ptr [esi]
pcmpistri xmm0, xmm1, 0b00010100 /* (**C**) imm8=Unsigned bytes, Ranges, Negative Polarity(-), returns strlen() in ECX */
jo FINISH /* if first char is invalid return 0 - prevent processing empty string - 0 is still in EAX */
mov al , '0' /* value to subtract from chars */
sub ecx, 16 /* len-16=negative to zero for shuffle mask */
movd xmm0, ecx
pshufb xmm0, xmm2 /* broadcast CL to all 16 BYTEs */
paddb xmm0, xmmword ptr [ddqShuffleMask] /* Generate permute mask for PSHUFB - all bytes < 0 have highest bit set means place gets zeroed */
pshufb xmm1, xmm0 /* (**D**) permute - now from highest to lowest BYTE are factors 10^0, 10^1, 10^2, ... */
movd xmm0, eax /* AL='0' from above */
pshufb xmm0, xmm2 /* broadcast AL to XMM0 */
psubusb xmm1, xmm0 /* (**1**) */
movdqa xmm0, xmm1
punpcklbw xmm0, xmm2 /* (**2**) */
punpckhbw xmm1, xmm2
pmaddwd xmm0, xmmword ptr [ddqFactor1] /* (**3**) */
pmaddwd xmm1, xmmword ptr [ddqFactor1]
phaddd xmm0, xmm1 /* (**4**) */
pmulld xmm0, xmmword ptr [ddqFactor2] /* (**5**) */
pshufd xmm1, xmm0, 0b11101110 /* (**6**) */
paddd xmm0, xmm1
pshufd xmm1, xmm0, 0b01010101 /* (**7**) */
paddd xmm0, xmm1
movd eax, xmm0
/* negate if negative number */
add eax, edx /* (**8**) */
xor eax, edx
FINISH:
/* EAX is return (u)int value */

The result of Intel-IACA Throughput Analysis for Haswell 32-bit:

Throughput Analysis Report
--------------------------
Block Throughput: 16.10 Cycles Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 9.5 0.0 | 10.0 | 4.5 4.5 | 4.5 4.5 | 0.0 | 11.1 | 11.4 | 0.0 |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 0* | | | | | | | | | | xor eax, eax
| 0* | | | | | | | | | | xor ecx, ecx
| 0* | | | | | | | | | | xor edx, edx
| 1 | | 0.1 | | | | | 0.9 | | | dec eax
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | CP | mov bl, byte ptr [esi]
| 1 | | | | | | | 1.0 | | CP | cmp bl, 0x2d
| 2 | 0.1 | 0.2 | | | | | 1.8 | | CP | cmovz edx, eax
| 1 | 0.1 | 0.5 | | | | | 0.4 | | CP | cmp bl, 0x2b
| 2 | 0.5 | 0.2 | | | | | 1.2 | | CP | cmovz ecx, eax
| 1 | 0.2 | 0.5 | | | | | 0.2 | | CP | sub esi, edx
| 1 | 0.2 | 0.5 | | | | | 0.3 | | CP | sub esi, ecx
| 0* | | | | | | | | | | xor eax, eax
| 1 | 0.3 | 0.1 | | | | | 0.6 | | CP | inc esi
| 2^ | 0.3 | | 0.5 0.5 | 0.5 0.5 | | | 0.6 | | | cmp byte ptr [esi-0x1], 0x30
| 0F | | | | | | | | | | jz 0xfffffffb
| 2^ | 0.6 | | 0.5 0.5 | 0.5 0.5 | | | 0.4 | | | cmp byte ptr [esi-0x1], 0x0
| 0F | | | | | | | | | | jz 0x8b
| 1 | 0.1 | 0.9 | | | | | | | CP | dec esi
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | | movdqa xmm0, xmmword ptr [0x80492f0]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | CP | movdqu xmm1, xmmword ptr [esi]
| 0* | | | | | | | | | | pxor xmm2, xmm2
| 3 | 2.0 | 1.0 | | | | | | | CP | pcmpistri xmm0, xmm1, 0x14
| 1 | | | | | | | 1.0 | | | jo 0x6e
| 1 | | 0.4 | | | | 0.1 | 0.5 | | | mov al, 0x30
| 1 | 0.1 | 0.5 | | | | 0.1 | 0.3 | | CP | sub ecx, 0x10
| 1 | | | | | | 1.0 | | | CP | movd xmm0, ecx
| 1 | | | | | | 1.0 | | | CP | pshufb xmm0, xmm2
| 2^ | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | | | CP | paddb xmm0, xmmword ptr [0x80492c0]
| 1 | | | | | | 1.0 | | | CP | pshufb xmm1, xmm0
| 1 | | | | | | 1.0 | | | | movd xmm0, eax
| 1 | | | | | | 1.0 | | | | pshufb xmm0, xmm2
| 1 | | 1.0 | | | | | | | CP | psubusb xmm1, xmm0
| 0* | | | | | | | | | CP | movdqa xmm0, xmm1
| 1 | | | | | | 1.0 | | | CP | punpcklbw xmm0, xmm2
| 1 | | | | | | 1.0 | | | | punpckhbw xmm1, xmm2
| 2^ | 1.0 | | 0.5 0.5 | 0.5 0.5 | | | | | CP | pmaddwd xmm0, xmmword ptr [0x80492d0]
| 2^ | 1.0 | | 0.5 0.5 | 0.5 0.5 | | | | | | pmaddwd xmm1, xmmword ptr [0x80492d0]
| 3 | | 1.0 | | | | 2.0 | | | CP | phaddd xmm0, xmm1
| 3^ | 2.0 | | 0.5 0.5 | 0.5 0.5 | | | | | CP | pmulld xmm0, xmmword ptr [0x80492e0]
| 1 | | | | | | 1.0 | | | CP | pshufd xmm1, xmm0, 0xee
| 1 | | 1.0 | | | | | | | CP | paddd xmm0, xmm1
| 1 | | | | | | 1.0 | | | CP | pshufd xmm1, xmm0, 0x55
| 1 | | 1.0 | | | | | | | CP | paddd xmm0, xmm1
| 1 | 1.0 | | | | | | | | CP | movd eax, xmm0
| 1 | | | | | | | 1.0 | | CP | add eax, edx
| 1 | | | | | | | 1.0 | | CP | xor eax, edx
Total Num Of Uops: 51

The result of Intel-IACA Latency Analysis for Haswell 32-bit:

Latency Analysis Report
---------------------------
Latency: 64 Cycles

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - Intel(R) AVX to Intel(R) SSE code switch, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

The Resource delay is counted since all the sources of the instructions are ready
and until the needed resource becomes available

| Inst | Resource Delay In Cycles | |
| Num | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | FE | |
-------------------------------------------------------------------------
| 0 | | | | | | | | | | | xor eax, eax
| 1 | | | | | | | | | | | xor ecx, ecx
| 2 | | | | | | | | | | | xor edx, edx
| 3 | | | | | | | | | | | dec eax
| 4 | | | | | | | | | 1 | CP | mov bl, byte ptr [esi]
| 5 | | | | | | | | | | CP | cmp bl, 0x2d
| 6 | | | | | | | | | | CP | cmovz edx, eax
| 7 | | | | | | | | | | CP | cmp bl, 0x2b
| 8 | | | | | | | 1 | | | CP | cmovz ecx, eax
| 9 | | | | | | | | | | CP | sub esi, edx
| 10 | | | | | | | | | | CP | sub esi, ecx
| 11 | | | | | | | | | 3 | | xor eax, eax
| 12 | | | | | | | | | | CP | inc esi
| 13 | | | | | | | | | | | cmp byte ptr [esi-0x1], 0x30
| 14 | | | | | | | | | | | jz 0xfffffffb
| 15 | | | | | | | | | | | cmp byte ptr [esi-0x1], 0x0
| 16 | | | | | | | | | | | jz 0x8b
| 17 | | | | | | | | | | CP | dec esi
| 18 | | | | | | | | | 4 | | movdqa xmm0, xmmword ptr [0x80492f0]
| 19 | | | | | | | | | | CP | movdqu xmm1, xmmword ptr [esi]
| 20 | | | | | | | | | 5 | | pxor xmm2, xmm2
| 21 | | | | | | | | | | CP | pcmpistri xmm0, xmm1, 0x14
| 22 | | | | | | | | | | | jo 0x6e
| 23 | | | | | | | | | 6 | | mov al, 0x30
| 24 | | | | | | | | | | CP | sub ecx, 0x10
| 25 | | | | | | | | | | CP | movd xmm0, ecx
| 26 | | | | | | | | | | CP | pshufb xmm0, xmm2
| 27 | | | | | | | | | 7 | CP | paddb xmm0, xmmword ptr [0x80492c0]
| 28 | | | | | | | | | | CP | pshufb xmm1, xmm0
| 29 | | | | | | 1 | | | | | movd xmm0, eax
| 30 | | | | | | 1 | | | | | pshufb xmm0, xmm2
| 31 | | | | | | | | | | CP | psubusb xmm1, xmm0
| 32 | | | | | | | | | | CP | movdqa xmm0, xmm1
| 33 | | | | | | | | | | CP | punpcklbw xmm0, xmm2
| 34 | | | | | | | | | | | punpckhbw xmm1, xmm2
| 35 | | | | | | | | | 9 | CP | pmaddwd xmm0, xmmword ptr [0x80492d0]
| 36 | | | | | | | | | 9 | | pmaddwd xmm1, xmmword ptr [0x80492d0]
| 37 | | | | | | | | | | CP | phaddd xmm0, xmm1
| 38 | | | | | | | | | 10 | CP | pmulld xmm0, xmmword ptr [0x80492e0]
| 39 | | | | | | | | | | CP | pshufd xmm1, xmm0, 0xee
| 40 | | | | | | | | | | CP | paddd xmm0, xmm1
| 41 | | | | | | | | | | CP | pshufd xmm1, xmm0, 0x55
| 42 | | | | | | | | | | CP | paddd xmm0, xmm1
| 43 | | | | | | | | | | CP | movd eax, xmm0
| 44 | | | | | | | | | | CP | add eax, edx
| 45 | | | | | | | | | | CP | xor eax, edx

Resource Conflict on Critical Paths:
-----------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------
| Cycles | 0 0 | 0 | 0 0 | 0 0 | 0 | 0 | 1 | 0 |
-----------------------------------------------------------------

List Of Delays On Critical Paths
-------------------------------
6 --> 8 1 Cycles Delay On Port6

An alternative handling, suggested in the comments by Peter Cordes, is to replace the last two add+xor instructions with an imul. This concentration of opcodes is likely to be superior. Unfortunately IACA doesn't support that instruction and reports "! - instruction not supported, was not accounted in Analysis". Nevertheless, although I like the reduction in opcodes and the change from (2 uops, 2c latency) to (1 uop, 3c latency - "worse latency, but still one m-op on AMD"), I prefer to leave the choice to the implementer. I haven't checked whether the following code is sufficient for parsing any number; it is mentioned only for completeness, and code modifications in other parts may be necessary (especially for handling positive numbers).

The alternative may be replacing the last two lines with:

  ...
/* negate if negative number */
imul eax, edx
FINISH:
/* EAX is return (u)int value */

C++ Fastest numerical string to long parsing

You might want to have a look at loop unrolling.
When the body of a loop is short enough, checking the loop condition every iteration might be relatively expensive.

A specific and interesting way of implementing loop unrolling is called Duff's device: https://en.wikipedia.org/wiki/Duff%27s_device

Here's the version for your function:

#include <cstdint>

// Duff's-device-style unrolled parser: assumes len >= 1 and that s points
// at exactly len ASCII digits (no sign handling, no overflow checking).
inline uint64_t strtol_duff(const char* s, int len)
{
    uint64_t val = 0;
    int n = (len + 7) / 8;   // number of (possibly partial) 8-digit groups
    int i = 0;
    switch (len % 8) {
    case 0: do { val = val * 10 + (*(s + i++) - '0');
    case 7:      val = val * 10 + (*(s + i++) - '0');
    case 6:      val = val * 10 + (*(s + i++) - '0');
    case 5:      val = val * 10 + (*(s + i++) - '0');
    case 4:      val = val * 10 + (*(s + i++) - '0');
    case 3:      val = val * 10 + (*(s + i++) - '0');
    case 2:      val = val * 10 + (*(s + i++) - '0');
    case 1:      val = val * 10 + (*(s + i++) - '0');
            } while (--n > 0);
    }
    return val;
}

To be honest, in your case I believe you will not see a huge benefit because the loop's body is not that tiny. It's all very much system dependent and requires experimentation (like most optimizations).
Good compiler optimizers might unroll the loop automatically if it is actually beneficial.

But it's worth a try.

Are there any problems for which SIMD outperforms Cray-style vectors?

The "traditional" vector approaches (Cray, CDC/ETA, NEC, etc) arose in an era (~1976 to ~1992) with limited transistor budgets and commercially available low-latency SRAM main memories. In this technology regime, processors did not have the transistor budget to implement the full scoreboarding and interlocking for out-of-order operations that is currently available to allow pipelining of multi-cycle floating-point operations. Instead, a vector instruction set was created. Vector arithmetic instructions guaranteed that successive operations within the vector were independent and could be pipelined. It was relatively easy to extend the hardware to allow multiple vector operations in parallel, since the dependency checking only needed to be done "per vector" instead of "per element".

The Cray ISA was RISC-like in that data was loaded from memory into vector registers, arithmetic was performed register-to-register, then results were stored from vector registers back to memory. The maximum vector length was initially 64 elements, later 128 elements.

The CDC/ETA systems used a "memory-to-memory" architecture, with arithmetic instructions specifying memory locations for all inputs and outputs, along with a vector length of 1 to 65535 elements.

None of the "traditional" vector machines used data caches for vector operations, so performance was limited by the rate at which data could be loaded from memory. The SRAM main memories were a major fraction of the cost of the systems. In the early 1990's SRAM cost/bit was only about 2x that of DRAM, but DRAM prices dropped so rapidly that by 2002 SRAM price/MiB was 75x that of DRAM -- no longer even remotely acceptable.

The SRAM memories of the traditional machines were word-addressable (64-bit words) and were very heavily banked to allow nearly full speed for linear, strided (as long as powers of two were avoided), and random accesses. This led to a programming style that made extensive use of non-unit-stride memory access patterns. These access patterns cause performance problems on cached machines, and over time developers using cached systems quit using them -- so codes were less able to exploit this capability of the vector systems.
As codes were being re-written to use cached systems, it slowly became clear that caches work quite well for the majority of the applications that had been running on the vector machines. Re-use of cached data decreased the amount of memory bandwidth required, so applications ran much better on the microprocessor-based systems than expected from the main memory bandwidth ratios.

By the late 1990's, the market for traditional vector machines was nearly gone, with workloads transitioned primarily to shared-memory machines using RISC processors and multi-level cache hierarchies. A few government-subsidized vector systems were developed (especially in Japan), but these had little impact on high performance computing, and none on computing in general.

The story is not over -- after many not-very-successful tries (by several vendors) at getting vectors and caches to work well together, NEC has developed a very interesting system (NEC SX-Aurora Tsubasa) that combines a multicore vector register processor design with DRAM (HBM) main memory, and an effective shared cache. I especially like the ability to generate over 300 GB/s of memory bandwidth using a single thread of execution -- this is 10x-25x the bandwidth available with a single thread with AMD or Intel processors.

So the answer is that the low cost of microprocessors with cached memory drove vector machines out of the marketplace even before SIMD was included. SIMD had clear advantages for certain specialized operations, and has become more general over time -- albeit with diminishing benefits as the SIMD width is increased. The vector approach is not dead in an architectural sense (e.g., the NEC Vector Engine), but its advantages are generally considered to be overwhelmed by the disadvantages of software incompatibility with the dominant architectural model.

implement SIMD in C++

Your question represents some confusion on what is going on. The i,j,k variables are almost certainly held in registers already, assuming you are compiling with optimizations on (which you should do - add "-O2" to your icc invocation).

You can use an asm block, but an easier method considering you're already using ICC is to use the SSE intrinsics. Intel's documentation for them is here - http://www.intel.com/software/products/compilers/clin/docs/ug_cpp/comm1019.htm

It looks like you can SIMD-ize the top-level loop, though it's going to depend greatly on what your delta function is.
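
For illustration, a hypothetical intrinsics version of such a loop might look like the sketch below. It assumes delta is a simple element-wise operation (here a squared difference) and that the row length is a multiple of 4, neither of which is known from the question; the function name delta_rows is made up.

#include <xmmintrin.h>  // SSE intrinsics

// Illustrative only: assumes delta(a, b) == (a - b)^2 and n % 4 == 0.
// A real version needs a scalar tail loop and the actual delta function.
void delta_rows(const float* a, const float* b, float* out, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   // process 4 floats per iteration
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 d  = _mm_sub_ps(va, vb);
        _mm_storeu_ps(out + i, _mm_mul_ps(d, d));
    }
}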

Find the first instance of a character using simd

You have the right idea with _mm256_cmpeq_epi8 -> _mm256_movemask_epi8. AFAIK, that's the optimal way to implement this for Intel CPUs at least. PMOVMSKB r32, ymm is the same speed as the XMM 16-byte version, so it would be a huge loss to unpack the two lanes of a 256b vector and movemask them separately and then recombine the integer results. (Source: Agner Fog's instruction table. See other perf links in the x86 tag wiki.)

Make the code inside the loop as efficient as possible by leaving the ffs until after you've identified a non-zero result from _mm256_movemask_epi8.

TEST/JCC can macro fuse into a single uop, but BSF/JCC doesn't, so it takes an extra instruction. (And you'd be hard-pressed to get a C compiler to emit BSF/JCC anyway. More likely branching on the result of ffs would give you some kind of test for the input being non-zero, then BSF, then add 1, then compare-and-branch. That's obviously horrible compared to just testing the movemask result.)

Also note that for similar problems, comparing the movemask (e.g. to check that it's 0xFFFFFFFF) is just as efficient as branching on it being non-zero.
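
As a rough sketch of that structure (the function name is a placeholder and tail handling is omitted; this is not a drop-in memchr):

#include <immintrin.h>
#include <cstddef>

// Compare 32 bytes at a time, branch on the movemask, and only do the
// bit-scan after a non-zero mask has been found.
const char* find_byte_avx2(const char* buf, size_t len, char needle)
{
    const __m256i target = _mm256_set1_epi8(needle);
    for (size_t i = 0; i + 32 <= len; i += 32) {
        __m256i chunk = _mm256_loadu_si256((const __m256i*)(buf + i));
        unsigned mask = (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, target));
        if (mask != 0)                              // test/jcc on the mask
            return buf + i + __builtin_ctz(mask);   // bit-scan only on a hit
    }
    return nullptr;                                 // tail bytes not handled here
}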


As Paul R suggested, looking at some strlen, strchr, and memchr implementations may be informative. There are multiple hand-written asm implementations in open-source libc implementations, and other places. (e.g. glibc, and Agner Fog's asmlib.)

Many of glibc's versions scan up to an alignment boundary, then use an unrolled loop that reads 64B at a time (in 4 SSE vectors, since I don't think glibc has an AVX2 version).

To optimize for long strings, reduce overhead from testing the compare results by ORing the compare results together, and check that. If you find a hit, go back and re-test your vectors to see which vector had the hit.

It may be somewhat more efficient to do the ffs on one 64-bit integer that you built up out of multiple movemask results (with shift and |). I'm not sure about doing this inside the loop before testing for zero; I don't remember if one of glibc's strlen strategies did that or not.
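
A sketch of that inner-loop idea, assuming a 32-byte-aligned, 128-byte-readable block (the helper name is a placeholder, not glibc's actual code):

#include <immintrin.h>

// Check 4 vectors (128 bytes) with a single movemask + test; on a hit the
// caller re-tests the four compare results to locate the exact byte.
static inline bool block_has_byte(const char* p, __m256i target)
{
    __m256i c0 = _mm256_cmpeq_epi8(_mm256_load_si256((const __m256i*)p + 0), target);
    __m256i c1 = _mm256_cmpeq_epi8(_mm256_load_si256((const __m256i*)p + 1), target);
    __m256i c2 = _mm256_cmpeq_epi8(_mm256_load_si256((const __m256i*)p + 2), target);
    __m256i c3 = _mm256_cmpeq_epi8(_mm256_load_si256((const __m256i*)p + 3), target);
    __m256i any = _mm256_or_si256(_mm256_or_si256(c0, c1), _mm256_or_si256(c2, c3));
    return _mm256_movemask_epi8(any) != 0;
}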


Everything I've suggested here is stuff that can be seen in asm in various glibc strategies for strlen, memchr, and related functions. Here's sysdeps/x86_64/strlen.S, but there may be another source file somewhere using more than baseline SSE2. (Or not; I might be thinking of a different function, and maybe there's nothing to be gained beyond SSE2 until AVX (3-operand insns) and AVX2 (256b integer vectors).)

See also:

  • glibc's strchr-avx2.S (Woboq.org has a nice source browser with a useful search for filenames / symbols).
  • glibc's memchr-avx2.S

glibc's memchr uses PMAXUB instead of POR. I'm not sure if that's useful for some arcane microarchitectural reason, but it runs on fewer ports on most CPUs. Perhaps that's desired, to avoid resource conflicts with something else? IDK, seems weird, since it competes with PCMPEQB.

Is there an SIMD instruction to achieve batch array memory index mapping?

There are no byte or word gather instructions in AVX2 / AVX512, and no gathers at all in NEON. The DWORD gathers that do exist are much slower than a multiply! e.g. one per 5 cycle throughput for vpgatherdd ymm,[reg + scale*ymm], ymm, according to Agner Fog's instruction table for Skylake.

You can use shuffles as a parallel table-lookup. But your table for each lookup is 256 16-bit words. That's 512 bytes. AVX512 has some shuffles that select from the concatenation of 2 registers, but that's "only" 2x 64 bytes, and the byte or word element-size versions of those are multiple uops on current CPUs. (e.g. AVX512BW vpermi2w). They are still fantastically powerful compared to vpshufb, though.

So using a shuffle as a LUT won't work in your case, but it does work very well for some cases, e.g. for popcount you can split bytes into 4-bit nibbles and use vpshufb to do 32 lookups in parallel from a 16-element table of bytes.
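
For reference, a sketch of that nibble-LUT popcount (the standard technique, not code from the question):

#include <immintrin.h>

// Per-byte popcount via vpshufb: each 16-entry lane of the LUT holds the
// popcounts of the values 0..15, so one shuffle does 32 parallel lookups.
static inline __m256i popcount_bytes_avx2(__m256i v)
{
    const __m256i lut = _mm256_setr_epi8(
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4,
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    const __m256i low_mask = _mm256_set1_epi8(0x0f);
    __m256i lo = _mm256_and_si256(v, low_mask);                        // low nibbles
    __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);  // high nibbles
    return _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                           _mm256_shuffle_epi8(lut, hi));
}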

Normally for SIMD you want to replace table lookups with computation, because computation is much more SIMD friendly.


Suck it up and use pmullw / _mm_mullo_epi16. You have instruction-level parallelism, and Skylake has 2 per clock throughput for 16-bit SIMD multiply (but 5 cycle latency). For image processing, normally throughput matters more than latency, as long as you keep the latency within reason so out-of-order execution can hide it.

If your multipliers ever have few enough 1 bits in their binary representation, you could consider using shift/add instead of an actual multiply. e.g. B * 29 = B * 32 - B - B * 2, or (B<<5) - (B<<1) - B. That many instructions probably have more throughput cost than a single multiply, though. If you could do it with just 2 terms, it might be worth it. (But then again, still maybe not, depending on the CPU. Total instruction throughput and vector ALU bottlenecks are a big deal.)
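
Both options from the preceding paragraphs, sketched with SSE2 intrinsics (the constant 29 is just the example from above; the function names are made up):

#include <emmintrin.h>  // SSE2

// pmullw version: usually the right choice given the 2-per-clock throughput.
__m128i mul29_pmullw(__m128i b)
{
    return _mm_mullo_epi16(b, _mm_set1_epi16(29));
}

// Shift/add alternative: B*29 = B*32 - B*2 - B. Three vector ops; as noted
// above, this likely costs more total throughput than the single multiply.
__m128i mul29_shift_add(__m128i b)
{
    __m128i b32 = _mm_slli_epi16(b, 5);               // B * 32
    __m128i b2  = _mm_slli_epi16(b, 1);               // B * 2
    return _mm_sub_epi16(_mm_sub_epi16(b32, b2), b);  // B*32 - B*2 - B
}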


