RGBA to ABGR: Inline ARM NEON asm for iOS/Xcode

As stated in the edits to the original question, it turned out that I needed a different assembly implementation for arm64 and armv7.

#ifdef __ARM_NEON
#if __LP64__
    // arm64: load 16 bytes, reverse the bytes within each 32-bit pixel, store
    asm volatile("ldr        q0, [%0], #16 \n"
                 "rev32.16b  v0, v0        \n"
                 "str        q0, [%1], #16 \n"
                 : "+r"(src), "+r"(dst)   // post-incremented, so read/write operands
                 :
                 : "v0", "memory"
                 );
#else
    // armv7: the same operation in 32-bit NEON syntax
    asm volatile("vld1.32  {d0, d1}, [%0]! \n"
                 "vrev32.8 q0, q0          \n"
                 "vst1.32  {d0, d1}, [%1]! \n"
                 : "+r"(src), "+r"(dst)
                 :
                 : "d0", "d1", "memory"
                 );
#endif
#else
    // non-NEON fallback omitted here
#endif

The intrinsics code that I posted in the original question generated surprisingly good assembly, though, and it also produced the arm64 version for me, so it may be a better idea to use intrinsics in the future.
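For comparison, here is a minimal intrinsics sketch of the same byte swap (the function name and buffer handling are my own assumptions, not code from the question); vrev32q_u8 reverses the bytes within each 32-bit pixel, which is exactly what the asm above does, and the compiler emits both the armv7 and arm64 versions from it:

#include <arm_neon.h>
#include <stdint.h>
#include <stddef.h>

// Hypothetical helper: swap RGBA -> ABGR for a multiple-of-4 pixel count.
void rgba_to_abgr(uint8_t *dst, const uint8_t *src, size_t num_pixels) {
    for (size_t i = 0; i + 4 <= num_pixels; i += 4) {
        uint8x16_t px = vld1q_u8(src + 4 * i);  // load 4 RGBA pixels (16 bytes)
        px = vrev32q_u8(px);                    // reverse bytes within each 32-bit lane
        vst1q_u8(dst + 4 * i, px);              // store the pixels as ABGR
    }
    // A scalar tail loop would be needed if num_pixels isn't a multiple of 4.
}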

Inline asm compiled in Xcode for Simulator but failed to compile for Device

You could try the ARM clz instruction to replace bsr. I don't know of any good replacements for the other two.

Edit: OP clarified some context.

You need to get the Intel® 64 and IA-32 Architectures Software Developer's Manuals. They contain a full instruction set reference which will help you out. Even pseudocode for the bsf/bsr instructions is in there, and can be easily translated into their C equivalents:

int Bsf(uint32_t n)   // bit scan forward: index of the lowest set bit
{
    int m;

    for (m = 0; m < 32; m++)
    {
        if (n & (1u << m))
            return m;
    }

    return 32;   // no bit set
}

int Bsr(uint32_t n)   // bit scan reverse: index of the highest set bit
{
    int m;

    for (m = 31; m >= 0; m--)
    {
        if (n & (1u << m))
            return m;
    }

    return 32;   // no bit set
}
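In practice (assuming a GCC or Clang toolchain, which is what Xcode uses), the builtins __builtin_ctz and __builtin_clz are a simpler replacement than open-coded loops, and on ARM they map onto the clz instruction (ctz via rbit + clz):

#include <stdint.h>

// Sketch using compiler builtins; the zero case is handled explicitly
// because __builtin_ctz/__builtin_clz are undefined for an input of 0.
static inline int Bsf_builtin(uint32_t n) {
    return n ? __builtin_ctz(n) : 32;        // index of lowest set bit
}

static inline int Bsr_builtin(uint32_t n) {
    return n ? 31 - __builtin_clz(n) : 32;   // index of highest set bit
}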

The rdtsc instruction reads the processor's time-stamp counter, which is a 64-bit value incremented every clock cycle:

The processor monotonically increments the time-stamp counter MSR every clock cycle and resets it to 0 whenever the processor is reset.

You'll need to figure out why your program needs that information, and how best to translate it to your ARM case.
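If it turns out the program only needs elapsed-time measurements (an assumption; the OP would have to confirm the use-case), one common substitute on iOS is mach_absolute_time, a monotonic tick counter:

#include <mach/mach_time.h>
#include <stdint.h>

// Sketch: time a piece of work in nanoseconds instead of raw TSC cycles.
uint64_t time_work_ns(void (*work)(void)) {
    mach_timebase_info_data_t tb;
    mach_timebase_info(&tb);                 // numer/denom convert ticks to ns
    uint64_t start = mach_absolute_time();
    work();                                  // the code being timed
    uint64_t end = mach_absolute_time();
    return (end - start) * tb.numer / tb.denom;
}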

How do I convert 32-bit NEON assembly to 64-bit?

TL;DR: use intrinsics

It's not a bad idea to check the asm output to make sure it's not dumb, but using intrinsics lets compilers do constant-propagation, and schedule / software-pipeline for in-order cores.

If you read the comment thread on that post from 2009 you linked, you'd see that the bad code from NEON intrinsics was a gcc bug fixed in 2011.

Compilers are quite good at handling intrinsics these days, and continually improving. Clang especially can do quite a lot, like use different shuffle instructions than what you wrote with intrinsics.

At least they are for x86; compilers for ARM still sometimes struggle with intrinsics, especially when accessing the two 8-byte halves of a 16-byte vector, as you often want to do in 32-bit ARM code for horizontal operations. See ARM NEON intrinsics convert D (64-bit) register to low half of Q (128-bit) register, leaving upper half undefined and NEON intrinsic for sum of two subparts of a Q register: Jake Lee reports that as recently as 2018 some clang versions made a total mess of it, while GCC 6.x was not as bad.

This might not be as much of a problem with AArch64.
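As a concrete example of the kind of code that used to trip compilers up (my own sketch, not code from the linked questions), a horizontal byte sum on 32-bit ARM typically pulls the two d-register halves out of a q register:

#include <arm_neon.h>
#include <stdint.h>

// Sketch: sum all 16 bytes of a vector by splitting it into its two halves.
uint32_t hsum_u8x16(uint8x16_t v) {
    uint16x8_t sum16 = vaddl_u8(vget_low_u8(v), vget_high_u8(v)); // widening add of the halves
    uint32x4_t sum32 = vpaddlq_u16(sum16);                        // pairwise widening adds
    uint64x2_t sum64 = vpaddlq_u32(sum32);
    return (uint32_t)(vgetq_lane_u64(sum64, 0) + vgetq_lane_u64(sum64, 1));
}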



asm-level differences:

I'm not at all an expert on this, but one of the major NEON changes is that AArch64 has thirty-two 128-bit NEON registers (v0 through v31), instead of each q register aliasing onto two d halves.

See also some official ARM documentation about syntax for element-size, where you can use .16B to indicate a vector of 16 byte elements. (As opposed to the old syntax where .8 meant each element was 8 bits.)
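For instance, the byte-reverse used in the answer at the top of this page looks like this under the two syntaxes (the AArch64 line uses the standard GNU spelling; Clang's integrated assembler also accepts the rev32.16b v0, v0 form shown earlier):

    vrev32.8   q0, q0            @ 32-bit NEON: element size is a suffix on the mnemonic
    rev32      v0.16b, v0.16b    // AArch64: the .16B arrangement is written on the register operands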

ARM NEON compare operations generate negative one

This is normal for vector compare instructions; it lets you use the compare result as a mask with AND or XOR instructions, or for various other use cases.

You usually don't need a +1. If you want to count the number of elements that match, for example, just use a subtract instruction to subtract 0 or -1 from a vector accumulator.
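For example, a minimal intrinsics sketch of counting matches this way (the names and the 32-bit element size are my own choices):

#include <arm_neon.h>
#include <stdint.h>
#include <stddef.h>

// Count how many 32-bit elements of p[0..n) equal key, for n a multiple of 4.
uint32_t count_matches(const uint32_t *p, size_t n, uint32_t key) {
    uint32x4_t keyv = vdupq_n_u32(key);
    uint32x4_t acc  = vdupq_n_u32(0);
    for (size_t i = 0; i < n; i += 4) {
        uint32x4_t mask = vceqq_u32(vld1q_u32(p + i), keyv); // all-ones (-1) where equal
        acc = vsubq_u32(acc, mask);                          // acc -= -1  ==>  +1 per match
    }
    return vgetq_lane_u32(acc, 0) + vgetq_lane_u32(acc, 1)
         + vgetq_lane_u32(acc, 2) + vgetq_lane_u32(acc, 3);
}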


To get an integer +1, you could subtract the mask from 0, or logically right-shift by element-size minus 1 (e.g. a logical right shift by 31 leaves just the low bit, 0 or 1, with the rest of the bits all-zero). You could also AND with a vector of +1s that you created earlier.

I don't know which of these would be best for ARM, or if that would depend on the microarchitecture. (I really only know SIMD for x86 SSE/AVX.) I'm sure NEON can do at least one of the options I described, though.
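For what it's worth, each of those options is a single NEON intrinsic (a sketch; which one is fastest likely depends on the microarchitecture, as noted above):

#include <arm_neon.h>

// mask lanes are 0 or all-ones, e.g. the result of vceqq_u32
uint32x4_t one_by_subtract(uint32x4_t mask) { return vsubq_u32(vdupq_n_u32(0), mask); }  // 0 - (-1) = 1
uint32x4_t one_by_shift(uint32x4_t mask)    { return vshrq_n_u32(mask, 31); }            // keep only the low bit
uint32x4_t one_by_and(uint32x4_t mask)      { return vandq_u32(mask, vdupq_n_u32(1)); }  // AND with a vector of 1s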

Looping over arrays with inline assembly

Avoid inline asm whenever possible: https://gcc.gnu.org/wiki/DontUseInlineAsm. It blocks many optimizations. But if you really can't hand-hold the compiler into making the asm you want, you should probably write your whole loop in asm so you can unroll and tweak it manually, instead of doing stuff like this.


You can use an r constraint for the index. Use the q modifier to get the name of the 64-bit register, so you can use it in an addressing mode. When compiled for 32-bit targets, the q modifier selects the name of the 32-bit register, so the same code still works.

If you want to choose what kind of addressing mode is used, you'll need to do it yourself, using pointer operands with r constraints.

GNU C inline asm syntax doesn't assume that you read or write the memory pointed to by pointer operands (e.g. maybe you're using an inline-asm "and" instruction on the pointer value). So you need to do something with either a "memory" clobber or memory input/output operands to let it know what memory you modify. A "memory" clobber is easy, but forces everything except locals to be spilled/reloaded. See the Clobbers section in the docs for an example of using a dummy input operand instead.

Specifically, an "m" (*(const float (*)[]) fptr) operand will tell the compiler that the entire array object is an input, of arbitrary length, i.e. the asm can't be reordered with any stores that use fptr as part of the address (or that use the array it's known to point into). This also works with an "=m" or "+m" constraint (without the const, obviously).

Using a specific size like "m" (*(const float (*)[4]) fptr) lets you tell the compiler what you do and don't read (or write). Then it can (if otherwise permitted) sink a store to a later element past the asm statement, combine it with another store, or do dead-store elimination for stores that your inline asm doesn't read.

(See How can I indicate that the memory *pointed* to by an inline ASM argument may be used? for a whole Q&A about this.)


Another huge benefit of an m constraint is that -funroll-loops can work by generating addresses with constant offsets. Doing the addressing ourselves prevents the compiler from doing a single increment every 4 iterations or so, because every source-level value of i needs to appear in a register.


Here's my version, with some tweaks as noted in the comments. It is not optimal; for example, it can't be unrolled efficiently by the compiler.

#include <immintrin.h>
void add_asm1_memclobber(float *x, float *y, float *z, unsigned n) {
    __m128 vectmp;  // let the compiler choose a scratch register
    for (int i = 0; i < n; i += 4) {
        __asm__ __volatile__ (
            "movaps   (%[y],%q[idx],4), %[vectmp]\n\t"  // q modifier: 64bit version of a GP reg
            "addps    (%[x],%q[idx],4), %[vectmp]\n\t"
            "movaps   %[vectmp], (%[z],%q[idx],4)\n\t"
            : [vectmp] "=x" (vectmp)  // "=m" (z[i]) // gives worse code if the compiler prepares a reg we don't use
            : [z] "r" (z), [y] "r" (y), [x] "r" (x),
              [idx] "r" (i)           // unrolling is impossible this way (without an insn for every increment by 4)
            : "memory"                // you can avoid a "memory" clobber with dummy input/output operands
        );
    }
}

Godbolt compiler explorer asm output for this and a couple versions below.

Your version needs to declare %xmm0 as clobbered, or you will have a bad time when this is inlined. My version uses a temporary variable as an output-only operand that's never used. This gives the compiler full freedom for register allocation.

If you want to avoid the "memory" clobber, you can use dummy memory input/output operands like "m" (*(const __m128*)&x[i]) to tell the compiler which memory is read and written by your function. This is necessary to ensure correct code-generation if you did something like x[4] = 1.0; right before running that loop. (And even if you didn't write something that simple, inlining and constant propagation can boil it down to that.) And also to make sure the compiler doesn't read from z[] before the loop runs.

In this case, we get horrible results: gcc5.x actually increments 3 extra pointers because it decides to use [reg] addressing modes instead of indexed. It doesn't know that the inline asm never actually references those memory operands using the addressing mode created by the constraint!

# gcc5.4 with dummy constraints like "=m" (*(__m128*)&z[i]) instead of "memory" clobber
.L11:
    movaps   (%rsi,%rax,4), %xmm0   # y, i, vectmp
    addps    (%rdi,%rax,4), %xmm0   # x, i, vectmp
    movaps   %xmm0, (%rdx,%rax,4)   # vectmp, z, i

    addl     $4, %eax               #, i
    addq     $16, %r10              #, ivtmp.19
    addq     $16, %r9               #, ivtmp.21
    addq     $16, %r8               #, ivtmp.22
    cmpl     %eax, %ecx             # i, n
    ja       .L11                   #,

r8, r9, and r10 are the extra pointers that the inline asm block doesn't use.

You can use a constraint that tells gcc an entire array of arbitrary length is an input or an output: "m" (*(const char (*)[]) pStr). This casts the pointer to a pointer-to-array (of unspecified size). See How can I indicate that the memory *pointed* to by an inline ASM argument may be used?

If we want to use indexed addressing modes, we will have the base address of all three arrays in registers, and this form of constraint asks for the base address (of the whole array) as an operand, rather than a pointer to the current memory being operated on.

This actually works, without any extra pointer or counter increments inside the loop (it avoids a "memory" clobber, but is still not easily unrollable by the compiler):

#include <immintrin.h>
void add_asm1_dummy_whole_array(const float *restrict x, const float *restrict y,
                                float *restrict z, unsigned n) {
    __m128 vectmp;  // let the compiler choose a scratch register
    for (int i = 0; i < n; i += 4) {
        __asm__ __volatile__ (
            "movaps   (%[y],%q[idx],4), %[vectmp]\n\t"  // q modifier: 64bit version of a GP reg
            "addps    (%[x],%q[idx],4), %[vectmp]\n\t"
            "movaps   %[vectmp], (%[z],%q[idx],4)\n\t"
            : [vectmp] "=x" (vectmp)
            , "=m" (*(float (*)[]) z)       // "=m" (z[i]) // gives worse code if the compiler prepares a reg we don't use
            : [z] "r" (z), [y] "r" (y), [x] "r" (x),
              [idx] "r" (i)                 // unrolling is impossible this way (without an insn for every increment by 4)
            , "m" (*(const float (*)[]) x),
              "m" (*(const float (*)[]) y)  // pointer to unsized array = all memory from this pointer
        );
    }
}

This gives us the same inner loop we got with a "memory" clobber:

.L19:   # with clobbers like "m" (*(const struct {float a; float x[];} *) y)
    movaps   (%rsi,%rax,4), %xmm0   # y, i, vectmp
    addps    (%rdi,%rax,4), %xmm0   # x, i, vectmp
    movaps   %xmm0, (%rdx,%rax,4)   # vectmp, z, i

    addl     $4, %eax               #, i
    cmpl     %eax, %ecx             # i, n
    ja       .L19                   #,

It tells the compiler that each asm block reads or writes the entire arrays, which may unnecessarily stop it from interleaving the asm with other code (e.g. after fully unrolling with a low iteration count). It doesn't stop unrolling, but the requirement to have every index value in a register does make unrolling less effective. There's no way for this to end up with a 16(%rsi,%rax,4) addressing mode in a second copy of this block in the same loop, because we're hiding the addressing from the compiler.


A version with m constraints, that gcc can unroll:

#include <immintrin.h>
void add_asm1(float *x, float *y, float *z, unsigned n) {
    // x, y, z are assumed to be aligned
    __m128 vectmp;  // let the compiler choose a scratch register
    for (int i = 0; i < n; i += 4) {
        __asm__ __volatile__ (
            // "movaps %[yi], %[vectmp]\n\t"   // get the compiler to do this load instead
            "addps    %[xi], %[vectmp]\n\t"
            "movaps   %[vectmp], %[zi]\n\t"
            // __m128 is a may_alias type so these casts are safe.
            : [vectmp] "=x" (vectmp)          // let the compiler pick a scratch reg
            , [zi] "=m" (*(__m128*)&z[i])     // actual memory output for the movaps store
            : [yi] "0" (*(__m128*)&y[i])      // or [yi] "xm" (*(__m128*)&y[i]), and uncomment the movaps load
            , [xi] "xm" (*(__m128*)&x[i])
            //, [idx] "r" (i)                 // unrolling with this would need an insn for every increment by 4
        );
    }
}

Using [yi] as a "+x" input/output operand would be simpler, but writing it this way makes it a smaller change to uncomment the load in the inline asm, instead of letting the compiler get one value into registers for us.
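For reference, here is a sketch of that simpler "+x" form (my rewrite, not a version from the answer), where the compiler performs the load of y[i] and the vector operand is read/write:

#include <immintrin.h>
void add_asm1_plus_x(float *x, float *y, float *z, unsigned n) {
    // x, y, z are assumed to be aligned, as in the versions above
    for (unsigned i = 0; i < n; i += 4) {
        __m128 tmp = *(__m128*)&y[i];          // compiler-generated load of y[i]
        __asm__ __volatile__ (
            "addps    %[xi], %[vectmp]\n\t"
            "movaps   %[vectmp], %[zi]\n\t"
            : [vectmp] "+x" (tmp),             // read/write register operand
              [zi] "=m" (*(__m128*)&z[i])      // memory output for the store
            : [xi] "xm" (*(__m128*)&x[i])
        );
    }
}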


