C++ Tips for Code Optimization on Arm Devices

C++ Tips for code optimization on ARM devices

To answer your question about general rules when optimizing C++ code for ARM, here are a few suggestions:

1) As you mentioned, most ARM cores have no hardware divide instruction. Use logical shifts or multiply by the reciprocal when possible.

2) Memory is much slower than CPU execution; where a value can be computed with a few logical operations, prefer that over a small lookup table.

3) Try to write 32 bits at a time to make best use of the write buffer. Writing shorts or chars will slow the code down considerably. In other words, it's faster to logical-OR the smaller pieces together and write them as 32-bit words (see the first sketch after this list).

4) Be aware of your L1/L2 cache size. As a general rule, ARM chips have much smaller caches than Intel.

5) Use SIMD (NEON) when possible. NEON instructions are quite powerful and, for "vectorizable" code, can be very fast. NEON intrinsics are available in most C++ environments and can be nearly as fast as hand-tuned ASM code.

6) Use the cache prefetch hint (PLD) to speed up looping reads; most ARM cores don't have the aggressive hardware prefetch logic that modern Intel chips do (see the second sketch after this list).

7) Don't trust the compiler to generate good code. Look at the ASM output and rewrite hotspots in ASM. For bit/byte manipulation, the C language can't express things as efficiently as they can be done in ASM. ARM has powerful 3-operand instructions, load/store-multiple instructions, and "free" shifts folded into other operations that can outperform what the compiler is capable of generating.
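
To illustrate tip 3, here is a minimal sketch (the helper name and the little-endian assumption are mine, not from the original answer) of combining four bytes into one 32-bit store. memcpy is used for the word write so the code stays free of aliasing problems; compilers turn it into a single str:

#include <stdint.h>
#include <string.h>

/* Hypothetical helper: pack four bytes and store them with one 32-bit
   write instead of four byte writes. Assumes a little-endian target. */
static inline void store_word(uint8_t *dst, uint8_t b0, uint8_t b1,
                              uint8_t b2, uint8_t b3)
{
    uint32_t w = (uint32_t)b0
               | ((uint32_t)b1 << 8)
               | ((uint32_t)b2 << 16)
               | ((uint32_t)b3 << 24);
    memcpy(dst, &w, sizeof w);   /* compiles to a single str on ARM */
}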
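
And for tip 6, a sketch of issuing the prefetch hint from C: gcc and clang expose PLD through __builtin_prefetch. The distance of 16 elements ahead is an assumption; tune it to your cache-line size and loop body. Prefetching past the end of the array is harmless, since PLD is only a hint and never faults:

#include <stddef.h>
#include <stdint.h>

uint32_t sum_with_prefetch(const uint32_t *src, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(&src[i + 16]);   /* emits PLD on ARM */
        sum += src[i];
    }
    return sum;
}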

Arm loop code optimization for calculating mean and standard deviation

That loop is a perfect candidate for NEON optimization. You can fit your 8 unsigned integers into a single NEON register. There is no "sum all elements of a vector" instruction, but you can use the pairwise add to compute the sum of the 8 elements in 3 steps. Since we can't see the rest of your application, it's hard to know what the big picture is, but NEON is your best bet for improving the speed. All recent Apple products support NEON instructions, and in Xcode you can mix the NEON intrinsics with your C++ code. A sketch of the pairwise-sum sequence follows.
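
As a minimal sketch (assuming the eight values are 16-bit and already loaded into a single 128-bit register as a uint16x8_t; the function name is mine), the three pairwise steps look like this:

#include <arm_neon.h>

/* Sum eight 16-bit values in one NEON register using three
   pairwise-add steps: 8 lanes -> 4 -> 2 -> 1. */
static inline uint32_t sum8_u16(uint16x8_t v)
{
    uint32x4_t s4 = vpaddlq_u16(v);           /* 8 -> 4 widened sums */
    uint64x2_t s2 = vpaddlq_u32(s4);          /* 4 -> 2 widened sums */
    return (uint32_t)(vgetq_lane_u64(s2, 0)   /* 2 -> 1 scalar */
                    + vgetq_lane_u64(s2, 1));
}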

Micro-optimizing C code for ARM

unsigned int fun1 ( unsigned int a, unsigned int b )
{
    return(a/b);
}
unsigned int fun2 ( unsigned int a )
{
    return(a/2);
}
unsigned int fun3 ( unsigned int a )
{
    return(a/3);
}
unsigned int fun10 ( unsigned int a )
{
    return(a/10);
}
unsigned int fun13 ( void )
{
    return(10/13);
}

and just try it.

00000000 <fun1>:
0: e92d4010 push {r4, lr}
4: ebfffffe bl 0 <__aeabi_uidiv>
8: e8bd4010 pop {r4, lr}
c: e12fff1e bx lr

00000010 <fun2>:
10: e1a000a0 lsr r0, r0, #1
14: e12fff1e bx lr

00000018 <fun3>:
18: e59f3008 ldr r3, [pc, #8] ; 28 <fun3+0x10>
1c: e0802093 umull r2, r0, r3, r0
20: e1a000a0 lsr r0, r0, #1
24: e12fff1e bx lr
28: aaaaaaab bge feaaaadc <fun13+0xfeaaaa9c>

0000002c <fun10>:
2c: e59f3008 ldr r3, [pc, #8] ; 3c <fun10+0x10>
30: e0802093 umull r2, r0, r3, r0
34: e1a001a0 lsr r0, r0, #3
38: e12fff1e bx lr
3c: cccccccd stclgt 12, cr12, [r12], {205} ; 0xcd

00000040 <fun13>:
40: e3a00000 mov r0, #0
44: e12fff1e bx lr

As one would expect, if the compiler can't deal with the division at compile time it calls the appropriate library function, which is the root of the performance issue. If you don't have a native divide instruction you end up with many instructions executed, plus all of their fetches. 10 to 100 times slower sounds about right.

Interestingly, they do use the multiply-by-reciprocal trick for 1/3 and 1/10 here, and if the result can be computed at compile time they just return the fixed result.
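
For reference (this explanation is mine, not part of the original answer), the literal-pool constants in the listings above are reciprocals rounded up: 0xAAAAAAAB = ceil(2^33 / 3) and 0xCCCCCCCD = ceil(2^35 / 10). UMULL produces the full 64-bit product, taking the high word is a free shift right by 32, and the trailing lsr finishes the shift. A C model:

#include <stdint.h>

/* 0xAAAAAAAB = ceil(2^33 / 3): (a * 0xAAAAAAAB) >> 33 == a / 3 */
uint32_t div3(uint32_t a)  { return (uint32_t)((a * 0xAAAAAAABULL) >> 33); }

/* 0xCCCCCCCD = ceil(2^35 / 10): (a * 0xCCCCCCCD) >> 35 == a / 10 */
uint32_t div10(uint32_t a) { return (uint32_t)((a * 0xCCCCCCCDULL) >> 35); }

A brute-force loop over all 2^32 inputs confirms both, which matches the experiment mentioned below.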

Compiler authors can read the same Hacker's Delight and Stack Overflow pages we can, know the same tricks, and, if willing and interested, can implement those optimizations. Don't assume they always will; just because some version of some compiler finds these doesn't mean all compilers can or will.

As for whether you should let the compiler/toolchain do it for you: that's up to you. Even if you have a divide instruction, if you target multiple platforms you may choose to shift right instead of dividing by 2, or to use other such tricks. If you own the divide, you at least know what it is doing; if you hand it over to the compiler, you have to disassemble regularly to understand what it is doing (if you care). If this is in a timing-critical section, you may wish to do both: see what the compiler does, then steal that answer or create your own deterministic solution (leaving it up to the compiler is not necessarily deterministic, and I think that is the point).

EDIT

arm-none-eabi-gcc -O2 -c so.c -o so.o
arm-none-eabi-objdump -D so.o

arm-none-eabi-gcc --version
arm-none-eabi-gcc (GCC) 6.3.0
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I have a gcc 4.8.3 here that also produced those optimizations, as well as a 5.4.0, so they have been doing it for a while.

The ARM UMULL instruction is a 64-bit = 32-bit * 32-bit operation, so the multiply can't overflow. It certainly works for 1/3 and 1/10; I'm not sure how large an N you can go for 1/N and still have every 32-bit operand work within 64 bits. A simple experiment shows that, at least for these two cases, all possible 32-bit patterns work (for unsigned).

It appears to use the trick for signed as well:

int negfun ( int a )
{
    return(a/3);
}
00000000 <negfun>:
0: e59f3008 ldr r3, [pc, #8] ; 10 <negfun+0x10>
4: e0c32390 smull r2, r3, r0, r3
8: e0430fc0 sub r0, r3, r0, asr #31
c: e12fff1e bx lr
10: 55555556 ldrbpl r5, [r5, #-1366] ; 0xfffffaaa

Understanding the Optimization of C code on ARM platform using GCC Compiler

When using gcc to write code for embedded systems, it is important to note that, unlike many embedded-systems compilers which allow storage to be written as one type and read as another, and which treat integer overflow in an at least somewhat predictable fashion, gcc is apt to break code which relies upon such behaviors unless it is compiled with the -fno-strict-aliasing and -fwrapv flags. For example, while the authors of the C Standard would have expected that the function

// Assume usmall is an unsigned type half the size of unsigned int,
// e.g. typedef uint16_t usmall; with a 32-bit int
unsigned multiply(usmall x, usmall y) { return x*y; }

should be safe on two's-complement hardware platforms with silent wraparound on overflow, they didn't require compilers to implement it that way (I think they probably didn't expect that anyone writing a compiler for such a platform would be so obtuse as to do otherwise unless emulating some other platform). When compiled with gcc, however, that function may have unexpected side effects, because both operands are promoted to signed int and the multiplication can overflow.
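
A concrete illustration (the numbers are mine, assuming 16-bit usmall and 32-bit int), together with the conventional defensive fix of forcing the multiplication to be unsigned:

#include <stdint.h>

typedef uint16_t usmall;

/* multiply(0xFFFF, 0xFFFF): both operands promote to signed int, and the
   mathematical product 0xFFFE0001 exceeds INT_MAX -- undefined behavior. */
unsigned multiply(usmall x, usmall y) { return x*y; }

/* Forcing the arithmetic into unsigned int makes the wraparound defined. */
unsigned multiply_safe(usmall x, usmall y) { return 1u*x*y; }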

Likewise, on many compilers, given e.g.

struct widget_header {uint16_t length; uint8_t type_id;};
struct acme_widget {uint16_t length; uint8_t type_id; uint8_t dat[5];};
struct beta_widget {uint16_t length; uint8_t type_id; uint32_t foo;};

a pointer to any of those types could be cast to widget_header; code could then look at the type_id field and cast to a more specific type. Such techniques will not always work on gcc, however; even if a union declaration containing all three types is in scope, gcc will assume that an access to a field of one of those types cannot possibly modify the corresponding field in any other.
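
For instance, the idiom in question looks like this (this sketch is mine; handle_widget and ACME_TYPE_ID are hypothetical names, and the struct definitions above are assumed to be in scope):

enum { ACME_TYPE_ID = 1 };   /* hypothetical type tag */

/* Dispatch on the common initial header, then cast to the specific type. */
void handle_widget(void *w)
{
    struct widget_header *h = (struct widget_header *)w;
    if (h->type_id == ACME_TYPE_ID) {
        struct acme_widget *a = (struct acme_widget *)w;
        a->dat[0] = 0;   /* operate on the acme-specific fields */
    }
}

Many embedded compilers accept this; gcc's type-based alias analysis may not, as the next example shows.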

A more concrete example to show how gcc treats aliasing:

struct s1 { int x; };
struct s2 { int x; };
union { struct s1 v1; struct s2 v2; } u;

static int peek(void *p)
{
    struct s1 *p1 = (struct s1*)p;
    return *(int*)&p1->x;
}

static void setToFive(void *p)
{
    struct s2 *p2 = (struct s2*)p;
    *(int*)(&p2->x) = 5;
}

static int test1a(void *p, void *q)
{
    struct s1 *p1 = (struct s1*)p;
    if (peek(p)!=23) return 0;
    setToFive(q);
    return peek(p);
}

int test1(void)
{
    struct s2 v2 = {23};
    u.v2 = v2;
    return test1a(&u.v1, &u.v2);
}

ARM gcc 4.8.2 generates

test1:
movw r3, #:lower16:u
movt r3, #:upper16:u
movs r2, #5
movs r0, #23
str r2, [r3]
bx lr

which stores a 5 into "u" and then returns 23 (assuming the second call to peek will return the same value as the first one, notwithstanding all of the pointer typecasts, which should give the compiler a pretty clear indication that something might possibly be aliased somewhere).

SRAM usage optimization in ARM devices

Bus size does not matter in this case; memory usage will be the same.

Some Cortex cores do not allow unaligned access. What is unaligned access? An unaligned memory access occurs when you try to access (as a single operation) N bytes of data starting from an address that is not evenly divisible by N (i.e. addr % N != 0). In our case N can be 1, 2, or 4.

Your example should be analyzed with optimizations turned on.

#define BUFFER_SIZE         ((uint8_t)0x20)

uint8_t aTxBuffer[BUFFER_SIZE];

void init(uint8_t x)
{
    for(int i=0; i<BUFFER_SIZE; i++)
    {
        aTxBuffer[i]=x;
    }
}

The STM32F0, which does not allow unaligned access, has to store the data byte by byte:

init:
ldr r3, .L5
movs r2, r3
adds r2, r2, #32
.L2:
strb r0, [r3]
adds r3, r3, #1
cmp r3, r2
bne .L2
bx lr
.L5:
.word aTxBuffer

but the STM32F4 can store full 32-bit words (4 bytes) at a time, using fewer operations:

init:
movs r3, #0
bfi r3, r0, #0, #8
bfi r3, r0, #8, #8
ldr r2, .L3
bfi r3, r0, #16, #8
bfi r3, r0, #24, #8
str r3, [r2] @ unaligned
str r3, [r2, #4] @ unaligned
str r3, [r2, #8] @ unaligned
str r3, [r2, #12] @ unaligned
str r3, [r2, #16] @ unaligned
str r3, [r2, #20] @ unaligned
str r3, [r2, #24] @ unaligned
str r3, [r2, #28] @ unaligned
bx lr
.L3:
.word aTxBuffer

The SRAM consumption is exactly the same in both cases.

ARM GCC removing required code during optimization

In this code:

lum = ((RSCALE * r) + (GSCALE * g) + (BSCALE * b));
if(lum > 0x7FFFFFFF)

RSCALE, GSCALE, and BSCALE are 5014709, 9848225, and 1912602, respectively. Assuming int is 32 bits in the C implementation being used, these are all int constants.

r, g, and b are all of type uint8_t, so they are promoted to int in the multiplications.

Then ((RSCALE * r) + (GSCALE * g) + (BSCALE * b)) is a calculation entirely with int operands producing an int result. So the compiler can see that lum is assigned the value of an int result, and it is entitled to assume the result is in the range INT_MIN to INT_MAX. Further, it can see all operands are nonnegative, and therefore negative results are not possible, reducing the possible range to 0 to INT_MAX. This excludes the possibility that assigning a negative value to a uint32_t will cause wrapping to a high value. So the compiler may assume lum > 0x7FFFFFFF is never true.

The calculation may overflow int; that is undefined behavior, so the compiler is allowed to rely on the assumption even when overflow occurs.

To correct this, change at least one operand of each multiplication to unsigned.
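
A minimal sketch of that fix, reusing the names from the question (the cast placement is one of several equivalent options):

/* Casting one operand of each multiplication makes all the arithmetic
   unsigned: well-defined on overflow, so the comparison is meaningful. */
lum = ((uint32_t)RSCALE * r) + ((uint32_t)GSCALE * g) + ((uint32_t)BSCALE * b);
if (lum > 0x7FFFFFFF)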

How to maximize application performance for the ARM big.LITTLE architecture - MPI

Your problem is twofold.

First, there are cores with varying instruction sets. Most MPI implementations provide an easy solution for that by allowing you to run jobs from more than one executable. You simply need to compile the code twice with core-specific optimisations in order to produce two executable files. Let's call them prog.big (optimised for the big cores) and prog.little (optimised for the LITTLE cores). Then, instead of launching 6 ranks from a generic executable with mpiexec -n 6 ./prog, you launch 4 ranks from prog.big and 2 ranks from prog.little:

mpiexec -n 4 ./prog.big : -n 2 ./prog.little

That's not enough though. You need to place the right process on the right core. Doing so is very implementation-specific. In the simplest case, you can tell MPI to pin/bind each MPI rank to a single logical CPU and do so in a linear fashion, i.e., rank 0 gets bound to core 0, rank 1 to core 1, etc. and hope that the OS will map the big cores to logical CPUs 0 to 3 and the LITTLE cores to logical CPUs 4 and 5. If that is not the case, you may need to perform some additional acrobatics. For example, Open MPI allows you to specify a rankfile with --rankfile filename, in which you can provide a rank to CPU mapping:

rank 0=localhost slot=0
rank 1=localhost slot=1
rank 2=localhost slot=2
...

Having optimised executable files and properly placed processes is only half of the solution. The rest is to actually have a parallel algorithm that can make use of CPUs with different speeds. If you have a globally synchronous algorithm, for example one solving PDEs, or anything iterative in general, then the computation time of a single step is that of the slowest MPI rank. If you give the same amount of work to the big and to the LITTLE cores, the latter will lag significantly and the former will have to wait, wasting computational time. So you need to either perform some advanced domain decomposition and give smaller work items to the slower cores, or use an approach such as "bag of work" (a.k.a. controller/worker) and have each worker rank request a piece of data to work on. In this case, faster cores will process more items and the work balances itself automatically; a sketch of the pattern follows.
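
Here is a minimal sketch of the controller/worker ("bag of work") pattern in MPI. All names, tags, and the item granularity are illustrative, not from the original answer:

#include <mpi.h>

#define NITEMS   1000   /* illustrative number of work items */
#define TAG_REQ  1      /* worker asks for work */
#define TAG_WORK 2      /* controller sends an item index */
#define TAG_STOP 3      /* controller tells the worker to quit */

static void work(int item) { (void)item; /* placeholder computation */ }

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {   /* controller: hand out items on demand */
        int next = 0, stopped = 0, dummy;
        MPI_Status st;
        while (stopped < size - 1) {
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQ,
                     MPI_COMM_WORLD, &st);
            if (next < NITEMS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                stopped++;
            }
        }
    } else {           /* worker: faster cores simply come back for more */
        int item = 0;
        MPI_Status st;
        for (;;) {
            MPI_Send(&item, 1, MPI_INT, 0, TAG_REQ, MPI_COMM_WORLD);
            MPI_Recv(&item, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            work(item);
        }
    }
    MPI_Finalize();
    return 0;
}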

C code optimization

Unfortunately, your first step was already kind of wrong.

Don't blindly walk through your code optimizing arbitrary loops which might or (more probably) might not affect performance (maybe because that code is called so rarely that it doesn't really use any run-time).

Optimizing means first finding out where the time is spent in your program. Use timing measurements to narrow down where your program spends most of its time (you can use homegrown logging with microsecond timers, or a profiling application, for that; a sketch follows). Without at least some figures you won't even see where the compiler has already helped you and maybe has already maxed out all possibilities, even if your code looks like it has some potential left for being faster (modern compilers are really very good at that).
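
For instance, a minimal homegrown microsecond timer on POSIX systems (the function name is mine; CLOCK_MONOTONIC is used so clock adjustments don't skew the measurement):

#include <stdio.h>
#include <time.h>

/* Microsecond timestamp from the monotonic clock. */
static long long usec_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

int main(void)
{
    long long t0 = usec_now();
    /* ... code under test ... */
    long long t1 = usec_now();
    printf("elapsed: %lld us\n", t1 - t0);
    return 0;
}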

Only once you know the hot spots in your application should you start optimizing them.


