Coding Practices Which Enable the Compiler/Optimizer to Make a Faster Program

Write to local variables and not output arguments! This can be a huge help for getting around aliasing slowdowns. For example, if your code looks like

void DoSomething(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
    for (int i = 0; i < numFoo; i++)
    {
        barOut.munge(foo1, foo2[i]);
    }
}

the compiler doesn't know that foo1 != barOut, and thus has to reload foo1 each time through the loop. It also can't read foo2[i] until the write to barOut is finished. You could start messing around with restricted pointers, but it's just as effective (and much clearer) to do this:

void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
    Foo barTemp = barOut;
    for (int i = 0; i < numFoo; i++)
    {
        barTemp.munge(foo1, foo2[i]);
    }
    barOut = barTemp;
}

It sounds silly, but the compiler can be much smarter dealing with the local variable, since it can't possibly overlap in memory with any of the arguments. This can help you avoid the dreaded load-hit-store (mentioned by Francis Boivin in this thread).

What kinds of C optimization practices in x86 that were historically recommended are no longer effective?

Of the optimizations which are commonly recommended, a couple which are basically never fruitful given modern compilers include:

Mathematical transformations

Modern compilers understand mathematics, and will perform transformations on mathematical expressions where appropriate.

Optimizations such as conversion of multiplication to addition, or constant multiplication or division to bit shifting, are already performed by modern compilers, even at low optimization levels. Examples of these optimizations include:

x * 2  ->  x + x
x * 2 -> x << 1

Note that some specific cases may differ. For instance, x >> 1 is not the same as x / 2; it is not appropriate to substitute one for the other!

Additionally, many of these suggested optimizations aren't actually any faster than the code they replace.

Stupid code tricks

I'm not even sure what to call this, but tricks like XOR swapping (a ^= b; b ^= a; a ^= b;) are not optimizations at all. They're just party tricks — they are slower, and more fragile, than the obvious approach. Don't use them.

The register keyword

This keyword is ignored by many modern compilers, as its intended meaning (forcing a variable to be stored in a register) is not meaningful given current register allocation algorithms.

Code transformations

Compilers will automatically perform a wide variety of code transformations where appropriate. Several such transformations which are frequently recommended for manual application, but which are rarely useful when applied thus, include:

  • Loop unrolling. (This is often actually harmful when applied indiscriminately, as it bloats code size.)
  • Function inlining. (Tag a function as static, and it'll usually be inlined where appropriate when optimization is enabled.)

GCC optimization trick, does it really work?

The example given in that answer was not a very good one, because it calls an unknown function the compiler cannot reason about. Here's a better example:

void FillOneA(int *array, int length, int& startIndex)
{
    for (int i = 0; i < length; i++) array[startIndex + i] = 1;
}

void FillOneB(int *array, int length, int& startIndex)
{
    int localIndex = startIndex;
    for (int i = 0; i < length; i++) array[localIndex + i] = 1;
}

The first version optimizes poorly because it needs to protect against the possibility that somebody called it as

int array[10] = { 0 };
FillOneA(array, 5, array[1]);

resulting in { 1, 1, 0, 1, 1, 1, 0, 0, 0, 0 }, since the iteration with i=1 modifies the startIndex parameter.

The second one doesn't need to worry about the possibility that array[localIndex + i] = 1 will modify localIndex, because localIndex is a local variable whose address has never been taken.

In assembly (Intel notation, because that's what I use):

FillOneA:
    mov  edx, [esp+8]
    xor  eax, eax
    test edx, edx
    jle  $b
    push esi
    mov  esi, [esp+16]
    push edi
    mov  edi, [esp+12]
$a: mov  ecx, [esi]
    add  ecx, eax
    inc  eax
    mov  [edi+ecx*4], 1
    cmp  eax, edx
    jl   $a
    pop  edi
    pop  esi
$b: ret

FillOneB:
    mov  ecx, [esp+8]
    mov  eax, [esp+12]
    mov  edx, [eax]
    test ecx, ecx
    jle  $a
    mov  eax, [esp+4]
    push edi
    lea  edi, [eax+edx*4]
    mov  eax, 1
    rep  stosd
    pop  edi
$a: ret

ADDED: Here's an example where the compiler's insight is into Bar, and not munge:

class Bar
{
public:
    float getValue() const
    {
        return valueBase * boost;
    }

private:
    float valueBase;
    float boost;
};

class Foo
{
public:
    void munge(float adjustment);
};

void Adjust10A(Foo& foo, const Bar& bar)
{
    for (int i = 0; i < 10; i++)
        foo.munge(bar.getValue());
}

void Adjust10B(Foo& foo, const Bar& bar)
{
    Bar localBar = bar;
    for (int i = 0; i < 10; i++)
        foo.munge(localBar.getValue());
}

The resulting code is

Adjust10A:
    push ecx
    push ebx
    mov  ebx, [esp+12]  ;; foo
    push esi
    mov  esi, [esp+20]  ;; bar
    push edi
    mov  edi, 10
$a: fld  [esi+4]        ;; bar.valueBase
    push ecx
    fmul [esi]          ;; valueBase * boost
    mov  ecx, ebx
    fstp [esp+16]
    fld  [esp+16]
    fstp [esp]
    call Foo::munge
    dec  edi
    jne  $a
    pop  edi
    pop  esi
    pop  ebx
    pop  ecx
    ret  0

Adjust10B:
    sub  esp, 8
    mov  ecx, [esp+16]  ;; bar
    mov  eax, [ecx]     ;; bar.valueBase
    mov  [esp], eax     ;; localBar.valueBase
    fld  [esp]          ;; localBar.valueBase
    mov  eax, [ecx+4]   ;; bar.boost
    mov  [esp+4], eax   ;; localBar.boost
    fmul [esp+4]        ;; localBar.getValue()
    push esi
    push edi
    mov  edi, [esp+20]  ;; foo
    fstp [esp+24]
    fld  [esp+24]       ;; cache localBar.getValue()
    mov  esi, 10        ;; loop counter
$a: push ecx
    mov  ecx, edi       ;; foo
    fstp [esp]          ;; use cached value
    call Foo::munge
    fld  [esp]
    dec  esi
    jne  $a             ;; loop
    pop  edi
    fstp ST(0)
    pop  esi
    add  esp, 8
    ret  0

Observe that the inner loop in Adjust10A must recalculate the value every iteration, since it has to protect against the possibility that foo.munge changed bar.

That said, this style of optimization is not a slam dunk. (For example, we could've gotten the same effect by manually caching bar.getValue() into localValue.) It tends to be most helpful for vectorized operations, since those can be parallelized.

Why using pointers (with low optimization) makes the program faster?

In general: there is no reason why using pointers would make the program run faster. Discussing performance of programs without all optimizations enabled, like the course creator did in your quote, is not meaningful. It is certainly not a reason to change the way you write code.

Another old, often used but obsolete trick is to write such loops counting down instead of up, since compares against zero are often faster than compares against other values. But this too should not affect the way you write code, because a modern compiler can do that optimization for you.

What programmers should do, and what people writing courses should teach, is to write code as plainly and readably as possible. By that measure both of your examples are bad: they are needlessly obscure, and examples of "premature optimization". Better code would be:

int counter;
...
for (counter = 0; counter < 6; counter++)
{
}

This is about as readable as code gets and there is no reason to believe that the above would perform worse than your examples on any known system.

Do this:

  • Write the most readable code you can.
  • In release, enable optimizations.
  • If there are performance problems, benchmark and find bottlenecks.
  • Manually optimize the bottlenecks if needed. Possibly with a specific system in mind.

Ways to make a D program faster

  • Compiler flags: I would add -mcpu=native if the program will be running on your machine. Not sure what effect -Os has in addition to -O3.

  • Profiling has been mentioned in comments. Personally, under Linux I have a script which dumps a process's stack trace; I run it a few times to get an idea of where the program is getting hung up.

  • Not sure what you mean by GS.

  • Since you mentioned classes: in D, methods are virtual by default; virtual methods add indirections and are not inlineable. Make sure only those methods that must be virtual are. See if you can rewrite your program using a form of polymorphism that doesn't involve indirections, such as using template metaprogramming.

  • Since you mentioned associative arrays: these make heavy use of the GC; to speed them up, switch to a third-party library that works on top of std.allocator, such as https://github.com/dlang-community/containers

  • If some parts of your code are parallelizable, std.parallelism is a good tool for this.

  • Since you mentioned that the project is an interpreter: there are many avenues for optimizing them, up to JIT/AOT compilation. Perhaps you could link to an existing library such as LLVM or libjit.

Multiple methods of taking an address

There are no other ways to take the address[*]. The text "using the & operator" in that advice is redundant. I suppose it's just there to remind you which operator is the address-of operator, in case you weren't clear what you're supposed to avoid using.

The optimizations that it's talking about are:

1) if you don't take the address, then the variable doesn't need to have an address.

2) if the address never "escapes" code that the compiler can see, then the compiler can assume the value doesn't change via code that it can't see. Not taking the address ensures this, but in some cases compilers can do escape analysis even when the address is taken.

It really doesn't matter for either of these purposes how the address is taken, so if there were another way to take the address in C, then the advice would have to mention it alongside the & operator.

[*] [Edit: aha! I'm slightly wrong. If the variable is an array, then you can take its address without using the & operator: just let it decay to a pointer-to-first-element. I don't know whether or not the IAR compiler actually optimizes sufficiently small arrays into registers; if not, there's no need to mention it. But it's certainly allowed to.]


