Coding Practices Which Enable the Compiler/Optimizer to Make a Faster Program

Write to local variables and not output arguments! This can be a huge help for getting around aliasing slowdowns. For example, if your code looks like

void DoSomething(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
    for (int i = 0; i < numFoo; i++)
    {
        barOut.munge(foo1, foo2[i]);
    }
}

the compiler doesn't know that foo1 != barOut, and thus has to reload foo1 each time through the loop. It also can't read foo2[i] until the write to barOut is finished. You could start messing around with restricted pointers, but it's just as effective (and much clearer) to do this:

void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
    Foo barTemp = barOut;
    for (int i = 0; i < numFoo; i++)
    {
        barTemp.munge(foo1, foo2[i]);
    }
    barOut = barTemp;
}

It sounds silly, but the compiler can be much smarter dealing with the local variable, since it can't possibly overlap in memory with any of the arguments. This can help you avoid the dreaded load-hit-store (mentioned by Francis Boivin in this thread).

What kinds of C optimization practices in x86 that were historically recommended are no longer effective?

Of the optimizations which are commonly recommended, a couple which are basically never fruitful given modern compilers include:

Mathematical transformations

Modern compilers understand mathematics, and will perform transformations on mathematical expressions where appropriate.

Optimizations such as conversion of multiplication to addition, or constant multiplication or division to bit shifting, are already performed by modern compilers, even at low optimization levels. Examples of these optimizations include:

x * 2  ->  x + x
x * 2 -> x << 1

Note that some specific cases may differ. For instance, x >> 1 is not the same as x / 2; it is not appropriate to substitute one for the other!

Additionally, many of these suggested optimizations aren't actually any faster than the code they replace.

Stupid code tricks

I'm not even sure what to call this, but tricks like XOR swapping (a ^= b; b ^= a; a ^= b;) are not optimizations at all. They're just party tricks — they are slower, and more fragile, than the obvious approach. Don't use them.

The register keyword

This keyword is ignored by many modern compilers, as its intended meaning (forcing a variable to be stored in a register) is not meaningful given current register allocation algorithms.

Code transformations

Compilers will automatically perform a wide variety of code transformations where appropriate. Several such transformations which are frequently recommended for manual application, but which are rarely useful when applied thus, include:

  • Loop unrolling. (This is often actually harmful when applied indiscriminately, as it bloats code size.)
  • Function inlining. (Tag a function as static, and it'll usually be inlined where appropriate when optimization is enabled.)

GCC optimization trick, does it really work?

The example given in that answer was not a very good one, because it calls an unknown function the compiler cannot reason about. Here's a better example:

void FillOneA(int *array, int length, int& startIndex)
{
    for (int i = 0; i < length; i++) array[startIndex + i] = 1;
}

void FillOneB(int *array, int length, int& startIndex)
{
    int localIndex = startIndex;
    for (int i = 0; i < length; i++) array[localIndex + i] = 1;
}

The first version optimizes poorly because it needs to protect against the possibility that somebody called it as

int array[10] = { 0 };
FillOneA(array, 5, array[1]);

resulting in { 1, 1, 0, 1, 1, 1, 0, 0, 0, 0 }, since the iteration with i=1 modifies the startIndex parameter.

The second one doesn't need to worry about the possibility that array[localIndex + i] = 1 will modify localIndex, because localIndex is a local variable whose address has never been taken.

In assembly (Intel notation, because that's what I use):

FillOneA:
    mov  edx, [esp+8]
    xor  eax, eax
    test edx, edx
    jle  $b
    push esi
    mov  esi, [esp+16]
    push edi
    mov  edi, [esp+12]
$a: mov  ecx, [esi]
    add  ecx, eax
    inc  eax
    mov  [edi+ecx*4], 1
    cmp  eax, edx
    jl   $a
    pop  edi
    pop  esi
$b: ret

FillOneB:
    mov  ecx, [esp+8]
    mov  eax, [esp+12]
    mov  edx, [eax]
    test ecx, ecx
    jle  $a
    mov  eax, [esp+4]
    push edi
    lea  edi, [eax+edx*4]
    mov  eax, 1
    rep  stosd
    pop  edi
$a: ret

ADDED: Here's an example where the compiler's insight is into Bar, and not munge:

class Bar
{
public:
    float getValue() const
    {
        return valueBase * boost;
    }

private:
    float valueBase;
    float boost;
};

class Foo
{
public:
    void munge(float adjustment);
};

void Adjust10A(Foo& foo, const Bar& bar)
{
    for (int i = 0; i < 10; i++)
        foo.munge(bar.getValue());
}

void Adjust10B(Foo& foo, const Bar& bar)
{
    Bar localBar = bar;
    for (int i = 0; i < 10; i++)
        foo.munge(localBar.getValue());
}

The resulting code is

Adjust10A:
    push ecx
    push ebx
    mov  ebx, [esp+12]  ;; foo
    push esi
    mov  esi, [esp+20]  ;; bar
    push edi
    mov  edi, 10
$a: fld  [esi+4]        ;; bar.valueBase
    push ecx
    fmul [esi]          ;; valueBase * boost
    mov  ecx, ebx
    fstp [esp+16]
    fld  [esp+16]
    fstp [esp]
    call Foo::munge
    dec  edi
    jne  $a
    pop  edi
    pop  esi
    pop  ebx
    pop  ecx
    ret  0

Adjust10B:
    sub  esp, 8
    mov  ecx, [esp+16]  ;; bar
    mov  eax, [ecx]     ;; bar.valueBase
    mov  [esp], eax     ;; localBar.valueBase
    fld  [esp]          ;; localBar.valueBase
    mov  eax, [ecx+4]   ;; bar.boost
    mov  [esp+4], eax   ;; localBar.boost
    fmul [esp+4]        ;; localBar.getValue()
    push esi
    push edi
    mov  edi, [esp+20]  ;; foo
    fstp [esp+24]
    fld  [esp+24]       ;; cache localBar.getValue()
    mov  esi, 10        ;; loop counter
$a: push ecx
    mov  ecx, edi       ;; foo
    fstp [esp]          ;; use cached value
    call Foo::munge
    fld  [esp]
    dec  esi
    jne  $a             ;; loop
    pop  edi
    fstp ST(0)
    pop  esi
    add  esp, 8
    ret  0

Observe that the inner loop in Adjust10A must recalculate the value every iteration, since it has to protect against the possibility that foo.munge changed bar.

That said, this style of optimization is not a slam dunk. (For example, we could've gotten the same effect by manually caching bar.getValue() into localValue.) It tends to be most helpful for vectorized operations, since those can be parallelized.

Why using pointers (with low optimization) makes the program faster?

In general: there is no reason why using pointers would make the program run faster. Discussing performance of programs without all optimizations enabled, like the course creator did in your quote, is not meaningful. It is certainly not a reason to change the way you write code.

Another old, often used but obsolete trick is to write such loops counting down instead of up, since compares against zero are often faster than compares against other values. But this too should not affect the way you write code, because a modern compiler can do that optimization for you.

What programmers should do, and what people writing courses should teach, is to write code as plainly and readably as possible. By that measure both of your examples are bad: they are needlessly obscure, and examples of "premature optimization". Better code would be:

int counter;
...
for (counter = 0; counter < 6; counter++)
{
}

This is about as readable as code gets and there is no reason to believe that the above would perform worse than your examples on any known system.

Do this:

  • Write the most readable code you can.
  • In release, enable optimizations.
  • If there are performance problems, benchmark and find bottlenecks.
  • Manually optimize the bottlenecks if needed. Possibly with a specific system in mind.

Ways to make a D program faster

  • Compiler flags: I would add -mcpu=native if the program will be running on your machine. Not sure what effect -Os has in addition to -O3.

  • Profiling has been mentioned in comments. Personally, under Linux I have a script which dumps a process's stack trace; I run it a few times to get an idea of where the program is getting hung up.

  • Not sure what you mean by GS.

  • Since you mentioned classes: in D, methods are virtual by default; virtual methods add indirections and are not inlineable. Make sure only those methods that must be virtual are. See if you can rewrite your program using a form of polymorphism that doesn't involve indirections, such as using template metaprogramming.

  • Since you mentioned associative arrays: these make heavy use of the GC; to speed them up, switch to a third-party library that works on top of std.allocator, such as https://github.com/dlang-community/containers

  • If some parts of your code are parallelizable, std.parallelism is a good tool for this.

  • Since you mentioned that the project is an interpreter: there are many avenues for optimizing them, up to JIT/AOT compilation. Perhaps you could link to an existing library such as LLVM or libjit.

Multiple methods of taking an address

There are no other ways to take the address[*]. The text "using the & operator" in that advice is redundant. I suppose it's just there to remind you which operator is the address-of operator, in case you weren't clear what you're supposed to avoid using.

The optimizations that it's talking about are:

1) if you don't take the address, then the variable doesn't need to have an address.

2) if the address never "escapes" code that the compiler can see, then the compiler can assume the value doesn't change via code that it can't see. Not taking the address ensures this, but in some cases compilers can do escape analysis even when the address is taken.

It really doesn't matter for either of these purposes how the address is taken, so if there were another way to take the address in C, then the advice would have to mention it alongside the & operator.

[*] [Edit: aha! I'm slightly wrong. If the variable is an array, then you can take its address without using the & operator: just let it decay to a pointer-to-first-element. I don't know whether or not the IAR compiler actually optimizes sufficiently small arrays into registers; if not, there's no need to mention it. But it's certainly allowed to.]


