For Loop Optimization

for loop optimization

It is better to use the for-each loop, as it is more readable:

for (Flower flower : flowers) {
    //...
}

I have dumped instructions using javap for the following code:

public void forLoop1() {
    List<String> lst = new ArrayList<String>();
    for (int i = 0; i < lst.size(); i++) {
        System.out.println("hi");
    }
}

public void forLoop2() {
    List<String> lst = new ArrayList<String>();
    int size = lst.size();
    for (int i = 0; i < size; i++) {
        System.out.println("hi");
    }
}
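
For reference, output like the one below can be produced with javap's -c flag; the class name ForLoopTest is only a placeholder for whatever class actually holds these methods:

javac ForLoopTest.java
javap -c ForLoopTest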

public void forLoop1();
  Code:
   0:   new             #2; //class java/util/ArrayList
   3:   dup
   4:   invokespecial   #3; //Method java/util/ArrayList."<init>":()V
   7:   astore_1
   8:   iconst_0
   9:   istore_2
   10:  iload_2
   11:  aload_1
   12:  invokeinterface #4, 1; //InterfaceMethod java/util/List.size:()I
   17:  if_icmpge       34
   20:  getstatic       #5; //Field java/lang/System.out:Ljava/io/PrintStream;
   23:  ldc             #6; //String hi
   25:  invokevirtual   #7; //Method java/io/PrintStream.println:(Ljava/lang/String;)V
   28:  iinc            2, 1
   31:  goto            10
   34:  return

public void forLoop2();
  Code:
   0:   new             #2; //class java/util/ArrayList
   3:   dup
   4:   invokespecial   #3; //Method java/util/ArrayList."<init>":()V
   7:   astore_1
   8:   aload_1
   9:   invokeinterface #4, 1; //InterfaceMethod java/util/List.size:()I
   14:  istore_2
   15:  iconst_0
   16:  istore_3
   17:  iload_3
   18:  iload_2
   19:  if_icmpge       36
   22:  getstatic       #5; //Field java/lang/System.out:Ljava/io/PrintStream;
   25:  ldc             #6; //String hi
   27:  invokevirtual   #7; //Method java/io/PrintStream.println:(Ljava/lang/String;)V
   30:  iinc            3, 1
   33:  goto            17
   36:  return

As you can see, javac does not hoist the size() call for me: forLoop1 re-invokes List.size() on every iteration (offset 12), while forLoop2 just reloads the cached local variable (offset 18).

java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
Java HotSpot(TM) Client VM (build 17.1-b03, mixed mode, sharing)

So if you need to choose between the two versions above, go for the second, but I would personally go for the for-each loop.
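
For comparison, here is a sketch of the same method written with for-each. Note that it iterates the list's elements through an Iterator rather than calling size() in the loop condition (the method name forLoop3 is just a placeholder):

public void forLoop3() {
    List<String> lst = new ArrayList<String>();
    for (String s : lst) {              // iterator-based; size() never appears in the condition
        System.out.println("hi");
    }
}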


for-each Performance

From Item 46 in Effective Java by Joshua Bloch :

The for-each loop, introduced in release 1.5, gets rid of the clutter and the opportunity for error by hiding the iterator or index variable completely. The resulting idiom applies equally to collections and arrays:

// The preferred idiom for iterating over collections and arrays
for (Element e : elements) {
    doSomething(e);
}

When you see the colon (:), read it as "in." Thus, the loop above reads as "for each element e in elements." Note that there is no performance penalty for using the for-each loop, even for arrays. In fact, it may offer a slight performance advantage over an ordinary for loop in some circumstances, as it computes the limit of the array index only once. While you can do this by hand (Item 45), programmers don't always do so.
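
For reference, "computing the limit by hand" means hoisting the length into the loop initializer, roughly as below (a sketch reusing the elements array and doSomething from the quoted idiom):

// Caching the array length by hand -- what for-each effectively does for you
for (int i = 0, n = elements.length; i < n; i++) {
    doSomething(elements[i]);
}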


See Also

  • Is there a performance difference between a for loop and a for-each loop?

Java for-loop optimization

It's so easy to get fooled by hand-made microbenchmarks: you never know what they actually measure. That's why there are special tools like JMH. But let's analyze what happens with this primitive hand-made benchmark:

static class HDouble {
    double value;
}

public static void main(String[] args) {
    primitive();
    wrapper();
}

public static void primitive() {
    long start = System.nanoTime();
    for (double d = 0; d < 1000000000; d++) {
    }
    long end = System.nanoTime();
    System.out.printf("Primitive: %.3f s\n", (end - start) / 1e9);
}

public static void wrapper() {
    HDouble d = new HDouble();
    long start = System.nanoTime();
    for (d.value = 0; d.value < 1000000000; d.value++) {
    }
    long end = System.nanoTime();
    System.out.printf("Wrapper: %.3f s\n", (end - start) / 1e9);
}

The results are somewhat similar to yours:

Primitive: 3.618 s
Wrapper: 1.380 s

Now repeat the test several times:

public static void main(String[] args) {
    for (int i = 0; i < 5; i++) {
        primitive();
        wrapper();
    }
}

It gets more interesting:

Primitive: 3.661 s
Wrapper: 1.382 s
Primitive: 3.461 s
Wrapper: 1.380 s
Primitive: 1.376 s <-- starting from 3rd iteration
Wrapper: 1.381 s <-- the timings become equal
Primitive: 1.371 s
Wrapper: 1.372 s
Primitive: 1.379 s
Wrapper: 1.378 s

Looks like both methods finally got optimized. Run it once again, now with JIT compiler activity logging enabled:
-XX:-TieredCompilation -XX:CompileOnly=Test -XX:+PrintCompilation

  136    1 %   Test::primitive @ 6 (53 bytes)
 3725    1 %   Test::primitive @ -2 (53 bytes)   made not entrant
Primitive: 3.589 s
 3748    2 %   Test::wrapper @ 17 (73 bytes)
 5122    2 %   Test::wrapper @ -2 (73 bytes)   made not entrant
Wrapper: 1.374 s
 5122    3     Test::primitive (53 bytes)
 5124    4 %   Test::primitive @ 6 (53 bytes)
Primitive: 3.421 s
 8544    5     Test::wrapper (73 bytes)
 8547    6 %   Test::wrapper @ 17 (73 bytes)
Wrapper: 1.378 s
Primitive: 1.372 s
Wrapper: 1.375 s
Primitive: 1.378 s
Wrapper: 1.373 s
Primitive: 1.375 s
Wrapper: 1.378 s

Note the % sign in the compilation log on the first iteration. It means that the methods were compiled in OSR (on-stack replacement) mode. During the second iteration the methods were recompiled in normal mode. From then on, starting with the third iteration, there was no difference between primitive and wrapper in execution speed.

What you've actually measured is the performance of the OSR stub. It is usually not related to the real performance of an application, and you shouldn't care much about it.

But the question still remains: why is the OSR stub for the wrapper compiled better than the one for the primitive variable? To find out, we need to go down to the generated assembly code:

-XX:CompileOnly=Test -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly

I'll omit all irrelevant code, leaving only the compiled loop.

Primitive:

0x00000000023e90d0: vmovsd   0x28(%rsp),%xmm1         <-- load double from the stack
0x00000000023e90d6: vaddsd   -0x7e(%rip),%xmm1,%xmm1
0x00000000023e90de: test     %eax,-0x21f90e4(%rip)
0x00000000023e90e4: vmovsd   %xmm1,0x28(%rsp)         <-- store to the stack
0x00000000023e90ea: vucomisd 0x28(%rsp),%xmm0         <-- compare with the stack value
0x00000000023e90f0: ja       0x00000000023e90d0

Wrapper:

0x00000000023ebe90: vaddsd   -0x78(%rip),%xmm0,%xmm0
0x00000000023ebe98: vmovsd   %xmm0,0x10(%rbx)         <-- store to the object field
0x00000000023ebe9d: test     %eax,-0x21fbea3(%rip)
0x00000000023ebea3: vucomisd %xmm0,%xmm1              <-- compare registers
0x00000000023ebea7: ja       0x00000000023ebe90

As you can see, the 'primitive' case performs a number of loads and stores to a stack location, while the 'wrapper' case works mostly on registers. It is quite understandable why the OSR stub refers to the stack: in interpreted mode local variables are stored on the stack, and the OSR stub is made compatible with this interpreted frame. In the 'wrapper' case the value is stored on the heap, and the reference to the object is already cached in a register.
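
As mentioned at the start of this answer, the reliable way to compare such loops is a dedicated harness like JMH. A minimal sketch is shown below; the class name LoopBench is mine, and the benchmark bodies simply mirror primitive() and wrapper() above, returning the final value so the loops cannot be eliminated as dead code:

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
public class LoopBench {

    static class HDouble {
        double value;
    }

    @Benchmark
    public double primitive() {
        double d;
        for (d = 0; d < 1000000000; d++) {
        }
        return d;            // returning the result prevents dead-code elimination
    }

    @Benchmark
    public double wrapper() {
        HDouble d = new HDouble();
        for (d.value = 0; d.value < 1000000000; d.value++) {
        }
        return d.value;
    }
}

With JMH, each benchmark runs in its own forked JVM after proper warmup, so the OSR artifact measured above never shows up in the reported numbers.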

will gcc optimization remove for loop if it's only one iteration?

I think I have to use the volatile keyword when defining the loop variable, since the data processing will be using pointers to inputs/outputs.

No, that doesn't make any sense. Only the input/output hardware registers themselves should be volatile. Pointers to them should be declared as pointer-to-volatile data, i.e. volatile uint8_t*. There is no need to make the pointer itself volatile, i.e. uint8_t* volatile /* wrong */.
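
A minimal sketch of the difference (the register address 0x40001000 is made up purely for illustration):

#include <stdint.h>

/* Correct: the data behind the pointer is volatile, so every access really happens */
volatile uint8_t *port = (volatile uint8_t *)0x40001000u;  /* hypothetical register address */

/* Not needed here: this makes the pointer variable itself volatile instead of the data */
uint8_t *volatile wrong_port = (uint8_t *)0x40001000u;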

As things stand now, you force the compiler to actually create the variable i and increment it, which will likely block loop-unrolling optimizations.

Trying your code on gcc for x86 with -O3, this is exactly what happens. No matter the size of BLOCKSIZE, it still generates the loop because of volatile. If I drop volatile, it completely unrolls the loop up to BLOCKSIZE == 7 and replaces it with a series of function calls. Beyond that it generates a loop (but keeps the iterator in a register instead of in RAM).

x86 example:

for (int i=0; i<5; i++)
{
foo();
}

gives

        call    foo
        call    foo
        call    foo
        call    foo
        call    foo

But

for (volatile int i=0; i<5; i++)
{
foo();
}

gives the far less efficient

        mov     DWORD PTR [rsp+12], 0
        mov     eax, DWORD PTR [rsp+12]
        cmp     eax, 4
        jg      .L2
.L3:
        call    foo
        mov     eax, DWORD PTR [rsp+12]
        add     eax, 1
        mov     DWORD PTR [rsp+12], eax
        mov     eax, DWORD PTR [rsp+12]
        cmp     eax, 4
        jle     .L3
.L2:

For further study of the correct use of volatile in embedded systems, please see:

  • How to access a hardware register from firmware?
  • Using volatile in embedded C development

Is optimizing JavaScript for loops really necessary?

First, I should say that this answer is written in 2011 and these things change over time (as browser interpreters optimize more and more things) so if you really want to know the current state of the world, you have to run tests on current browsers.

Run your own jsperf test on any version of IE, or on many other old browsers, and you will see a consistent difference between the two methods. You apparently only ran it on Chrome, which is so fast and so optimized that there is a negligible difference between the two methods. On IE9 (which is likely far better than IE7 and IE8), the method that pre-caches the length is 31% faster.

A jsperf test designed for this question gives quantitative results. In questions like this, one should just go to jsperf and see what the real difference is rather than speculating.

It shows a difference in the browsers I tried that ranges from negligible to pretty sizable, depending upon the browser. In Chrome, there's almost no difference. In IE9, storing the length first is almost 50% faster.

Now, whether this speed difference matters to your scripts depends on the specific code. If you had a huge array that you were looping through frequently, it could make a meaningful difference in some browsers to use this form:

for (var i = 0, len = list.length; i < len; i++) {
    // do code here
}

In a slightly different test case using the live pseudo-arrays returned by some DOM functions, there was still a difference in speed, but not as pronounced (I expected the difference to be greater on live DOM pseudo-arrays, but it wasn't).

In practice, I tend to use the short version (less typing) when the section of code isn't speed-critical and/or the array isn't large, and I use the longer version that pre-caches the length when I am consciously thinking about speed, the array is huge, or I'm doing a lot of iterations over the same array.

There are a couple of other programming reasons to pre-cache the length. If you will be adding elements to the end of the array during the loop and you don't want the loop to iterate over those newly added elements, then you will NEED to pre-load the length and iterate over only the initially present elements:

for (var i = 0, len = list.length; i < len; i++) {
    if (list[i] == "whatever") {
        list.push("something");
    }
}

Keep in mind that browsers are continually evolving and adding more and more optimizations, so an optimization that shows a great benefit in 2011 may be essentially built into a more modern browser, making the hand-coded optimization unnecessary. If you're trying to optimize something for today's performance, you have to test in today's browsers; you can't just rely on things you read that may be a few years old.

C - how would compiler optimization affect a for loop with no body?

It depends on the type of i. If it is just a plain integer type that isn't used apart from inside the loop, there are no side effects and the compiler is free to optimize away the whole thing.

If you declare i as volatile, however, the compiler is forced to generate code that reads and increments the variable on each iteration of the loop.
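
A minimal sketch of such a "burn-away" delay loop, which survives optimization only because the counter is volatile (the iteration count is an arbitrary placeholder):

#include <stdint.h>

void crude_delay(void)
{
    /* volatile forces a real read-modify-write of i on every iteration,
       so the compiler cannot delete the empty loop */
    for (volatile uint32_t i = 0; i < 100000u; i++)
    {
        /* empty body: just burning CPU cycles */
    }
}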

This is one of many reasons why you should not use "burn-away" loops like these in embedded systems. You also occupy 100% CPU and consume 100% current. And you create a tight coupling between your system clock and the loop, which isn't necessarily linear.

The professional solution is always to use an on-chip hardware timer instead of "burn-away" loops.

Javascript for loop optimization

The important thing to recognize here is that the test that decides whether the loop keeps executing, i.e. the second part of the for-loop declaration (here the i--), is an expression whose value is checked for truthiness/falsiness:

let i = 2;
console.log(i--);
console.log(i--);
console.log(i--);
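
This prints 2, 1, 0: each i-- yields the old value and then decrements. So a countdown loop of the following form (a sketch; arr is just a placeholder array) keeps running while i-- is truthy, i.e. until it yields 0, and inside the body i already holds the decremented index:

const arr = ["a", "b", "c"];
// i-- serves as the loop condition itself: truthy until it evaluates to 0
for (let i = arr.length; i--; ) {
    console.log(i, arr[i]);   // logs 2 "c", then 1 "b", then 0 "a"
}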

Python for loop optimization

Considering that the bottleneck is obtaining the combinations of ids, why not leave it until the very end?

Group the data by id; each id maps to the set of "bins" (day, cluster) in which it appears:

import collections
import itertools

grouped = collections.defaultdict(set)
for index, (id_, day, cluster) in df.iterrows():
    grouped[id_].add((day, cluster))

For each combination of bins found, make a list of the ids that belong to it:

binned = collections.defaultdict(list)
for id_, bins in grouped.items():
    binned[tuple(sorted(bins))].append(id_)

Simplify down to just the clusters, if that is what you need:

clustered = collections.defaultdict(list)
for bins, ids in binned.items():
    clusters = set(cluster for (day, cluster) in bins)
    clustered[tuple(sorted(clusters))].extend(ids)

And finally, getting the combinations of ids for each cluster bin shouldn't be a problem:

for bins, ids in clustered.items():
    if len(ids) > 1:
        for comb_id in itertools.combinations(ids, 2):
            print(bins, comb_id)
            # or do other stuff with it
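
To make the snippets above self-contained, here is a sketch of the assumed input: a pandas DataFrame df with exactly the columns id, day and cluster, in that order (the column names and sample values are mine, chosen only for illustration):

import pandas as pd

# Hypothetical input: one row per observation of an id in a (day, cluster) bin
df = pd.DataFrame(
    {
        "id":      [1, 1, 2, 2, 3],
        "day":     ["mon", "tue", "mon", "tue", "mon"],
        "cluster": ["A", "B", "A", "B", "A"],
    }
)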

C loop optimization help for final assignment (with compiler optimization disabled)

You may be on the right track, though you'll need to measure to be certain (my normal advice to measure, not guess, seems a little superfluous here, since the whole point of the assignment is to measure).

Optimising compilers will probably not see much of a difference, since they're pretty clever about that sort of thing, but as we don't know what optimisation level it will be compiled at, you may get a substantial improvement.

To use pointers in the inner loop is a simple matter of first adding a pointer variable:

register double *pj;

then changing the loop to:

for (pj = &(array[0]); pj < &(array[ARRAY_SIZE]); pj++) {
    sum  += *pj++;
    sum1 += *pj++;
    sum2 += *pj++;
    sum3 += *pj++;
    sum4 += *pj++;
    sum5 += *pj++;
    sum6 += *pj++;
    sum7 += *pj++;
    sum8 += *pj++;
    sum9 += *pj;
}

This keeps the number of additions the same within the loop (assuming you're counting += and ++ as addition operators, of course) but basically uses pointers rather than array indices.

With no optimisation1 on my system, this drops it from 9.868 seconds (CPU time) to 4.84 seconds. Your mileage may vary.


1 With optimisation level -O3, both are reported as taking 0.001 seconds so, as mentioned, the optimisers are pretty clever. However, given you're seeing 5+ seconds, I'd suggest it wasn't compiled with optimisation on.

As an aside, this is a good reason why it's usually advisable to write your code in a readable manner and let the compiler take care of getting it running faster. While my meager attempts at optimisation roughly doubled the speed, using -O3 made it run some ten thousand times faster :-)
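
For reference, a rough sketch of how such timings could be reproduced (sum.c is a placeholder file name; the exact flags required by the assignment may differ):

gcc -O0 -o sum sum.c        # optimisation disabled, as the assignment requires
time ./sum                  # reports the CPU time quoted above
gcc -O3 -o sum_fast sum.c   # with the optimiser on, both versions take ~0.001 s
time ./sum_fast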


