Simple For() Loop Benchmark Takes the Same Time with Any Loop Bound

For-loop with and without an empty body takes the same amount of time

The answer is not optimization (or at least not solely optimization: the byte code for the loop body is still generated). It is because the loop is a floating-point loop, and floating-point operations are notoriously expensive, so the code inside the loop is relatively cheap and insignificant next to the overall running time (as @j11john alluded to).

If you change the loop so it is int based, the two forms will run at significantly different speeds.

long start = System.nanoTime();
int count = (int) Math.pow(10, 9);
int x = 0;
for (int i = 0; i < count; i++) {
    int y = 10;
    x = 3 * y;
    x = 100 * y * x;
    int[] arr = new int[3];
    arr[0] = 1;
}
System.out.println((System.nanoTime() - start) / 1_000_000_000.);

8.4248E-5 s vs 0.047076639 s; the latter is the time with the extra code in the loop body.

So the similarity in both running times (as observed by the OP) is related to the nature of the loop (int vs double) and has very little to do with the code within the loop.
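
For contrast, here is a sketch of a double-based variant (illustrative code, assumed to resemble the OP's original loop rather than being it). With a double induction variable, the loop-carried floating-point increment and compare dominate, so adding or removing body code barely changes the time.

public class DoubleLoopDemo {
    public static void main(String[] args) {
        long start = System.nanoTime();
        double count = Math.pow(10, 9);            // loop bound kept as a double
        int x = 0;
        for (double i = 0; i < count; i++) {       // double induction variable
            // trivial body: its cost hides behind the FP add/compare each iteration
            int y = 10;
            x = 3 * y;
            x = 100 * y * x;
        }
        System.out.println(x);                     // use x so the body isn't trivially dead
        System.out.println((System.nanoTime() - start) / 1_000_000_000.);
    }
}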

Idiomatic way of performance evaluation?

Generally: for repeated short things, you can just time the whole repeat loop. (But microbenchmarking is hard; it's easy to distort results unless you understand the implications of doing that. For very short things, throughput and latency are different, so measure both separately by making one iteration use the result of the previous one, or not. Also beware that branch prediction and caching can make something look fast in a microbenchmark when it would actually be costly if done one at a time between other work in a larger program, e.g. loop unrolling and lookup tables often look good because there's no pressure on I-cache or D-cache from anything else.)

Or if you insist on timing each separate iteration, record the results in an array and print later; you don't want to invoke heavy-weight printing code inside your loop.
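
A minimal sketch of that approach (plain Java; the class name and doWork method are illustrative stand-ins, not from the question):

public class PerIterationTiming {
    static volatile long sink;                     // keeps the work from being optimized away

    static void doWork() {                         // hypothetical stand-in for the code under test
        long acc = 0;
        for (int i = 0; i < 1_000; i++) acc += i;
        sink = acc;
    }

    public static void main(String[] args) {
        int iterations = 1_000;
        long[] samples = new long[iterations];     // record timings here...
        for (int i = 0; i < iterations; i++) {
            long t0 = System.nanoTime();
            doWork();
            samples[i] = System.nanoTime() - t0;
        }
        for (long ns : samples) {                  // ...and only print after the timed loop
            System.out.println(ns);
        }
    }
}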

This question is way too broad to say anything more specific.

Many languages have benchmarking packages that will help you write microbenchmarks of a single function. Use them. e.g. for Java, JMH makes sure the function under test is warmed up and fully optimized by the JIT, and all that jazz, before doing timed runs, and it runs the function for a specified interval, counting how many iterations it completes. See How do I write a correct micro-benchmark in Java? for that and more.



Beware common microbenchmark pitfalls

  • Failure to warm up code / data caches and stuff: page faults within the timed region for touching new memory, or code / data cache misses, that wouldn't be part of normal operation. (Example of noticing this effect: Performance: memset; or example of a wrong conclusion based on this mistake)

  • Never-written memory (obtained fresh from the kernel) gets all its pages copy-on-write mapped to the same system-wide physical page (4K or 2M) of zeros if you read without writing, at least on Linux. So you can get cache hits but TLB misses. e.g. A large allocation from new / calloc / malloc, or a zero-initialized array in static storage in .bss. Use a non-zero initializer or memset.

  • Failure to give the CPU time to ramp up to max turbo: modern CPUs clock down to idle speeds to save power, only clocking up after a few milliseconds. (Or longer depending on the OS / HW).

    related: on modern x86, RDTSC counts reference cycles, not core clock cycles, so it's subject to the same CPU-frequency variation effects as wall-clock time.

  • Most integer and FP arithmetic asm instructions (except divide and square root which are already slower than others) have performance (latency and throughput) that doesn't depend on the actual data. Except for subnormal aka denormal floating point being very slow, and in some cases (e.g. legacy x87 but not SSE2) also producing NaN or Inf can be slow.

  • On modern CPUs with out-of-order execution, some things are too short to truly time meaningfully, see also this. Performance of a tiny block of assembly language (e.g. generated by a compiler for one function) can't be characterized by a single number, even if it doesn't branch or access memory (so there's no chance of a mispredict or cache miss). It has latency from inputs to outputs, and a separate throughput, which is higher when it's run repeatedly with independent inputs. e.g. an add instruction on a Skylake CPU has 4/clock throughput, but 1 cycle latency. So dummy = foo(x) can be 4x faster than x = foo(x); in a loop (see the sketch after this list). Floating-point instructions have higher latency than integer, so it's often a bigger deal. Memory access is also pipelined on most CPUs, so looping over an array (where the address for the next load is easy to calculate) is often much faster than walking a linked list (where the address for the next load isn't available until the previous load completes).

    Obviously performance can differ between CPUs. In the big picture it's usually rare for version A to be faster on Intel and version B to be faster on AMD, but that can easily happen at a small scale. When reporting / recording benchmark numbers, always note what CPU you tested on.

  • Related to the above and below points: you can't "benchmark the * operator" in C in general, for example. Some use-cases for it will compile very differently from others, e.g. tmp = foo * i; in a loop can often turn into tmp += foo (strength reduction), or if the multiplier is a constant power of 2 the compiler will just use a shift. The same operator in the source can compile to very different instructions, depending on surrounding code.

  • You need to compile with optimization enabled, but you also need to stop the compiler from optimizing away the work, or hoisting it out of a loop. Make sure you use the result (e.g. print it or store it to a volatile) so the compiler has to produce it. For an array, volatile double sink = output[argc]; is a useful trick: the compiler doesn't know the value of argc so it has to generate the whole array, but you don't need to read the whole array or even call an RNG function. (Unless the compiler aggressively transforms to only calculate the one output selected by argc, but that tends not to be a problem in practice.)

    For inputs, use a random number or argc or something instead of a compile-time constant so your compiler can't do constant-propagation for things that won't be constants in your real use-case. In C you can sometimes use inline asm or volatile for this, e.g. the stuff this question is asking about. A good benchmarking package like Google Benchmark will include functions for this.

  • If the real use-case for a function lets it inline into callers where some inputs are constant, or the operations can be optimized into other work, it's not very useful to benchmark it on its own.

  • Big complicated functions with special handling for lots of special cases can look fast in a microbenchmark when you run them repeatedly, especially with the same input every time. In real life use-cases, branch prediction often won't be primed for that function with that input. Also, a massively unrolled loop can look good in a microbenchmark, but in real life it slows everything else down with its big instruction-cache footprint leading to eviction of other code.

Related to that last point: Don't tune only for huge inputs, if the real use-case for a function includes a lot of small inputs. e.g. a memcpy implementation that's great for huge inputs but takes too long to figure out which strategy to use for small inputs might not be good. It's a tradeoff; make sure it's good enough for large inputs (for an appropriate definition of "enough"), but also keep overhead low for small inputs.
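
Here is the latency-vs-throughput sketch referred to in the list above. It is a naive timing harness purely for illustration (a real measurement should use JMH as mentioned earlier), and the exact ratio depends on the CPU and on what the JIT does with this particular code.

public class LatencyVsThroughput {
    static volatile long sink;   // volatile sink so the JIT must actually compute the results

    public static void main(String[] args) {
        int n = 100_000_000;
        long seed = System.nanoTime() & 0xffff;    // non-constant input, defeats constant folding

        // Latency-bound: each multiply-add depends on the previous result,
        // so the loop runs at roughly one dependency-chain step per iteration.
        long t0 = System.nanoTime();
        long x = seed;
        for (int i = 0; i < n; i++) {
            x = x * 3 + 1;
        }
        sink = x;
        long latencyNs = System.nanoTime() - t0;

        // Throughput-bound: four independent chains can overlap on an
        // out-of-order CPU, so 4x the work often takes close to the same time.
        t0 = System.nanoTime();
        long a = seed, b = seed + 1, c = seed + 2, d = seed + 3;
        for (int i = 0; i < n; i++) {
            a = a * 3 + 1;
            b = b * 3 + 1;
            c = c * 3 + 1;
            d = d * 3 + 1;
        }
        sink = a ^ b ^ c ^ d;
        long throughputNs = System.nanoTime() - t0;

        System.out.println("dependent chain:  " + latencyNs / 1e6 + " ms");
        System.out.println("4 indep. chains:  " + throughputNs / 1e6 + " ms (4x the work)");
    }
}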

Litmus tests:

  • If you're benchmarking two functions in one program: if reversing the order of testing changes the results, your benchmark isn't fair. e.g. function A might only look slow because you're testing it first, with insufficient warm-up. example: Why is std::vector slower than an array? (it's not, whichever loop runs first has to pay for all the page faults and cache misses; the 2nd just zooms through filling the same memory.)

  • Increasing the iteration count of a repeat loop should linearly increase the total time and not affect the calculated time-per-call. If it doesn't, then you have non-negligible measurement overhead, or your code got optimized away (e.g. hoisted out of the loop so it runs only once instead of N times). See the sketch after these litmus tests.

  • Vary other test parameters as a sanity check.
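
A sketch of the iteration-count litmus test mentioned above (illustrative Java; workUnderTest is a hypothetical stand-in for whatever you are benchmarking):

public class ScalingCheck {
    static volatile long sink;

    static void workUnderTest() {          // hypothetical code being benchmarked
        long acc = 0;
        for (int i = 0; i < 1_000; i++) acc += i * i;
        sink = acc;
    }

    public static void main(String[] args) {
        // Doubling the repeat count should roughly double total time and leave
        // ns/iteration flat; if ns/iteration drops sharply, the work was probably
        // optimized away or measurement overhead dominates.
        for (int count : new int[]{1_000, 2_000, 4_000, 8_000}) {
            long t0 = System.nanoTime();
            for (int i = 0; i < count; i++) {
                workUnderTest();
            }
            long total = System.nanoTime() - t0;
            System.out.printf("count=%d  total=%.2f ms  per-iter=%.1f ns%n",
                    count, total / 1e6, (double) total / count);
        }
    }
}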


For C / C++, see also Simple for() loop benchmark takes the same time with any loop bound where I went into some more detail about microbenchmarking and using volatile or asm to stop important work from optimizing away with gcc/clang.

Benchmarking (execution time of a for loop): shouldn't we have a monotonically increasing function as the limit of the for loop increases?

Hypothesis:

A hypothesis is an explanation for a phenomenon. Since you're putting this forth before actually observing any phenomenon, it cannot be a hypothesis. Additionally if, as you claim, it directly contradicts the observations, it cannot work as an explanation for them, hence not a hypothesis.

With the increasing limit the time should increase. We should have a Strictly Increasing Function

And as far as I can tell the observation does not contradict this.

The artifacts in the curve are easily explained by the clock's finite resolution: it clearly cannot measure time differences smaller than 1e-5; variations smaller than 1e-5 will show as flat lines or discrete 1e-5 changes.
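
A tiny simulation of that effect (illustrative Java; the per-iteration "true" cost is made up purely to show how a 1e-5-resolution clock flattens a strictly increasing curve into steps):

public class QuantizedClockDemo {
    public static void main(String[] args) {
        double resolution = 1e-5;                      // assumed clock tick, as in the answer
        for (int n = 1; n <= 20; n++) {
            double trueTime = n * 1.7e-6;              // hypothetical strictly increasing true time
            double measured = Math.floor(trueTime / resolution) * resolution;
            System.out.printf("n=%2d  true=%.7f  measured=%.5f%n", n, trueTime, measured);
        }
    }
}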

looping over array, performance difference between indexed and enhanced for loop

TL;DR: You are observing what happens when the JIT compiler cannot trust that values are not changing inside the loop. Additionally, in a tiny benchmark like this, Blackhole.consume costs dominate, obscuring the results.

Simplifying the test:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class JdkBenchmarks {

    public int[] values;

    @Setup
    public void setupArray() {
        int count = 1000;
        values = new int[count];
        for (int i = 0; i < count; i++) {
            values[i] = i;
        }
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public void indexed(Blackhole bh) {
        for (int i = 0; i < values.length; i++) {
            bh.consume(values[i]);
        }
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public void indexed_cached(Blackhole bh) {
        int[] vs = values;
        int length = vs.length;
        for (int i = 0; i < length; i++) {
            bh.consume(vs[i]);
        }
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public void enhanced(Blackhole bh) {
        for (int value : values) {
            bh.consume(value);
        }
    }
}

Running both enhanced and indexed_cached under -prof perfasm reveals this hot loop (I specifically used @CompilerControl(DONT_INLINE) to let the @Benchmark method be compiled alone, which makes for easier-to-digest perfasm output):

          ↗  0x...4240: mov  0x10(%r8,%rsi,4),%r10d  ; load values[i], blackhole it
 22.68%   │  0x...4245: mov  0x14(%r8,%rsi,4),%r11d  ;   ... repeat 7 more times ...
          │  0x...424a: mov  0x18(%r8,%rsi,4),%r10d  ;
 20.95%   │  0x...424f: mov  0x1c(%r8,%rsi,4),%r10d  ;
  0.02%   │  0x...4254: mov  0x20(%r8,%rsi,4),%r11d  ;
 24.73%   │  0x...4259: mov  0x24(%r8,%rsi,4),%r10d  ;
  0.24%   │  0x...425e: mov  0x28(%r8,%rsi,4),%r11d  ;
 20.04%   │  0x...4263: mov  0x2c(%r8,%rsi,4),%r10d  ;
  0.22%   │  0x...4268: add  $0x8,%esi               ; i += 8
          │  0x...426b: cmp  %ebp,%esi               ; compare i with length (in %ebp)
  0.26%   ╰  0x...426d: jl   0x...4240               ; circle back if 8 more elements available

Very efficient!

Running indexed with -prof perfasm reveals:

          ↗  0x...4170: mov  0xc(%r12,%r8,8),%r9d     ; array bounds check, load values.length
  3.42%   │  0x...4175: cmp  %r9d,%r10d               ; array bounds check, compare i
 16.02%   │  0x...4178: jae  0x...41b1                ; failed? jump to exception handling
          │  0x...417a: lea  (%r12,%r8,8),%r11        ; load values[i], part 1
  0.04%   │  0x...417e: mov  0x10(%r11,%r10,4),%r11d  ; load values[i], part 2
          │                                           ; %r11d is blackholed
 35.69%   │  0x...4183: mov  0xc(%rsi),%r8d           ; get "values"
  0.71%   │  0x...4187: mov  0x348(%r15),%r11         ; safepoint poll, part 1 (JVM infra)
  4.03%   │  0x...418e: inc  %r10d                    ; i++
  0.12%   │  0x...4191: test %eax,(%r11)              ; safepoint poll, part 2 (JVM infra)
 27.74%   │  0x...4194: mov  0xc(%r12,%r8,8),%r9d     ; load values.length
  8.53%   │  0x...4199: cmp  %r9d,%r10d               ; check i < values.length
  0.24%   ╰  0x...419c: jl   0x...4170                ; circle back if more

This is because the Blackhole.consume call is opaque to the compiler (like many other non-inlined calls), so it has to conservatively presume that values can change in the middle of the loop!

Which means the compiler cannot stash values in a register, cannot trust the array bounds check to always succeed, and cannot even guarantee the loop terminates (hence the safepoint polls); on top of that, loop unrolling does not want to multiply that per-element mess even more.

So you get a penalty like this (TR 3970X, JDK 17.0.2 EA, Linux x86_64):

Benchmark                      Mode  Cnt     Score   Error  Units
JdkBenchmarks.enhanced         avgt    5   144.962 ± 0.918  ns/op
JdkBenchmarks.indexed          avgt    5  1030.981 ± 3.775  ns/op  ; + 880 ns/op!
JdkBenchmarks.indexed_cached   avgt    5   144.799 ± 0.643  ns/op  ; same as enhanced

Additional fun part:

On most JDKs the dominating costs are the costs of calling the Blackhole.consume in this test. Compared to the cost of array access, the cost of Java-style Blackhole is quite bad. With JDK 17+ and JMH 1.34, the compiler Blackholes would be used, and thus provide much more fidelity for the test.

Without compiler blackholes, the effect hides in the Blackhole overhead nearly completely (>25x overhead means we can execute a lot of bad code preceding the Blackhole call!):

Benchmark                      Mode  Cnt     Score   Error  Units
JdkBenchmarks.enhanced         avgt    5  4062.866 ± 4.736  ns/op
JdkBenchmarks.indexed          avgt    5  4071.620 ± 1.057  ns/op  ; + 10 ns/op [whoops]
JdkBenchmarks.indexed_cached   avgt    5  4061.390 ± 0.692  ns/op  ; same as enhanced

It would re-manifest if we drop @CompilerControl(DONT_INLINE), because the resulting generated code would be much messier:

Benchmark                      Mode  Cnt     Score    Error  Units
JdkBenchmarks.enhanced         avgt    5  4067.118 ± 40.699  ns/op
JdkBenchmarks.indexed          avgt    5  4601.370 ±  0.632  ns/op  ; + 530 ns/op
JdkBenchmarks.indexed_cached   avgt    5  4064.455 ±  1.554  ns/op  ; same as enhanced

Is there a performance difference between a for loop and a for-each loop?

From Item 46 in Effective Java by Joshua Bloch:

The for-each loop, introduced in release 1.5, gets rid of the clutter and the opportunity for error by hiding the iterator or index variable completely. The resulting idiom applies equally to collections and arrays:

// The preferred idiom for iterating over collections and arrays
for (Element e : elements) {
    doSomething(e);
}

When you see the colon (:), read it as “in.” Thus, the loop above reads as “for each element e in elements.” Note that there is no performance penalty for using the for-each loop, even for arrays. In fact, it may offer a slight performance advantage over an ordinary for loop in some circumstances, as it computes the limit of the array index only once. While you can do this by hand (Item 45), programmers don’t always do so.
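
For completeness, the "by hand" idiom Bloch alludes to, next to the for-each version (illustrative Java, not code from the book):

public class ForEachVsIndexed {
    static void doSomething(int e) { System.out.println(e); }

    public static void main(String[] args) {
        int[] elements = {1, 2, 3, 4, 5};

        // "By hand" (Item 45 style): the limit is computed once and cached in n.
        for (int i = 0, n = elements.length; i < n; i++) {
            doSomething(elements[i]);
        }

        // The for-each idiom gets the same effect with no index bookkeeping.
        for (int e : elements) {
            doSomething(e);
        }
    }
}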

Javascript efficiency: 'for' vs 'forEach'

for

for loops are much more efficient. The for loop is a looping construct specifically designed to iterate while a condition is true, while also offering a stepping mechanism (generally to increment the counter). Example:

for (var i = 0, n = arr.length; i < n; ++i) {
    ...
}

This isn't to suggest that for-loops will always be more efficient, just that JS engines and browsers have optimized them to be so. Over the years there have been compromises as to which looping construct is more efficient (for, while, reduce, reverse-while, etc) -- different browsers and JS engines have their own implementations that offer different methodologies to produce the same results. As browsers further optimize to meet performance demands, theoretically [].forEach could be implemented in such a way that it's faster or comparable to a for.

Benefits:

  • efficient
  • early loop termination (honors break and continue)
  • condition control (i<n can be anything and not bound to an array's size)
  • variable scoping (var i leaves i available after the loop ends)

forEach

forEach is a method that primarily iterates over arrays (and also over other iterables, such as Map and Set objects). It is newer and provides code that is subjectively easier to read. Example:

[].forEach((val, index)=>{
...
});

Benefits:

  • does not involve variable setup (iterates over each element of the array)
  • functions/arrow-functions scope the variable to the block

    In the example above, val would be a parameter of the newly created function. Thus, any variable called val before the loop would hold its value after it ends.
  • subjectively more maintainable, as it may be easier to identify what the code is doing: it's iterating over an iterable, whereas a for loop could be used for any number of looping schemes

Performance

Performance is a tricky topic, which generally requires some experience when it comes to forethought or approach. In order to determine ahead of time (while developing) how much optimization may be required, a programmer must have a good idea of past experience with the problem case, as well as a good understanding of potential solutions.

Using jQuery may be too slow in some cases (an experienced developer may know that), whereas at other times it may be a non-issue, in which case the library's cross-browser compliance and the ease of performing other tasks (e.g., AJAX, event handling) would be worth the development (and maintenance) time saved.

Another example: if performance and optimization were everything, there would be no code other than machine code or assembly. Obviously that isn't the case, as there are many different high-level and low-level languages, each with its own tradeoffs. These tradeoffs include, but are not limited to, specialization, development ease and speed, maintenance ease and speed, optimized code, error-free code, and so on.

Approach

If you don't have a good understanding of whether something will require optimized code, it's generally a good rule of thumb to write maintainable code first. From there, you can test and pinpoint what needs more attention when it's required.

That said, certain obvious optimizations should be part of general practice and not require any thought. For instance, consider the following loop:

for (var i=0; i < arr.length; ++i ){}

For each iteration of the loop, JavaScript retrieves arr.length, a key lookup that costs operations on every cycle. There is no reason why this shouldn't be:

for (var i=0, n=arr.length; i < n; ++i){}

This does the same thing, but only retrieves arr.length once, caching the variable and optimizing your code.

Why is looping over range() in Python faster than using a while loop?

Look at the disassembly of the Python bytecode and you will get a more concrete idea.

Using a while loop:

1           0 LOAD_CONST               0 (0)
            3 STORE_NAME               0 (i)

2           6 SETUP_LOOP              28 (to 37)
      >>    9 LOAD_NAME                0 (i)              # <-
           12 LOAD_CONST               1 (100000000)      # <-
           15 COMPARE_OP               0 (<)              # <-
           18 JUMP_IF_FALSE           14 (to 35)          # <-
           21 POP_TOP                                     # <-

3          22 LOAD_NAME                0 (i)              # <-
           25 LOAD_CONST               2 (1)              # <-
           28 INPLACE_ADD                                 # <-
           29 STORE_NAME               0 (i)              # <-
           32 JUMP_ABSOLUTE            9                  # <-
      >>   35 POP_TOP
           36 POP_BLOCK

The loop body has 10 ops.

Using range:

1           0 SETUP_LOOP              23 (to 26)
            3 LOAD_NAME                0 (range)
            6 LOAD_CONST               0 (0)
            9 LOAD_CONST               1 (100000000)
           12 CALL_FUNCTION            2
           15 GET_ITER
      >>   16 FOR_ITER                 6 (to 25)          # <-
           19 STORE_NAME               1 (n)              # <-

2          22 JUMP_ABSOLUTE           16                  # <-
      >>   25 POP_BLOCK
      >>   26 LOAD_CONST               2 (None)
           29 RETURN_VALUE

The loop body has 3 ops.

The time spent running C code (inside range's iterator) is much shorter than the time spent in the interpreter and can be ignored.


