Java Benchmarking - Why Is the Second Loop Faster

Java benchmarking - why is the second loop faster?

This is a classic Java benchmarking issue. HotSpot's JIT compiler will compile your code as you use it, so it gets faster during the run.

Run around the loop at least 3,000 times (10,000 on a server VM or on 64-bit) first, then do your measurements.
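A minimal sketch of that advice, assuming a hypothetical work() method as the thing being measured:

public class WarmupExample {
    // Stand-in workload; replace with whatever you actually want to measure.
    static long work(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i ^ (i << 1);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Warm-up: call the method enough times for HotSpot to compile it
        // (typically a few thousand invocations, more on the server compiler).
        long sink = 0;
        for (int i = 0; i < 20_000; i++) {
            sink += work(1_000);
        }

        // Measured run, after compilation has settled.
        long start = System.nanoTime();
        sink += work(100_000_000);
        long end = System.nanoTime();

        System.out.println("result=" + sink);   // keep the result alive
        System.out.println((end - start) / 1_000_000 + " ms");
    }
}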

Why is 2 * (i * i) faster than 2 * i * i in Java?

There is a slight difference in the ordering of the bytecode.

2 * (i * i):

    iconst_2
    iload0
    iload0
    imul
    imul
    iadd

vs 2 * i * i:

    iconst_2
    iload0
    imul
    iload0
    imul
    iadd

At first sight this should not make a difference; if anything, the second version looks more optimal, since it uses one stack slot fewer.

So we need to dig deeper into the lower level (JIT) [1].
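For reference, the bytecode above corresponds roughly to a loop body of this shape (a reconstruction for illustration; the original benchmark harness is not shown here):

// Roughly the loop being benchmarked (reconstructed for illustration only).
// The accumulation deliberately overflows int; that is fine for a benchmark.
int n = 0;
for (int i = 0; i < 1_000_000_000; i++) {
    n += 2 * (i * i);   // versus: n += 2 * i * i;
}
System.out.println(n);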

Remember that the JIT tends to unroll small loops very aggressively. Indeed, we observe a 16x unrolling for the 2 * (i * i) case:

030   B2: # B2 B3 <- B1 B2  Loop: B2-B2 inner main of N18 Freq: 1e+006
030 addl R11, RBP # int
033 movl RBP, R13 # spill
036 addl RBP, #14 # int
039 imull RBP, RBP # int
03c movl R9, R13 # spill
03f addl R9, #13 # int
043 imull R9, R9 # int
047 sall RBP, #1
049 sall R9, #1
04c movl R8, R13 # spill
04f addl R8, #15 # int
053 movl R10, R8 # spill
056 movdl XMM1, R8 # spill
05b imull R10, R8 # int
05f movl R8, R13 # spill
062 addl R8, #12 # int
066 imull R8, R8 # int
06a sall R10, #1
06d movl [rsp + #32], R10 # spill
072 sall R8, #1
075 movl RBX, R13 # spill
078 addl RBX, #11 # int
07b imull RBX, RBX # int
07e movl RCX, R13 # spill
081 addl RCX, #10 # int
084 imull RCX, RCX # int
087 sall RBX, #1
089 sall RCX, #1
08b movl RDX, R13 # spill
08e addl RDX, #8 # int
091 imull RDX, RDX # int
094 movl RDI, R13 # spill
097 addl RDI, #7 # int
09a imull RDI, RDI # int
09d sall RDX, #1
09f sall RDI, #1
0a1 movl RAX, R13 # spill
0a4 addl RAX, #6 # int
0a7 imull RAX, RAX # int
0aa movl RSI, R13 # spill
0ad addl RSI, #4 # int
0b0 imull RSI, RSI # int
0b3 sall RAX, #1
0b5 sall RSI, #1
0b7 movl R10, R13 # spill
0ba addl R10, #2 # int
0be imull R10, R10 # int
0c2 movl R14, R13 # spill
0c5 incl R14 # int
0c8 imull R14, R14 # int
0cc sall R10, #1
0cf sall R14, #1
0d2 addl R14, R11 # int
0d5 addl R14, R10 # int
0d8 movl R10, R13 # spill
0db addl R10, #3 # int
0df imull R10, R10 # int
0e3 movl R11, R13 # spill
0e6 addl R11, #5 # int
0ea imull R11, R11 # int
0ee sall R10, #1
0f1 addl R10, R14 # int
0f4 addl R10, RSI # int
0f7 sall R11, #1
0fa addl R11, R10 # int
0fd addl R11, RAX # int
100 addl R11, RDI # int
103 addl R11, RDX # int
106 movl R10, R13 # spill
109 addl R10, #9 # int
10d imull R10, R10 # int
111 sall R10, #1
114 addl R10, R11 # int
117 addl R10, RCX # int
11a addl R10, RBX # int
11d addl R10, R8 # int
120 addl R9, R10 # int
123 addl RBP, R9 # int
126 addl RBP, [RSP + #32 (32-bit)] # int
12a addl R13, #16 # int
12e movl R11, R13 # spill
131 imull R11, R13 # int
135 sall R11, #1
138 cmpl R13, #999999985
13f jl B2 # loop end P=1.000000 C=6554623.000000

We see that there is 1 register that is "spilled" onto the stack.

And for the 2 * i * i version:

05a   B3: # B2 B4 <- B1 B2  Loop: B3-B2 inner main of N18 Freq: 1e+006
05a addl RBX, R11 # int
05d movl [rsp + #32], RBX # spill
061 movl R11, R8 # spill
064 addl R11, #15 # int
068 movl [rsp + #36], R11 # spill
06d movl R11, R8 # spill
070 addl R11, #14 # int
074 movl R10, R9 # spill
077 addl R10, #16 # int
07b movdl XMM2, R10 # spill
080 movl RCX, R9 # spill
083 addl RCX, #14 # int
086 movdl XMM1, RCX # spill
08a movl R10, R9 # spill
08d addl R10, #12 # int
091 movdl XMM4, R10 # spill
096 movl RCX, R9 # spill
099 addl RCX, #10 # int
09c movdl XMM6, RCX # spill
0a0 movl RBX, R9 # spill
0a3 addl RBX, #8 # int
0a6 movl RCX, R9 # spill
0a9 addl RCX, #6 # int
0ac movl RDX, R9 # spill
0af addl RDX, #4 # int
0b2 addl R9, #2 # int
0b6 movl R10, R14 # spill
0b9 addl R10, #22 # int
0bd movdl XMM3, R10 # spill
0c2 movl RDI, R14 # spill
0c5 addl RDI, #20 # int
0c8 movl RAX, R14 # spill
0cb addl RAX, #32 # int
0ce movl RSI, R14 # spill
0d1 addl RSI, #18 # int
0d4 movl R13, R14 # spill
0d7 addl R13, #24 # int
0db movl R10, R14 # spill
0de addl R10, #26 # int
0e2 movl [rsp + #40], R10 # spill
0e7 movl RBP, R14 # spill
0ea addl RBP, #28 # int
0ed imull RBP, R11 # int
0f1 addl R14, #30 # int
0f5 imull R14, [RSP + #36 (32-bit)] # int
0fb movl R10, R8 # spill
0fe addl R10, #11 # int
102 movdl R11, XMM3 # spill
107 imull R11, R10 # int
10b movl [rsp + #44], R11 # spill
110 movl R10, R8 # spill
113 addl R10, #10 # int
117 imull RDI, R10 # int
11b movl R11, R8 # spill
11e addl R11, #8 # int
122 movdl R10, XMM2 # spill
127 imull R10, R11 # int
12b movl [rsp + #48], R10 # spill
130 movl R10, R8 # spill
133 addl R10, #7 # int
137 movdl R11, XMM1 # spill
13c imull R11, R10 # int
140 movl [rsp + #52], R11 # spill
145 movl R11, R8 # spill
148 addl R11, #6 # int
14c movdl R10, XMM4 # spill
151 imull R10, R11 # int
155 movl [rsp + #56], R10 # spill
15a movl R10, R8 # spill
15d addl R10, #5 # int
161 movdl R11, XMM6 # spill
166 imull R11, R10 # int
16a movl [rsp + #60], R11 # spill
16f movl R11, R8 # spill
172 addl R11, #4 # int
176 imull RBX, R11 # int
17a movl R11, R8 # spill
17d addl R11, #3 # int
181 imull RCX, R11 # int
185 movl R10, R8 # spill
188 addl R10, #2 # int
18c imull RDX, R10 # int
190 movl R11, R8 # spill
193 incl R11 # int
196 imull R9, R11 # int
19a addl R9, [RSP + #32 (32-bit)] # int
19f addl R9, RDX # int
1a2 addl R9, RCX # int
1a5 addl R9, RBX # int
1a8 addl R9, [RSP + #60 (32-bit)] # int
1ad addl R9, [RSP + #56 (32-bit)] # int
1b2 addl R9, [RSP + #52 (32-bit)] # int
1b7 addl R9, [RSP + #48 (32-bit)] # int
1bc movl R10, R8 # spill
1bf addl R10, #9 # int
1c3 imull R10, RSI # int
1c7 addl R10, R9 # int
1ca addl R10, RDI # int
1cd addl R10, [RSP + #44 (32-bit)] # int
1d2 movl R11, R8 # spill
1d5 addl R11, #12 # int
1d9 imull R13, R11 # int
1dd addl R13, R10 # int
1e0 movl R10, R8 # spill
1e3 addl R10, #13 # int
1e7 imull R10, [RSP + #40 (32-bit)] # int
1ed addl R10, R13 # int
1f0 addl RBP, R10 # int
1f3 addl R14, RBP # int
1f6 movl R10, R8 # spill
1f9 addl R10, #16 # int
1fd cmpl R10, #999999985
204 jl B2 # loop end P=1.000000 C=7419903.000000

Here we observe much more "spilling" and more accesses to the stack [RSP + ...], due to more intermediate results that need to be preserved.

Thus the answer to the question is simple: 2 * (i * i) is faster than 2 * i * i because the JIT generates more optimal assembly code for the first case.


But of course it is obvious that neither the first nor the second version is any good; the loop could really benefit from vectorization, since any x86-64 CPU has at least SSE2 support.

So it's an issue of the optimizer; as is often the case, it unrolls too aggressively and shoots itself in the foot, all the while missing out on various other opportunities.

In fact, modern x86-64 CPUs break down the instructions further into micro-ops (µops) and with features like register renaming, µop caches and loop buffers, loop optimization takes a lot more finesse than a simple unrolling for optimal performance. According to Agner Fog's optimization guide:

The gain in performance due to the µop cache can be quite considerable if the average instruction length is more than 4 bytes. The following methods of optimizing the use of the µop cache may be considered:

  • Make sure that critical loops are small enough to fit into the µop cache.
  • Align the most critical loop entries and function entries by 32.
  • Avoid unnecessary loop unrolling.
  • Avoid instructions that have extra load time
    …

Regarding those load times: even the fastest L1D hit costs 4 cycles plus an extra register and µop, so yes, even a few accesses to memory will hurt performance in tight loops.

But back to the vectorization opportunity - to see how fast it can be, we can compile a similar C application with GCC, which outright vectorizes it (AVX2 is shown, SSE2 is similar) [2]:

        vmovdqa ymm0, YMMWORD PTR .LC0[rip]
        vmovdqa ymm3, YMMWORD PTR .LC1[rip]
        xor     eax, eax
        vpxor   xmm2, xmm2, xmm2
.L2:
        vpmulld ymm1, ymm0, ymm0
        inc     eax
        vpaddd  ymm0, ymm0, ymm3
        vpslld  ymm1, ymm1, 1
        vpaddd  ymm2, ymm2, ymm1
        cmp     eax, 125000000      ; 8 calculations per iteration
        jne     .L2
        vmovdqa xmm0, xmm2
        vextracti128 xmm2, ymm2, 1
        vpaddd  xmm2, xmm0, xmm2
        vpsrldq xmm0, xmm2, 8
        vpaddd  xmm0, xmm2, xmm0
        vpsrldq xmm1, xmm0, 4
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        vzeroupper

With run times:

  • SSE: 0.24 s, or 2 times as fast.
  • AVX: 0.15 s, or 3 times as fast.
  • AVX2: 0.08 s, or 5 times as fast.
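As a side note, recent JDKs also let you express this kind of SIMD reduction explicitly in Java itself via the incubating Vector API. A minimal sketch, assuming JDK 16+ and --add-modules jdk.incubator.vector (this illustrates the idea; it is not the code HotSpot generates for the original loop):

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class VectorizedSum {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    // Sum of 2 * (i * i) for i in [0, limit), wrapping on int overflow,
    // processing SPECIES.length() values of i per loop iteration.
    static int sum(int limit) {
        IntVector acc = IntVector.zero(SPECIES);
        IntVector iota = IntVector.zero(SPECIES).addIndex(1);   // [0, 1, 2, ...]
        int lanes = SPECIES.length();
        int i = 0;
        for (; i <= limit - lanes; i += lanes) {
            IntVector v = iota.add(i);                           // [i, i+1, ...]
            acc = acc.add(v.mul(v).lanewise(VectorOperators.LSHL, 1));
        }
        int n = acc.reduceLanes(VectorOperators.ADD);
        for (; i < limit; i++) {                                 // scalar tail
            n += 2 * (i * i);
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(sum(1_000_000_000));
    }
}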

[1] To get JIT-generated assembly output, get a debug JVM and run with -XX:+PrintOptoAssembly.

[2] The C version is compiled with the -fwrapv flag, which lets GCC treat signed integer overflow as two's-complement wrap-around.

Two operations in one loop vs. two loops performing one operation each

If you don't run a warm-up phase, it is possible that the first loop gets optimised and compiled but not the second one, whereas when you merge them the whole merged loop gets compiled. Also, with the server option and your original code, most of the work gets optimised away because you don't use the results.

I have run the test below, putting each loop as well as the merged loop in its own method and warming up the JVM to make sure everything gets compiled.

Results (JVM options: -server -XX:+PrintCompilation):

  • loop 1 = 500 ms
  • loop 2 = 900 ms
  • merged loop = 1,300 ms

So the merged loop is slightly faster than the two separate loops combined (1,300 ms vs. 1,400 ms total), but not by much.

public static void main(String[] args) throws InterruptedException {

    for (int i = 0; i < 3; i++) {
        loop1();
        loop2();
        loopBoth();
    }

    long start = System.nanoTime();
    loop1();
    long end = System.nanoTime();
    System.out.println((end - start) / 1000000);

    start = System.nanoTime();
    loop2();
    end = System.nanoTime();
    System.out.println((end - start) / 1000000);

    start = System.nanoTime();
    loopBoth();
    end = System.nanoTime();
    System.out.println((end - start) / 1000000);
}

public static void loop1() {
    int a = 188, aMax = 0;
    for (int i = 0; i < 1000000000; i++) {
        int t = a ^ i;
        if (t > aMax) {
            aMax = t;
        }
    }
    System.out.println(aMax);
}

public static void loop2() {
    int b = 144, bMax = 0;
    for (int i = 0; i < 1000000000; i++) {
        int t = b ^ i;
        if (t > bMax) {
            bMax = t;
        }
    }
    System.out.println(bMax);
}

public static void loopBoth() {
    int a = 188, b = 144, aMax = 0, bMax = 0;

    for (int i = 0; i < 1000000000; i++) {
        int t = a ^ i;
        if (t > aMax) {
            aMax = t;
        }
        int u = b ^ i;
        if (u > bMax) {
            bMax = u;
        }
    }
    System.out.println(aMax);
    System.out.println(bMax);
}

Running time of the same code blocks is different in Java. Why is that?

Faulty benchmarking. A non-exhaustive list of what is wrong:

  • No warm-up: single-shot measurements are almost always wrong;
  • Mixing several code paths in a single method: we probably start compiling the method with execution data available only for the first loop in the method;
  • Sources are predictable: should the loop compile, we can actually predict the result;
  • Results are dead-code eliminated: should the loop compile, we can throw the loop away (returning the result, or feeding it into a JMH Blackhole, keeps it alive; see the sketch below).
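A minimal sketch of that last point, using the modern @Benchmark annotation and JMH's Blackhole (the Math.log workload is just a stand-in for illustration):

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class DeadCodeDemo {

    private double x = Math.PI;   // non-constant input, so the work can't be folded away

    @Benchmark
    public void dead() {
        // Result unused: the JIT may eliminate the whole computation.
        Math.log(x);
    }

    @Benchmark
    public double returned() {
        // Returning the value makes JMH consume it.
        return Math.log(x);
    }

    @Benchmark
    public void consumed(Blackhole bh) {
        // Explicitly sinking the value has the same effect.
        bh.consume(Math.log(x));
    }
}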

Here is how you do it (arguably) right with JMH:

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 3, time = 1)
@Fork(10)
@State(Scope.Thread)
public class Longs {

    public static final int COUNT = 10;

    private Long[] refLongs;
    private long[] primLongs;

    /*
     * Implementation notes:
     *   - copying the array from the field keeps the constant
     *     optimizations away, but we are implicitly counting the
     *     cost of arraycopy() in;
     *   - two additional baseline experiments quantify the
     *     scale of the arraycopy effects (note you can't directly
     *     subtract the baseline scores from the test scores, because
     *     the code is mixed together);
     *   - the resulting arrays are always fed back into JMH
     *     to prevent dead-code elimination.
     */

    @Setup
    public void setup() {
        primLongs = new long[COUNT];
        for (int i = 0; i < COUNT; i++) {
            primLongs[i] = 12L;
        }

        refLongs = new Long[COUNT];
        for (int i = 0; i < COUNT; i++) {
            refLongs[i] = 12L;
        }
    }

    @GenerateMicroBenchmark
    public long[] prim_baseline() {
        long[] d = new long[COUNT];
        System.arraycopy(primLongs, 0, d, 0, COUNT);
        return d;
    }

    @GenerateMicroBenchmark
    public long[] prim_sort() {
        long[] d = new long[COUNT];
        System.arraycopy(primLongs, 0, d, 0, COUNT);
        Arrays.sort(d);
        return d;
    }

    @GenerateMicroBenchmark
    public Long[] ref_baseline() {
        Long[] d = new Long[COUNT];
        System.arraycopy(refLongs, 0, d, 0, COUNT);
        return d;
    }

    @GenerateMicroBenchmark
    public Long[] ref_sort() {
        Long[] d = new Long[COUNT];
        System.arraycopy(refLongs, 0, d, 0, COUNT);
        Arrays.sort(d);
        return d;
    }

}

...this yields:

Benchmark                  Mode   Samples     Mean   Mean error   Units
o.s.Longs.prim_baseline    avgt        30   19.604        0.327   ns/op
o.s.Longs.prim_sort        avgt        30   51.217        1.873   ns/op
o.s.Longs.ref_baseline     avgt        30   16.935        0.087   ns/op
o.s.Longs.ref_sort         avgt        30   25.199        0.430   ns/op

At this point you may start to wonder why sorting Long[] and sorting long[] take different time. The answer lies in the Arrays.sort() overloads: OpenJDK sorts primitive and reference arrays with different algorithms (reference arrays with TimSort, primitive arrays with dual-pivot quicksort). Here is the effect of choosing another algorithm with -Djava.util.Arrays.useLegacyMergeSort=true, which falls back to merge sort for reference arrays:

Benchmark                  Mode   Samples     Mean   Mean error   Units
o.s.Longs.prim_baseline    avgt        30   19.675        0.291   ns/op
o.s.Longs.prim_sort        avgt        30   50.882        1.550   ns/op
o.s.Longs.ref_baseline     avgt        30   16.742        0.089   ns/op
o.s.Longs.ref_sort         avgt        30   64.207        1.047   ns/op

Hope that helps to explain the difference.

The explanation above barely scratches the surface of sorting performance. The performance is very different when presented with different source data (including available pre-sorted subsequences, their patterns and run lengths, and the size of the data itself).
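To probe that data dependence, one could parameterize the benchmark over input shapes; a hedged sketch using the modern @Benchmark annotation (the pattern names and sizes are arbitrary choices for illustration):

import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@OutputTimeUnit(TimeUnit.MICROSECONDS)
@BenchmarkMode(Mode.AverageTime)
@State(Scope.Thread)
public class SortPatterns {

    @Param({"1000", "100000"})
    int size;

    @Param({"random", "sorted", "reversed"})
    String pattern;

    private long[] source;

    @Setup
    public void setup() {
        source = new Random(42).longs(size).toArray();
        if (!pattern.equals("random")) {
            Arrays.sort(source);
            if (pattern.equals("reversed")) {
                // Reverse in place to get a descending array.
                for (int i = 0, j = size - 1; i < j; i++, j--) {
                    long t = source[i];
                    source[i] = source[j];
                    source[j] = t;
                }
            }
        }
    }

    @Benchmark
    public long[] sort() {
        long[] d = source.clone();   // copy so each invocation sorts fresh data
        Arrays.sort(d);
        return d;                    // fed back into JMH to avoid dead code
    }
}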

Why is iteration faster than Streams in Java?

Cannot reproduce with JMH:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class Streams {
    String[] words;

    @Setup
    public void setup() {
        words = new String[10_000_000];
        Arrays.fill(words, "hello world!");
    }

    @Benchmark
    public long forLoop() {
        long count = 0;
        for (String word : words) {
            if (word.length() > 8)
                count++;
        }
        return count;
    }

    @Benchmark
    public long stream() {
        return Arrays.stream(words).filter(word -> word.length() > 8).count();
    }

    @Benchmark
    public long parallelStream() {
        return Arrays.stream(words).parallel().filter(word -> word.length() > 8).count();
    }
}

Results:

Benchmark                Mode  Cnt    Score   Error  Units
Streams.forLoop          avgt    5    8,408 ± 0,086  ms/op
Streams.parallelStream   avgt    5    3,943 ± 0,353  ms/op
Streams.stream           avgt    5   12,891 ± 0,064  ms/op

As expected, a parallel stream is faster than a for loop, and a for loop is faster than a non-parallel stream.

Use the right tools to measure performance.

Which of these pieces of code is faster in Java?

When you get down to the lowest level (machine code, but I'll use assembly since it mostly maps one-to-one), the difference between an empty loop decrementing to 0 and one incrementing to 50 (for example) is often along the lines of:

      ld  a,50                ld  a,0
loop: dec a             loop: inc a
      jnz loop                cmp a,50
                              jnz loop

That's because the zero flag in most sane CPUs is set by the decrement instruction when you reach zero. The same can't usually be said for the increment instruction when it reaches 50 (since there's nothing special about that value, unlike zero). So you need to compare the register with 50 to set the zero flag.


However, asking which of the two loops:

for(int i = 100000; i > 0; i--) {}
for(int i = 1; i < 100001; i++) {}

is faster (in pretty much any environment, Java or otherwise) is useless, since neither of them does anything useful. The fastest version of both of those loops is no loop at all. I challenge anyone to come up with a faster version than that :-)

They'll only become useful when you start doing some useful work inside the braces and, at that point, the work will dictate which order you should use.

For example if you need to count from 1 to 100,000, you should use the second loop. That's because the advantage of counting down (if any) is likely to be swamped by the fact that you have to evaluate 100000-i inside the loop every time you need to use it. In assembly terms, that would be the difference between:

      ld  b,100000            dsw a
      sub b,a
      dsw b

(dsw is, of course, the infamous "do something with" assembler mnemonic).

Since you'll only be taking the hit of the incrementing loop's extra comparison once per iteration, while you'd be taking the hit for that subtraction at least once per iteration (assuming you'll actually be using i; otherwise there's little need for the loop at all), you should just go with the more natural version.

If you need to count up, count up. If you need to count down, count down.
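A hedged Java rendering of the same point (doSomethingWith is a hypothetical stand-in for whatever real work uses the value):

// Counting up and using i directly: the only per-iteration overhead is the
// loop comparison itself.
for (int i = 1; i <= 100000; i++) {
    doSomethingWith(i);
}

// Counting down "for speed", but then needing the ascending value anyway:
// now there is an extra subtraction in every iteration.
for (int i = 100000; i > 0; i--) {
    doSomethingWith(100001 - i);   // recompute the ascending value
}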

Why does a function run faster the more I call it?

Since your function is very fast (it does basically nothing), what I think you're seeing here is the overhead of the setup and not any speedup in the runtime of the function itself. In this case, before you start running your function you have to construct a range, construct an anonymous function, and call the Enum.each function. For small numbers of repetitions, these factors probably contribute more to the overall runtime of the benchmark than the actual repetitions do.
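The same effect is easy to reproduce in Java; a hedged sketch (the trivialWork method and the repetition counts are made up for illustration) showing a fixed setup cost dominating when the repetition count is small:

public class SetupOverheadDemo {

    static int trivialWork(int x) {
        return x + 1;   // stands in for a function that does almost nothing
    }

    static void measure(int repetitions) {
        long start = System.nanoTime();
        // "Setup": building the input array plays the role of constructing the
        // range and closure in the original example.
        int[] inputs = new int[repetitions];
        for (int i = 0; i < repetitions; i++) {
            inputs[i] = i;
        }
        long sink = 0;
        for (int value : inputs) {
            sink += trivialWork(value);
        }
        long elapsed = System.nanoTime() - start;
        System.out.println(repetitions + " reps: "
                + elapsed / (double) repetitions + " ns per call (sink=" + sink + ")");
    }

    public static void main(String[] args) {
        // The apparent per-call cost shrinks as repetitions grow, because the
        // fixed setup cost is amortized over more calls (and the JIT kicks in).
        for (int reps : new int[]{10, 1_000, 100_000, 10_000_000}) {
            measure(reps);
        }
    }
}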


