Unrolled Loop Works, for Loop Does Not Work

Unrolled Loop works, for loop does not work

You're capturing the variable i in your lambda expression. When that lambda expression is executed, it will use the "current" value of i - which will always be 9. You want to capture a copy of the variable... which you can do be introducing a new variable in the loop:

for (int i = 0; i < teamButtons.Length; i++)
{
    int index = i;
    teamButtons[i].onClick.AddListener(() => SetCard(c[index]));
}

Unrolling For Loop in C

I'd say simply add the lines for i+1. But you have to be sure they are in the right order, so:

 for(i=0; i<100; i+=2){
    x[i] = y[i] + z[i];
    z[i] = y[i] + a[i];
    z[i+1] = y[i] * a[i];

    // next iteration
    if (i+1 < 100) {
          x[i+1] = y[i+1] + z[i+1];
          z[i+1] = y[i+1] + a[i+1];
          z[i+2] = y[i+1] * a[i+1]; 
    }
 }

EDIT

to make it safe for all upper bounds (not only even) you have to add an if in the loop

As Adrian Mole mentions, it could be better to check the upper bound in the first place or set the size of the array conveniently

Java: manually-unrolled loop is still faster than the original loop. Why?

TL;DR The main reason of performance difference here is not related to loop unrolling. It is rather the type speculation and the inline caches.

Unrolling strategies

In fact, in HotSpot terminology, such loops are treated as counted, and in certain cases JVM can unroll them. Not in your case though.

HotSpot has two loop unrolling strategies: 1) unroll maximally, i.e. remove the loop altogether; or 2) glue several consecutive iterations together.

Maximal unrolling can be done, only if the exact number of iterations is known.

  if (!cl->has_exact_trip_count()) {
    // Trip count is not exact.
    return false;
  }

In your case, however, the function may return early after the first iteration.

Partial unrolling could be probably applied, but the following condition breaks unrolling:

  // Don't unroll if the next round of unrolling would push us
  // over the expected trip count of the loop.  One is subtracted
  // from the expected trip count because the pre-loop normally
  // executes 1 iteration.
  if (UnrollLimitForProfileCheck > 0 &&
      cl->profile_trip_cnt() != COUNT_UNKNOWN &&
      future_unroll_ct        > UnrollLimitForProfileCheck &&
      (float)future_unroll_ct > cl->profile_trip_cnt() - 1.0) {
    return false;
  }

Since in your case the expected trip count is less than 2, HotSpot assumes it's not worthy to unroll even two iterations. Note that the first iteration is extracted into pre-loop anyway (loop peeling optimization), so unrolling is indeed not very benificial here.

Type speculation

In your unrolled version, there are two different invokeinterface bytecodes. These sites have two distinct type profiles. The first receiver is always Filter1, and the second receiver is always Filter2. So, you basically have two monomorphic call sites, and HotSpot can perfectly inline both calls - so called "inline cache" which has 100% hit ratio in this case.

With the loop, there is just one invokeinterface bytecode, and only one type profile is collected. HotSpot JVM sees that filters[j].isOK() is called 86% times with Filter1 receiver and 14% times with Filter2 receiver. This will be a bimorphic call. Fortunately, HotSpot can speculatively inline bimorphic calls, too. It inlines both targets with a conditional branch. However, in this case the hit ratio will be at most 86%, and the performance will suffer from the corresponding mispredicted branches at the architecture level.

Things will be even worse, if you have 3 or more different filters. In this case isOK() will be a megamorphic call which HotSpot cannot inline at all. So, the compiled code will contain a true interface call which has a larger performance impact.

More about speculative inlining in the article The Black Magic of (Java) Method Dispatch.

Conclusion

In order to inline virtual/interface calls, HotSpot JVM collects type profiles per invoke bytecode. If there is a virtual call in a loop, there will be just one type profile for the call, no matter if the loop is unrolled or not.

To get the best from the virtual call optimizations, you'd need to manually split the loop, primarily for the purpose of splitting type profiles. HotSpot cannot do this automatically so far.

Manual loop unrolling within a C++ Introsort Runs Incorrectly

This example code works, about 11% faster in 64 bit mode (more registers). The compiler optimized the compare and conditional copy of pj[...] via tmp to use a register (and it cycled through registers to allow some overlap).

int * Partition(int *plo, int *phi)
{
    int *pi = plo;
    int *pj = plo;
    int pvt = *phi;
    int tmp;
    int *ph8 = phi - 8;
    for (pj = plo; pj < ph8; pj += 8)
    {
        if (pj[0] < pvt)
        {
            tmp = pj[0];
            pj[0] = *pi;
            *pi = tmp;
            ++pi;
        }
        if (pj[1] < pvt)
        {
            tmp = pj[1];
            pj[1] = *pi;
            *pi = tmp;
            ++pi;
        }
        if (pj[2] < pvt)
        {
            tmp = pj[2];
            pj[2] = *pi;
            *pi = tmp;
            ++pi;
        }
        if (pj[3] < pvt)
        {
            tmp = pj[3];
            pj[3] = *pi;
            *pi = tmp;
            ++pi;
        }
        if (pj[4] < pvt)
        {
            tmp = pj[4];
            pj[4] = *pi;
            *pi = tmp;
            ++pi;
        }
        if (pj[5] < pvt)
        {
            tmp = pj[5];
            pj[5] = *pi;
            *pi = tmp;
            ++pi;
        }
        if (pj[6] < pvt)
        {
            tmp = pj[6];
            pj[6] = *pi;
            *pi = tmp;
            ++pi;
        }
        if (pj[7] < pvt)
        {
            tmp = pj[7];
            pj[7] = *pi;
            *pi = tmp;
            ++pi;
        }
    }
    for (; pj < phi; ++pj)
    {
        if (*pj < pvt)
        {
            tmp = *pj;
            *pj = *pi;
            *pi = tmp;
            ++pi;
        }
    }
    tmp  = *phi;
    *phi = *pi;
    *pi  = tmp;
    return pi;
}

void QuickSort(int *plo, int *phi)
{
int *p;
    if (plo < phi)
    {
        p = Partition(plo, phi);
        QuickSort(plo, p-1);
        QuickSort(p+1, phi);
    }
}

How does a pipelined processor handle excessively unrolled loops?

Pete's right about the code missing the last 3 elements. Regarding your question about the pipeline...

Pipelined processors have a heavy branch penalty, they do not like excessively tight loops. At the assembly level, a pipelined processor will have access to repeat single and repeat block instructions.

In a high level language, unrolling can be performed by the compiler or by the developer. In the code you listed, the developer code performs 4 sums per each cycle.

Unrolled Loop Works, for Loop Does Not Work