Unroll Loop, When to Use

When, if ever, is loop unrolling still useful?

Loop unrolling makes sense if you can break dependency chains. This gives an out-of-order or superscalar CPU the opportunity to schedule instructions better and thus run faster.

A simple example:

for (int i = 0; i < n; i++)
{
    sum += data[i];
}

Here the dependency chain of the arguments is very short. If you get a stall because of a cache miss on the data array, the CPU cannot do anything but wait.

On the other hand, this code:

for (int i = 0; i < n-3; i += 4)  // note the n-3 bound for starting i + 0..3
{
    sum1 += data[i+0];
    sum2 += data[i+1];
    sum3 += data[i+2];
    sum4 += data[i+3];
}
sum = sum1 + sum2 + sum3 + sum4;
// if n % 4 != 0, handle the final 0..3 elements with a rolled-up loop or whatever

could run faster. If you get a cache miss or other stall in one calculation, there are still three other dependency chains that don't depend on the stall. An out-of-order CPU can execute these in parallel.

(See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an in-depth look at how register renaming helps CPUs find that parallelism, and at the details of an FP dot product on modern x86-64 CPUs with their throughput vs. latency characteristics for pipelined floating-point SIMD FMA ALUs. Hiding the latency of FP addition or FMA is a major benefit of multiple accumulators, since FP latencies are longer than integer ones but SIMD throughput is often similar.)
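
A minimal sketch of the multiple-accumulator idea applied to a floating-point sum (the function name and the unroll factor of 4 are illustrative assumptions, not from the linked answer):

#include <cstddef>

// Four independent accumulators break the single serial dependency chain
// through `sum`, so the CPU can overlap several FP-add latencies at once.
// (Without -ffast-math, compilers won't do this reassociation for you.)
float sum_fp(const float* data, std::size_t n) {
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    std::size_t i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += data[i + 0];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)      // leftover elements when n % 4 != 0
        sum += data[i];
    return sum;
}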

Unroll Loop, when to use

What is Unroll-the-loop?

See this source on the Unroll-the-loop technique:

This optimisation technique is used to optimise a repeated alternation of the form (expr1|expr2|...)*. These expressions are not uncommon, and the use of another repetition inside an alternation may also lead to super-linear matching. Super-linear matching arises from the nondeterministic expression (a*)*.

The unrolling-the-loop technique is based on the hypothesis that, in most cases, you know which alternative in a repeated alternation is the most usual one and which one is exceptional. We will call the first one the normal case and the second one the special case. The general syntax of the unrolling-the-loop technique can then be written as:

normal* ( special normal* )*

So, this is an optimization technique where alternations are turned into linearly matching atoms.

This makes these unrolled patterns very efficient since they involve less backtracking.

Current Scenario

Your MINISTÉRIO[\s\S]*?PÁG is a non-unrolled pattern, while MINISTÉRIO[^P]*(?:P(?!ÁG)[^P]*)*PÁG is unrolled. See the demos (both saved with the PCRE option so the number of steps is shown in the box above; regex performance differs across engines, but this gives a concrete measure of the difference). Add more text after the match: the first regex will require more steps to finish, while the second will only take more steps after you add a P. So, in texts where the character used in the known part is rare, unrolled patterns are very efficient.

See the Difference between .*?, .* and [^"]*+ quantifiers section in my answer to understand how lazy matching works (your [\s\S]*? is the same as .*? with a DOTALL modifier in languages that allow . to match a newline, too).
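
Both forms can also be tried outside PCRE; here is a sketch using C++ std::regex (its default ECMAScript grammar supports the (?!...) lookahead, but it does not report step counts, so this only demonstrates that the two patterns find the same match; the sample text is made up):

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string text = "MINISTÉRIO ... lots of intervening text ... PÁG";

    std::regex lazy(R"(MINISTÉRIO[\s\S]*?PÁG)");
    std::regex unrolled(R"(MINISTÉRIO[^P]*(?:P(?!ÁG)[^P]*)*PÁG)");

    std::smatch m1, m2;
    if (std::regex_search(text, m1, lazy) &&
        std::regex_search(text, m2, unrolled)) {
        // Both patterns match the same span; they differ only in how
        // much backtracking the engine does along the way.
        std::cout << (m1[0] == m2[0] ? "same match" : "different") << '\n';
    }
}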

Performance Question

Is lazy matching always slow and inefficient? Not always. With very short strings (1-10 symbols), lazy dot matching is usually better. With long inputs, where the leading delimiter may be present but the trailing one missing, lazy matching can cause excessive backtracking and even timeouts.

Use unrolled patterns when you have arbitrary inputs of potentially long length and where there may be no match.

Use lazy matching when your input is controlled and you know there will always be a match: a known set of log formats, or the like.

Bonus: Commonly Unrolled patterns

  1. Tempered greedy tokens

  2. Regular string literals ("String\u0020:\"text\""): "[^"\\]*(?:\\.[^"\\]*)*" (see the usage sketch after this list)

  3. Multiline comment regex (/* Comments */): /\*[^*]*\*+(?:[^/*][^*]*\*+)*/

  4. @<...>@ comment regex: @<[^>]*(?:>[^@]*)*@
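
For illustration, pattern #2 can be used directly from C++ (a sketch assuming std::regex's default ECMAScript grammar, which supports everything this pattern needs):

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Pattern #2 above: normal case [^"\\] (any char except quote or
    // backslash), special case \\. (a backslash escape).
    std::regex string_literal(R"rx("[^"\\]*(?:\\.[^"\\]*)*")rx");

    std::string code = R"src(printf("a \"quoted\" word");)src";
    std::smatch m;
    if (std::regex_search(code, m, string_literal))
        std::cout << m[0] << '\n';   // prints "a \"quoted\" word"
}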

Is it beneficial anymore to unroll loops in C++ over fixed-sized arrays?

The compiler is intelligent enough to automatically unroll loops for you, and you should trust that it's going to be able to do those (and many other) optimizations.

Generally speaking, the compiler is better at making micro-optimizations, while the programmer is better at making macro-optimizations.

Micro-optimizations (what the compiler CAN do):

  • Unroll loops (see the sketch after this list)
  • Inline functions automatically
  • Apply tail-call optimization to speed up tail-recursive functions (many end up being as fast as the equivalent loop)
  • Elide copies and moves: if you return something by value, in many cases the compiler can get rid of the copy or move entirely
  • Use vectorized floating-point instructions (this one still requires you to help the compiler sometimes, though)
  • Eliminate unnecessary or redundant if statements (for example, when you check something and then call a member function that also checks it, the compiler will eliminate the redundant check when it inlines the member function)
  • Inline lambdas passed to other functions (it only does this if you don't wrap the lambda in std::function - it can't inline std::function)
  • Store local variables and even entire structs in registers, instead of in RAM or cache
  • Lots of math optimizations
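
For instance, a loop like the following is simple enough that a modern compiler at -O2/-O3 will typically unroll and/or vectorize it on its own; you can verify this by inspecting the generated assembly. (scale() is a hypothetical example, not from the answer.)

#include <cstddef>

// With optimizations enabled, compilers commonly auto-vectorize and
// unroll this loop without any manual help.
void scale(float* dst, const float* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * 2.0f;
}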

Macro-optimizations (what the compiler CAN'T do):

These are the things the programmer still has to pay attention to.

  • Change the way data is stored. If something doesn't need to be a pointer, store it on the stack!
  • Change the algorithm used to calculate something. Algorithm design is still important!
  • Other stuff

Will compiler unroll for loop when iterating a const container?

What is the type of {1, 2, 3, 4}?

A std::initializer_list<int> will be constructed from that braced initialiser, and that is what is being iterated. You even need to include <initializer_list> for this to work.
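
For reference, a minimal version of the kind of loop in question (a sketch; the braced list is iterated directly in a range-based for):

#include <initializer_list>
#include <iostream>

int main() {
    // {1, 2, 3, 4} creates a std::initializer_list<int>,
    // which the range-based for loop then iterates over.
    for (int x : {1, 2, 3, 4})
        std::cout << x << '\n';
}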

Will compiler unroll the loop?

The language doesn't guarantee loop unrolling. You can find out whether a particular compiler unrolls a particular loop with particular options for a particular target CPU by compiling and inspecting the produced assembly.

That said, the number of iterations is known at compile time, and therefore it is possible for the compiler to unroll the entire loop.



Suppose we are using -O2

For what it's worth, -O2 does not enable -funroll-loops. Before you add that option, read its documentation:

-funroll-loops
    Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop. -funroll-loops implies -frerun-cse-after-loop. This option makes code larger, and may or may not make it run faster.

In this example, Clang did unroll the loop: https://godbolt.org/z/enKzMh while GCC did not: https://godbolt.org/z/ocfor8

Why do neither V8 nor spidermonkey seem to unroll static loops?

(V8 developer here.)

TL;DR: because it's rarely worth it for real-world code.

Loop unrolling, like other optimizations that increase code size (such as inlining), is a double-edged sword. Yes, it can help; in particular it often helps for tiny toy examples like the one posted here. But it can also hurt performance, most obviously because it increases the amount of work that the compiler has to do (and hence the time it takes to do that work), but also through secondary effects like bigger code benefitting less from the CPU's caching efforts.

V8's optimizing compiler actually does like to unroll the first iteration of loops. Also, as it happens, we currently have an ongoing project to unroll more loops; the current status is that it sometimes helps and sometimes hurts, so we're still fine-tuning the heuristics for when it should kick in, and when it shouldn't. This difficulty also indicates that for real-world JavaScript, the benefits will typically be quite small.

It doesn't matter whether it's a "standard for-loop" or not; any loop can be unrolled in theory. It just happens to be the case that aside from microbenchmarks, loop unrolling tends to make little difference: there isn't all that much overhead to just doing another iteration, so if the loop body does more than counter++, there's not a lot to be gained from avoiding the per-iteration overhead. And, fwiw, this per-iteration overhead is not what your test is measuring: the repeated increments all get folded, so what you're really comparing here is 100M iterations of counter += 1 against 10M iterations of counter += 10.

So this is one of many examples of misleading microbenchmarks trying to trick us into drawing the wrong conclusions ;-)

Unrolling For Loop in C

I'd say simply add the lines for i+1. But you have to be sure they are in the right order, so:

for (i = 0; i < 100; i += 2) {
    x[i]   = y[i] + z[i];
    z[i]   = y[i] + a[i];
    z[i+1] = y[i] * a[i];

    // next iteration
    if (i + 1 < 100) {
        x[i+1] = y[i+1] + z[i+1];
        z[i+1] = y[i+1] + a[i+1];
        z[i+2] = y[i+1] * a[i+1];
    }
}

EDIT

To make it safe for all upper bounds (not only even ones), you have to add an if inside the loop.

As Adrian Mole mentions, it could be better to check the upper bound up front, or to choose the array size conveniently, as sketched below.
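
A sketch of that version (a hypothetical helper; n is a runtime bound that may be odd, unlike the fixed 100 above):

void add_mul(float x[], float y[], float z[], float a[], int n)
{
    // Unroll by 2 up to the last even index; the bound check runs once per pair.
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        x[i]   = y[i] + z[i];
        z[i]   = y[i] + a[i];
        z[i+1] = y[i] * a[i];

        x[i+1] = y[i+1] + z[i+1];
        z[i+1] = y[i+1] + a[i+1];
        z[i+2] = y[i+1] * a[i+1];
    }
    // Leftover iteration when n is odd.
    if (i < n) {
        x[i]   = y[i] + z[i];
        z[i]   = y[i] + a[i];
        z[i+1] = y[i] * a[i];
    }
    // Note: like the original loop, this writes one element past index n-1
    // into z, so z needs room for n+1 elements.
}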

SSE Intrinsics and loop unrolling

The answer is the first block:

__m128 ai_v  = _mm_loadu_ps(&a[i]);
__m128 two_v = _mm_set1_ps(2);
__m128 ai2_v = _mm_mul_ps(ai_v, two_v);
_mm_storeu_ps(&b[i], ai2_v);

It already processes four elements at a time.

Here is the full program with the equivalent section of code commented out:

#include <iostream>
#include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_set1_ps, _mm_mul_ps, _mm_storeu_ps

int main()
{
    int i{0};
    float a[10] = {1,2,3,4,5,6,7,8,9,10};
    float b[10] = {0,0,0,0,0,0,0,0,0,0};

    int n = 10;
    int unroll = (n/4)*4;
    for (i = 0; i < unroll; i += 4) {
        //b[i]   = a[i]*2;
        //b[i+1] = a[i+1]*2;
        //b[i+2] = a[i+2]*2;
        //b[i+3] = a[i+3]*2;
        __m128 ai_v  = _mm_loadu_ps(&a[i]);
        __m128 two_v = _mm_set1_ps(2);
        __m128 ai2_v = _mm_mul_ps(ai_v, two_v);
        _mm_storeu_ps(&b[i], ai2_v);
    }

    // scalar cleanup for the remaining n % 4 elements
    for (; i < n; i++) {
        b[i] = a[i]*2;
    }

    for (auto v : a) { std::cout << v << "\t"; }
    std::cout << "\n";
    for (auto v : b) { std::cout << v << "\t"; }
    std::cout << "\n";

    return 0;
}

As for efficiency: it seems that the assembly on my system generates movups instructions, whereas the hand-rolled code could be made to use movaps, which should be faster.

I used the following program to do some benchmarks:

#include <iostream>
#include <xmmintrin.h>  // SSE intrinsics
//#define NO_UNROLL
//#define UNROLL
//#define SSE_UNROLL
#define SSE_UNROLL_ALIGNED

int main()
{
    const size_t array_size = 100003;
#ifdef SSE_UNROLL_ALIGNED
    __declspec(align(16)) int i{0};
    __declspec(align(16)) float a[array_size] = {1,2,3,4,5,6,7,8,9,10};
    __declspec(align(16)) float b[array_size] = {0,0,0,0,0,0,0,0,0,0};
#endif
#ifndef SSE_UNROLL_ALIGNED
    int i{0};
    float a[array_size] = {1,2,3,4,5,6,7,8,9,10};
    float b[array_size] = {0,0,0,0,0,0,0,0,0,0};
#endif

    int n = array_size;
    int unroll = (n/4)*4;

    for (size_t j{0}; j < 100000; ++j) {
#ifdef NO_UNROLL
        for (i = 0; i < n; i++) {
            b[i] = a[i]*2;
        }
#endif
#ifdef UNROLL
        for (i = 0; i < unroll; i += 4) {
            b[i]   = a[i]*2;
            b[i+1] = a[i+1]*2;
            b[i+2] = a[i+2]*2;
            b[i+3] = a[i+3]*2;
        }
#endif
#ifdef SSE_UNROLL
        for (i = 0; i < unroll; i += 4) {
            __m128 ai_v  = _mm_loadu_ps(&a[i]);
            __m128 two_v = _mm_set1_ps(2);
            __m128 ai2_v = _mm_mul_ps(ai_v, two_v);
            _mm_storeu_ps(&b[i], ai2_v);
        }
#endif
#ifdef SSE_UNROLL_ALIGNED
        for (i = 0; i < unroll; i += 4) {
            __m128 ai_v  = _mm_load_ps(&a[i]);
            __m128 two_v = _mm_set1_ps(2);
            __m128 ai2_v = _mm_mul_ps(ai_v, two_v);
            _mm_store_ps(&b[i], ai2_v);
        }
#endif
#ifndef NO_UNROLL
        for (; i < n; i++) {
            b[i] = a[i]*2;
        }
#endif
    }

    //for (auto v : a) { std::cout << v << "\t"; }
    //std::cout << "\n";
    //for (auto v : b) { std::cout << v << "\t"; }
    //std::cout << "\n";

    return 0;
}

I got the following results (x86):

  • NO_UNROLL: 0.994 seconds, no SSE chosen by compiler
  • UNROLL: 3.511 seconds, uses movups
  • SSE_UNROLL: 3.315 seconds, uses movups
  • SSE_UNROLL_ALIGNED: 3.276 seconds, uses movaps

So it is clear that unrolling the loop has not helped in this case. Even ensuring that we use the more efficient movaps doesn't help much.

But I got an even stranger result when compiling to 64 bit (x64):

  • NO_UNROLL: 1.138 seconds, no SSE chosen by compiler
  • UNROLL: 1.409 seconds, no SSE chosen by compiler
  • SSE_UNROLL: 1.420 seconds, still no SSE chosen by compiler!
  • SSE_UNROLL_ALIGNED: 1.476 seconds, still no SSE chosen by compiler!

It seems MSVC sees through the hand-unrolled code and generates better assembly regardless, albeit still slower than if we had not attempted any hand optimization at all.

Loop unrolling & optimization

In the high-level view of a language, you're not going to see the optimization. The speed enhancement comes from what the compiler does with what you have.

In the first case, it's something like:

LOCATION_FLAG;
DO_SOMETHING;
TEST FOR LOOP COMPLETION; // jumps to LOCATION_FLAG if false

In the second it's something like:

LOCATION_FLAG;
DO_SOMETHING;
DO_SOMETHING;
DO_SOMETHING;
TEST FOR LOOP COMPLETION; // jumps to LOCATION_FLAG if false

You can see that in the latter case the overhead of testing and jumping is only one instruction per three operations; in the first it's one instruction per operation, so it happens a lot more often.

Therefore, if you have invariants you can rely on (say, an array whose size is a multiple of 3, to use your example), then it can be more efficient to unroll loops, because the underlying assembly is written more directly.
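
As a concrete sketch of that example (process(), data, and n are hypothetical stand-ins for DO_SOMETHING above):

#include <cstdio>

void process(int v) { std::printf("%d\n", v); }  // stand-in for DO_SOMETHING

int main() {
    int data[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    int n = 9;  // invariant: n % 3 == 0, as in the example

    // Rolled: one test-and-jump per element processed.
    for (int i = 0; i < n; ++i)
        process(data[i]);

    // Unrolled by 3: one test-and-jump per three elements processed.
    for (int i = 0; i < n; i += 3) {
        process(data[i]);
        process(data[i+1]);
        process(data[i+2]);
    }
}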

How do optimizing compilers decide when and how much to unroll a loop?

When a compiler performs a loop-unrolling optimization, how does it determine the factor by which to unroll the loop, or whether to unroll the whole loop or not?

Among other things:

  • stack consumption and locality
  • instruction counts
  • the ability to make/propagate optimizations based on the unrolled and inlined program
  • whether the loop size is fixed, or expected to be in a certain range
  • profile inputs (if applicable)
  • operations which may be removed from the loop body, etc.

Since this is a space-performance tradeoff, on average how effective is this optimization technique in making the program perform better?

It depends largely on the input (your program). It can be slower (not typical), or it can be several times faster. Writing a program that runs optimally and also enables the optimizer to do its job is a learned skill.

Also, under what conditions is it recommended to use this technique (i.e. certain operations or calculations)?

Generally: a large number of iterations over a very small body, particularly one that is branchless and has good data locality.

If you want to know whether the option helps your app, profile.

If you need more than that, you should reserve some time to learn how to write optimal programs, since the subject is quite complex.


