When, if ever, is loop unrolling still useful?
Loop unrolling makes sense if you can break dependency chains. This gives an out-of-order or superscalar CPU the possibility to schedule things better and thus run faster.
A simple example:
for (int i=0; i<n; i++)
{
sum += data[i];
}
Here the dependency chain of the arguments is very short. If you get a stall because of a cache miss on the data array, the CPU cannot do anything but wait. On the other hand, this code:
for (int i=0; i<n-3; i+=4) // note the n-3 bound for starting i + 0..3
{
sum1 += data[i+0];
sum2 += data[i+1];
sum3 += data[i+2];
sum4 += data[i+3];
}
sum = sum1 + sum2 + sum3 + sum4;
// if n%4 != 0, handle final 0..3 elements with a rolled up loop or whatever
could run faster. If you get a cache miss or other stall in one calculation, there are still three other dependency chains that don't depend on the stall, and an out-of-order CPU can execute them in parallel. (See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an in-depth look at how register renaming helps CPUs find that parallelism, and at the details of an FP dot product on modern x86-64 CPUs, with their throughput vs. latency characteristics for pipelined floating-point SIMD FMA ALUs. Hiding the latency of FP addition or FMA is a major benefit of multiple accumulators, since FP latencies are longer than integer ones while SIMD throughput is often similar.)
Unroll Loop, when to use
What is Unroll-the-loop
See this Unroll the loop technique source: this is an optimization technique where alternations are turned into linearly matching atoms. This optimization technique is used to optimize repeated alternations of the form
(expr1|expr2|...)*
. These expressions are not uncommon, and the use of another repetition inside an alternation may also lead to a super-linear match. Super-linear matches arise from nondeterministic expressions like (a*)*
. The unrolling-the-loop technique is based on the hypothesis that in most cases you know, in a repeated alternation, which case is the most usual and which one is exceptional. We will call the first one the normal case and the second one the special case. The general syntax of the unrolling-the-loop technique can then be written as:
normal* ( special normal* )*
This makes these unrolled patterns very efficient since they involve less backtracking.
Current Scenario
Your MINISTÉRIO[\s\S]*?PÁG is a non-unrolled pattern, while MINISTÉRIO[^P]*(?:P(?!ÁG)[^P]*)*PÁG is unrolled. See the demos (both saved with the PCRE option that shows the number of steps in the box above; regex performance differs across regex engines, but this will show you the performance difference exactly). Add more text after the matching text: the first regex will start requiring more steps to finish, while the second one will only show more steps after you add a P.
So, in texts where the character you used in the known part is not common, unrolled patterns are very efficient. See the Difference between .*?, .* and [^"]*+ quantifiers section in my answer to understand how lazy matching works (your [\s\S]*? is the same as .*? with a DOTALL modifier in languages that allow a . to match a newline, too).
Performance Question
Is the lazy matching pattern always slow and inefficient? Not always. With very short strings (1-10 symbols), lazy dot matching is usually better. With long inputs, where there can be a leading delimiter but no trailing one, lazy matching may lead to excessive backtracking and time-out issues. Use unrolled patterns when you have arbitrary inputs of potentially long length and where there may be no match.
Use lazy matching when your input is controlled and you know there will always be a match: some known set of log formats, or the like.
Bonus: Commonly Unrolled patterns
- Tempered greedy tokens
- Regular string literals ("String\u0020:\"text\""): "[^"\\]*(?:\\.[^"\\]*)*"
- Multiline comment regex (/* Comments */): /\*[^*]*\*+(?:[^/*][^*]*\*+)*/
- @<...>@ comment regex: @<[^>]*(?:>[^@]*)*@
Is it beneficial anymore to unroll loops in C++ over fixed-sized arrays?
The compiler is intelligent enough to automatically unroll loops for you, and you should trust that it's going to be able to do those (and many other) optimizations.
Generally speaking, the compiler is better at making micro-optimizations, while the programmer is better at making macro-optimizations.
Micro-optimizations (what the compiler CAN do):
- Unroll loops
- Inline functions automatically
- Apply tail call optimization to speed up tail-recursive functions (many end up being as fast as the equivalent loop)
- Elide copies and moves: if you return something by value, in many cases the compiler can get rid of the copy or move entirely
- Use vectorized floating point instructions (this one still requires you to help the compiler sometimes, though)
- Eliminate unnecessary or redundant if statements (for example, when you check something and then call a member function that also checks it, once it inlines the member function it will eliminate the unnecessary check)
- Inline lambdas passed to other functions (it only does this if you don't wrap them in std::function; it can't inline std::function)
- Store local variables and even entire structs in registers, instead of using RAM or cache
- Lots o' math optimizations
Macro-optimizations (what the compiler CAN'T do):
These are the things the programmer still has to pay attention to.
- Change the way data is stored. If something doesn't need to be a pointer, store it on the stack!
- Change the algorithm used to calculate something. Algorithm design is still important!
- Other stuff
Will compiler unroll for loop when iterating a const container?
What is the type of {1, 2, 3, 4}?
A std::initializer_list will be constructed from that initialiser, and that is what is being iterated. You even need to include <initializer_list> for this to work.
Will the compiler unroll the loop?
The language doesn't guarantee loop unrolling. You can find out whether a particular compiler unrolls a particular loop with particular options for a particular target CPU by compiling and inspecting the produced assembly.
That said, the number of iterations is known at compile time, and therefore it is possible for the compiler to unroll the entire loop.
Suppose we are using -O2
For what it's worth, -O2 does not enable -funroll-loops. Before you add that option, read its documentation:
-funroll-loops
Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop. -funroll-loops implies -frerun-cse-after-loop. This option makes code larger, and may or may not make it run faster.
In this example, Clang did unroll the loop: https://godbolt.org/z/enKzMh while GCC did not: https://godbolt.org/z/ocfor8
Why do neither V8 nor spidermonkey seem to unroll static loops?
(V8 developer here.)
TL;DR: because it's rarely worth it for real-world code.
Loop unrolling, like other optimizations that increase code size (such as inlining), is a double-edged sword. Yes, it can help; in particular it often helps for tiny toy examples like the one posted here. But it can also hurt performance, most obviously because it increases the amount of work that the compiler has to do (and hence the time it takes to do that work), but also through secondary effects like bigger code benefitting less from the CPU's caching efforts.
V8's optimizing compiler actually does like to unroll the first iteration of loops. Also, as it happens, we currently have an ongoing project to unroll more loops; the current status is that it sometimes helps and sometimes hurts, so we're still fine-tuning the heuristics for when it should kick in, and when it shouldn't. This difficulty also indicates that for real-world JavaScript, the benefits will typically be quite small.
It doesn't matter whether it's a "standard for-loop" or not; any loop can be unrolled in theory. It just happens to be the case that aside from microbenchmarks, loop unrolling tends to make little difference: there isn't all that much overhead to just doing another iteration, so if the loop body does more than counter++, there's not a lot to be gained from avoiding the per-iteration overhead. And, fwiw, this per-iteration overhead is not what your test is measuring: the repeated increments all get folded, so what you're really comparing here is 100M iterations of counter += 1 against 10M iterations of counter += 10.
So this is one of many examples of misleading microbenchmarks trying to trick us into drawing the wrong conclusions ;-)
Unrolling For Loop in C
I'd say simply add the lines for i+1. But you have to be sure they are in the right order, so:
for (i = 0; i < 100; i += 2) {
    x[i] = y[i] + z[i];
    z[i] = y[i] + a[i];
    z[i+1] = y[i] * a[i];
    // next iteration
    if (i + 1 < 100) {
        x[i+1] = y[i+1] + z[i+1];
        z[i+1] = y[i+1] + a[i+1];
        z[i+2] = y[i+1] * a[i+1];
    }
}
EDIT: to make it safe for all upper bounds (not only even ones), you have to add an if in the loop.
As Adrian Mole mentions, it could be better to check the upper bound in the first place, or to set the size of the array conveniently.
SSE Intrinsics and loop unrolling
The answer is the first block:
__m128 ai_v = _mm_loadu_ps(&a[i]);
__m128 two_v = _mm_set1_ps(2);
__m128 ai2_v = _mm_mul_ps(ai_v,two_v);
_mm_storeu_ps(&b[i],ai2_v);
It already processes four elements at a time. Here is the full program, with the equivalent scalar section of code commented out:
#include <iostream>
#include <xmmintrin.h> // needed for the SSE intrinsics

int main()
{
    int i{0};
    float a[10] = {1,2,3,4,5,6,7,8,9,10};
    float b[10] = {0,0,0,0,0,0,0,0,0,0};
    int n = 10;
    int unroll = (n/4)*4;
    for (i = 0; i < unroll; i += 4) {
        //b[i] = a[i]*2;
        //b[i+1] = a[i+1]*2;
        //b[i+2] = a[i+2]*2;
        //b[i+3] = a[i+3]*2;
        __m128 ai_v = _mm_loadu_ps(&a[i]);
        __m128 two_v = _mm_set1_ps(2);
        __m128 ai2_v = _mm_mul_ps(ai_v, two_v);
        _mm_storeu_ps(&b[i], ai2_v);
    }
    for (; i < n; i++) {
        b[i] = a[i]*2;
    }
    for (auto i : a) { std::cout << i << "\t"; }
    std::cout << "\n";
    for (auto i : b) { std::cout << i << "\t"; }
    std::cout << "\n";
    return 0;
}
As for efficiency: it seems that the assembly on my system generates movups instructions, whereas the hand-rolled code could be made to use movaps, which should be faster. I used the following program to do some benchmarks:
#include <iostream>
#include <xmmintrin.h> // needed for the SSE intrinsics

//#define NO_UNROLL
//#define UNROLL
//#define SSE_UNROLL
#define SSE_UNROLL_ALIGNED

int main()
{
    const size_t array_size = 100003;
#ifdef SSE_UNROLL_ALIGNED
    __declspec(align(16)) int i{0};
    __declspec(align(16)) float a[array_size] = {1,2,3,4,5,6,7,8,9,10};
    __declspec(align(16)) float b[array_size] = {0,0,0,0,0,0,0,0,0,0};
#endif
#ifndef SSE_UNROLL_ALIGNED
    int i{0};
    float a[array_size] = {1,2,3,4,5,6,7,8,9,10};
    float b[array_size] = {0,0,0,0,0,0,0,0,0,0};
#endif
    int n = array_size;
    int unroll = (n/4)*4;
    for (size_t j{0}; j < 100000; ++j) {
#ifdef NO_UNROLL
        for (i = 0; i < n; i++) {
            b[i] = a[i]*2;
        }
#endif
#ifdef UNROLL
        for (i = 0; i < unroll; i += 4) {
            b[i] = a[i]*2;
            b[i+1] = a[i+1]*2;
            b[i+2] = a[i+2]*2;
            b[i+3] = a[i+3]*2;
        }
#endif
#ifdef SSE_UNROLL
        for (i = 0; i < unroll; i += 4) {
            __m128 ai_v = _mm_loadu_ps(&a[i]);
            __m128 two_v = _mm_set1_ps(2);
            __m128 ai2_v = _mm_mul_ps(ai_v, two_v);
            _mm_storeu_ps(&b[i], ai2_v);
        }
#endif
#ifdef SSE_UNROLL_ALIGNED
        for (i = 0; i < unroll; i += 4) {
            __m128 ai_v = _mm_load_ps(&a[i]);
            __m128 two_v = _mm_set1_ps(2);
            __m128 ai2_v = _mm_mul_ps(ai_v, two_v);
            _mm_store_ps(&b[i], ai2_v);
        }
#endif
#ifndef NO_UNROLL
        for (; i < n; i++) {
            b[i] = a[i]*2;
        }
#endif
    }
    //for (auto i : a) { std::cout << i << "\t"; }
    //std::cout << "\n";
    //for (auto i : b) { std::cout << i << "\t"; }
    //std::cout << "\n";
    return 0;
}
I got the following results (x86):
- NO_UNROLL: 0.994 seconds, no SSE chosen by compiler
- UNROLL: 3.511 seconds, uses movups
- SSE_UNROLL: 3.315 seconds, uses movups
- SSE_UNROLL_ALIGNED: 3.276 seconds, uses movaps
So movaps doesn't help much. But I got an even stranger result when compiling to 64 bit (x64):
- NO_UNROLL: 1.138 seconds, no SSE chosen by compiler
- UNROLL: 1.409 seconds, no SSE chosen by compiler
- SSE_UNROLL: 1.420 seconds, still no SSE chosen by compiler!
- SSE_UNROLL_ALIGNED: 1.476 seconds, still no SSE chosen by compiler!
Loop unrolling & optimization
In the high-level view of a language, you're not going to see the optimization. The speed enhancement comes from what the compiler does with what you have.
In the first case, it's something like:
LOCATION_FLAG;
DO_SOMETHING;
TEST FOR LOOP COMPLETION;//Jumps to LOCATION_FLAG if false
In the second it's something like:
LOCATION_FLAG;
DO_SOMETHING;
DO_SOMETHING;
DO_SOMETHING;
TEST FOR LOOP COMPLETION;//Jumps to LOCATION_FLAG if false
You can see that in the latter case the overhead of testing and jumping is only 1 instruction per 3. In the first it's 1 instruction per 1, so it happens a lot more often. Therefore, if you have invariants you can rely on (an array whose length is a multiple of 3, to use your example), then it is more efficient to unroll the loop, because the underlying assembly is written more directly.
How do optimizing compilers decide when and how much to unroll a loop?
Stack consumption and locality. Instruction counts. The ability to make and propagate optimizations based on the unrolled and inlined program. Whether the loop size is fixed, or expected to be in a certain range. Profile inputs (if applicable). Operations which may be removed from the loop body. Etc.
When a compiler performs a loop unroll optimization, how does it determine by which factor to unroll the loop, or whether to unroll the whole loop or not?
It depends largely on the input (your program). It can be slower (not typical) or it can be several times faster. Writing a program that runs optimally and that also enables the optimizer to do its job is a learned skill.
Since this is a space-performance tradeoff, on average how effective is this optimization technique in making the program perform better?
Generally, it helps most with a large number of iterations over very small bodies, particularly bodies that are branchless and have good data locality.
Also, under what conditions is it recommended to use this technique (i.e. certain operations or calculations)?
If you want to know whether the option helps your app, profile.
If you need more than that, you should reserve some time to learn how to write optimal programs, since the subject is quite complex.