Why Is foreach() %do% Sometimes Slower Than for?

Why is foreach() %do% sometimes slower than for?

for will run sqrt B times, presumably discarding the answer each time. foreach, however, returns a list containing the result of each execution of the loop body. This would contribute considerable extra overhead, regardless of whether it's running in parallel or sequential mode (%dopar% or %do%).

I based this answer on running the following code, and it appears to be confirmed by the foreach vignette, which states: "foreach differs from a for loop in that its return is a list of values, whereas a for loop has no value and uses side effects to convey its result."

> print(for(i in 1:10) sqrt(i))
NULL

> print(foreach(i = 1:10) %do% sqrt(i))
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
... etc

UPDATE: I see from your updated question that the above answer isn't nearly sufficient to account for the performance difference. So I looked at the source code for foreach, and there is a LOT going on. I haven't tried to understand exactly how it works, but do.R and foreach.R show that even when %do% is used, large parts of the foreach machinery are still executed. That makes sense: %do% is largely provided so that you can test foreach code without having a parallel backend configured and loaded, and it also has to support the more advanced nesting and iteration facilities that foreach provides.

There are references in the code to result caching, error checking, debugging, and the creation of local environment variables for the arguments of each iteration (see, for example, the function doSEQ in do.R). I'd imagine this per-iteration bookkeeping is what creates the difference you've observed. Of course, if you were running much more complicated code inside your loop (the kind that would actually benefit from a parallelisation framework like foreach), this overhead would become irrelevant compared with the benefits it provides.
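
To illustrate how much fixed bookkeeping %do% adds per iteration, here is a minimal sketch (my own illustrative benchmark, not taken from the original post) that times a trivial loop body both ways; with such a cheap body, almost all of the measured time is foreach's per-iteration overhead:

library(foreach)
library(microbenchmark)

# Trivial body: any timing difference is dominated by per-iteration
# bookkeeping, not by the cost of sqrt() itself.
f_for <- function(B) {
  for (i in seq_len(B)) sqrt(i)          # result discarded each iteration
}
f_foreach <- function(B) {
  foreach(i = seq_len(B)) %do% sqrt(i)   # results collected into a list
}

microbenchmark(f_for(1000), f_foreach(1000), times = 10)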

Why Array.forEach is slower than for() loop in Javascript?

Updated based on feedback from @BenAston & @trincot

Roughly, this is what's happening in both cases:

For loop

  1. Set the index variable to its initial value
  2. Check whether or not to exit the loop
  3. Run the body of your loop
  4. Increment the index variable
  5. Back to step 2

The only overhead that happens on every iteration is the check & the increment, which are very low-load operations.

forEach loop

  1. Instantiate the callback function
  2. Check if there's a next element to process
  3. Call the callback for the next element to process, with a new execution context (this comprises the "scope" of the function; so its context, arguments, inner variables, and references to any outer variables -- if used)
  4. Run the contents of your callback
  5. Teardown of callback function call
  6. Return to step 2

The overhead of the function setup & teardown in steps 3 & 5 here is much greater than that of incrementing & checking an integer for the vanilla for-loop.

That said, many modern browsers recognize & optimize forEach calls, and in some cases, the forEach might even be faster!

Why is R for loop 10 times slower than when using foreach?

When used sequentially, foreach eventually uses the compiler package to produce compiled byte code, via the non-exported functions make.codeBuf and cmp. You can use cmpfun to compile the inner loop into byte code yourself and achieve a similar speedup.

library(foreach)
library(compiler)

f.original <- function() {
  x <- 0
  for (p in 1:2) {
    for (i in 1:500) {
      for (j in 1:5000) {
        x <- x + i * j
      }
    }
  }
  x
}

f.foreach <- function() {
  x <- 0
  foreach(p = 1:2, .combine = rbind) %do%
    for (i in 1:500) {
      for (j in 1:5000) {
        x <- x + i * j
      }
    }
  x
}

f.cmpfun <- function() {
  f <- cmpfun(function(x) {
    for (i in 1:500) {
      for (j in 1:5000) {
        x <- x + i * j
      }
    }
    x
  })
  f(f(0))
}

Results

library(microbenchmark)
microbenchmark(f.original(),f.foreach(),f.cmpfun(), times=5)
Unit: milliseconds
         expr       min        lq    median        uq       max neval
 f.original() 4033.6114 4051.5422 4061.7211 4072.6700 4079.0338     5
  f.foreach()  426.0977  429.6853  434.0246  437.0178  447.9809     5
   f.cmpfun()  418.2016  427.9036  441.7873  444.1142  444.4260     5
all.equal(f.original(),f.foreach(),f.cmpfun())
[1] TRUE

foreach 20 times slower than for?

A foreach loop requires the use of an Enumerator to iterate over the collection, which requires accessing the Current property and calling the MoveNext method on each iteration, which take some processing time.

The for loop only has to call get_Item on each iteration, so that’s one less call than the foreach loop, which makes a slight difference in performance.

foreach %dopar% slower than for loop

The doParallel package documentation specifically mentions, and illustrates with examples, that parallel execution can indeed sometimes be slower, because the results from the separate parallel processes have to be combined.

Reference: http://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf

Page 3:

With small tasks, the overhead of scheduling the task and returning
the result can be greater than the time to execute the task itself,
resulting in poor performance.

Using the example from the vignette, I found that in some cases the parallel version needed only about 50% of the time required to execute the sequential code.
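
The same effect is easy to reproduce. Below is a minimal sketch (the task sizes are chosen arbitrarily for illustration): with a trivial loop body, the scheduling and combining overhead dominates and %dopar% loses to a plain sequential loop, while with a heavier body the overhead is amortised and the parallel version pays off:

library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)

# Tiny tasks: per-task scheduling and result combining dominate,
# so %dopar% can easily be slower than a sequential loop.
system.time(for (i in 1:1000) sqrt(i))
system.time(foreach(i = 1:1000) %dopar% sqrt(i))

# Heavier tasks: the fixed overhead is amortised over real work.
system.time(for (i in 1:8) sum(sort(rnorm(1e6))))
system.time(foreach(i = 1:8) %dopar% sum(sort(rnorm(1e6))))

stopCluster(cl)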

Parallel.ForEach slower than normal foreach

This is a consequence of having a very small delegate body: the per-invocation overhead of Parallel.ForEach outweighs the tiny amount of work each call does.

We can achieve better performance by using partitioning, so that each invocation of the body delegate works on a larger chunk of data.

static int FindLargestParallelRange(int[] arr)
{
    object locker = new object();
    int largest = arr[0];
    Parallel.ForEach(Partitioner.Create(0, arr.Length), () => arr[0], (range, loop, subtotal) =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
            if (arr[i] > subtotal)
                subtotal = arr[i];
        return subtotal;
    },
    (finalResult) =>
    {
        lock (locker)
            if (largest < finalResult)
                largest = finalResult;
    });
    return largest;
}

Pay attention to the synchronization in the localFinally delegate. Also note the need for proper initialization of the localInit delegate: () => arr[0] instead of () => 0 (otherwise an array containing only negative values would give the wrong result).

Partitioning with PLINQ:

static int FindLargestPlinqRange(int[] arr)
{
    return Partitioner.Create(0, arr.Length)
        .AsParallel()
        .Select(range =>
        {
            int largest = arr[0];
            for (int i = range.Item1; i < range.Item2; i++)
                if (arr[i] > largest)
                    largest = arr[i];
            return largest;
        })
        .Max();
}

I highly recommend the free book Patterns of Parallel Programming by Stephen Toub.

Javascript efficiency: 'for' vs 'forEach'

for

for loops are much more efficient. The for statement is a looping construct specifically designed to iterate while a condition is true, while also offering a stepping mechanism (generally to increment the counter). Example:

for (var i = 0, n = arr.length; i < n; ++i) {
    ...
}

This isn't to suggest that for-loops will always be more efficient, just that JS engines and browsers have optimized them to be so. Over the years there have been compromises as to which looping construct is more efficient (for, while, reduce, reverse-while, etc) -- different browsers and JS engines have their own implementations that offer different methodologies to produce the same results. As browsers further optimize to meet performance demands, theoretically [].forEach could be implemented in such a way that it's faster or comparable to a for.

Benefits:

  • efficient
  • early loop termination (honors break and continue)
  • condition control (i<n can be anything and not bound to an array's size)
  • variable scoping (var i leaves i available after the loop ends)

forEach

forEach is a method that primarily iterates over arrays (and also over other iterable collections, such as Map and Set objects). It is newer and produces code that is subjectively easier to read. Example:

[].forEach((val, index) => {
    ...
});

Benefits:

  • does not involve variable setup (it iterates over each element of the array)
  • functions/arrow functions scope the variable to the callback

    In the example above, val would be a parameter of the newly created function. Thus, any variables called val before the loop would hold their values after it ends.
  • subjectively more maintainable, as it may be easier to identify what the code is doing -- it's iterating over an iterable, whereas a for loop could be used for any number of looping schemes

Performance

Performance is a tricky topic, and thinking about it ahead of time generally requires experience. To determine while developing how much optimization may be required, a programmer needs prior experience with the problem case, as well as a good understanding of the potential solutions.

Using jQuery may be too slow in some cases (an experienced developer may know that), whereas in other cases it is a non-issue, and the library's cross-browser compliance and the ease with which it handles other tasks (e.g., AJAX, event handling) are worth the development (and maintenance) time saved.

Another example: if performance and optimization were everything, there would be no code other than machine code or assembly. Obviously that isn't the case, as there are many different high-level and low-level languages, each with its own tradeoffs. These tradeoffs include, but are not limited to, specialization, development ease and speed, maintenance ease and speed, optimized code, error-free code, and so on.

Approach

If you don't have a good understanding of whether something will require optimized code, it's generally a good rule of thumb to write maintainable code first. From there, you can test and pinpoint what needs more attention when it's required.

That said, certain obvious optimizations should be part of general practice and not require any thought. For instance, consider the following loop:

for (var i=0; i < arr.length; ++i ){}

On each iteration of the loop, JavaScript retrieves arr.length, a key lookup that costs operations on every cycle. There is no reason why this shouldn't be:

for (var i=0, n=arr.length; i < n; ++i){}

This does the same thing, but only retrieves arr.length once, caching the value in a variable and optimizing your code.

Why does foreach %dopar% get slower with each additional node?

I find the per-node multiplication time very interesting because the timings don't include any of the overhead associated with the parallel loop, but only the time to perform the matrix multiplication, and they show that the time increases with the number of matrix multiplications executing in parallel on the same machine.

I can think of two reasons why that might happen:

  1. The memory bandwidth of the machine is saturated by the matrix multiplications before you run out of cores;
  2. The matrix multiplication is multi-threaded.

You can test for the first situation by starting multiple R sessions (I did this in multiple terminals), creating two matrices in each session:

> x <- matrix(rnorm(4096*4096), 4096)
> y <- matrix(rnorm(4096*4096), 4096)

and then executing a matrix multiplication in each of those sessions at about the same time:

> system.time(z <- x %*% t(y))

Ideally, this time will be the same regardless of the number of R sessions you use (up to the number of cores), but since matrix multiplication is a rather memory intensive operation, many machines will run out of memory bandwidth before they run out of cores, causing the times to increase.

If your R installation was built with a multi-threaded math library, such as MKL or ATLAS, then you could be using all of your cores with a single matrix multiplication, so you can't expect better performance by using multiple processes unless you use multiple computers.

You can use a tool such as "top" to see if you're using a multi-threaded math library.
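
You can also check from within R itself. The following is a small sketch (it assumes a reasonably recent R release, and the optional RhpcBLASctl package for querying and capping BLAS threads):

# Recent versions of R report the BLAS/LAPACK libraries in use
sessionInfo()

# If the RhpcBLASctl package is installed, you can query and limit
# the number of threads the BLAS is allowed to use
library(RhpcBLASctl)
blas_get_num_procs()       # how many threads the BLAS may use
blas_set_num_threads(1)    # force single-threaded matrix multiplication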

Finally, the output from lscpu suggests that you're using a virtual machine. I've never done any performance testing on multi-core virtual machines, but that could also be a source of problems.


Update

I believe the reason that your parallel matrix multiplications run more slowly than a single matrix multiplication is that your CPU isn't able to read memory fast enough to feed more than about two cores at full speed, which I referred to as saturating your memory bandwidth. If your CPU had large enough caches, you might be able to avoid this problem, but it doesn't really have anything to do with the amount of memory that you have on your motherboard.

I think this is just a limitation of using a single computer for parallel computations. One of the advantages of using a cluster is that your memory bandwidth goes up as well as your total aggregate memory. So if you ran one or two matrix multiplications on each node of a multi-node parallel program, you wouldn't run into this particular problem.

Assuming you don't have access to a cluster, you could try benchmarking a multi-threaded math library such as MKL or ATLAS on your computer. It's very possible that you could get better performance running one multi-threaded matrix multiply than running them in parallel in multiple processes. But be careful when using both a multi-threaded math library and a parallel programming package.
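
If you do combine the two, a common precaution (sketched below, assuming the RhpcBLASctl package is installed on the workers) is to pin the BLAS to a single thread inside each foreach worker so that the parallel processes don't oversubscribe the cores:

library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)

results <- foreach(i = 1:4, .packages = "RhpcBLASctl") %dopar% {
  blas_set_num_threads(1)                 # one BLAS thread per worker process
  x <- matrix(rnorm(2048 * 2048), 2048)
  x %*% t(x)                              # each worker runs its own multiplication
}

stopCluster(cl)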

You could also try using a GPU. They're obviously good at performing matrix multiplications.


Update 2

To see if the problem is R specific, I suggest that you benchmark the dgemm function, which is the BLAS function used by R to implement matrix multiplication.

Here's a simple Fortran program to benchmark dgemm. I suggest executing it from multiple terminals in the same way that I described for benchmarking %*% in R:

      program main
      implicit none
      integer n, i, j
      integer*8 stime, etime
      parameter (n=4096)
      double precision a(n,n), b(n,n), c(n,n)
      do i = 1, n
        do j = 1, n
          a(i,j) = (i-1) * n + j
          b(i,j) = -((i-1) * n + j)
          c(i,j) = 0.0d0
        end do
      end do
      stime = time8()
      call dgemm('N','N',n,n,n,1.0d0,a,n,b,n,0.0d0,c,n)
      etime = time8()
      print *, etime - stime
      end

On my Linux machine, one instance runs in 82 seconds, while four instances run in 116 seconds. This is consistent with the results that I see in R and with my guess that this is a memory bandwidth problem.

You can also link this against different BLAS libraries to see which implementation works better on your machine.

You might also get some useful information about the memory bandwidth of your virtual machine network using pmbw - Parallel Memory Bandwidth Benchmark, although I've never used it.

In .NET, which loop runs faster, 'for' or 'foreach'?

Patrick Smacchia blogged about this last month, with the following conclusions:

  • for loops on List are a bit more than 2 times cheaper than foreach loops on List.
  • Looping on array is around 2 times cheaper than looping on List.
  • As a consequence, looping on array using for is 5 times cheaper than looping on List using foreach (which, I believe, is what we all do).

Why is foreach so slow?

Maybe it has to do with the fact that foreach works on a copy of the array?

Or maybe it has to do with the fact that, when looping with foreach, the internal array pointer is changed on each iteration to point to the next element?

Quoting the relevant portion of foreach's manual page:

Note: Unless the array is referenced, foreach operates on a copy of the specified array and not the array itself. foreach has some side effects on the array pointer.


As far as I can tell, the third test you linked to doesn't do either of those two things -- which means the two tests don't do the same thing -- which means you are not comparing two ways of writing the same code.

(I would also say that this kind of micro-optimization will not matter at all in a real application -- but I guess you already know that, and just asked out of curiosity)

There is also one thing that doesn't feel right in this test: it only runs each test once. For a "better" test, it might be useful to run each of them several times -- with timings on the order of 100 microseconds, it doesn't take much to make a huge difference.

(Considering the first test varies between 300% and 500% on a few refreshes...)


For those who don't want to click, here's the first test (I've gotten 3xx%, 443%, and 529%):

foreach ($aHash as $key => $val) {
    $aHash[$key] .= "a";
}

And the third one (100%):

$key = array_keys($aHash);
$size = sizeOf($key);
for ($i = 0; $i < $size; $i++) {
    $aHash[$key[$i]] .= "a";
}

