Why Is the Parallel Package Slower Than Just Using Apply

Why is the parallel package slower than just using apply?

Running jobs in parallel incurs overhead. Only if the jobs you fire at the worker nodes take a significant amount of time does parallelization improve overall performance. When the individual jobs take only milliseconds, the overhead of constantly firing off jobs will degrade overall performance. The trick is to divide the work over the nodes in such a way that the jobs are sufficiently long, say at least a few seconds. I used this to great effect when running six Fortran models simultaneously; those individual model runs took hours, so the overhead was negligible by comparison.
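As a rough illustration (a minimal sketch; the cluster size is an arbitrary choice and the exact timings depend on your machine):

library(parallel)

cl <- makeCluster(4)

# Tiny per-task work: the dispatch/serialization overhead dominates, so the
# parallel version is typically slower than plain lapply here.
system.time(parLapply(cl, 1:10000, function(i) sqrt(i)))
system.time(lapply(1:10000, function(i) sqrt(i)))

# Substantial per-task work (simulated with Sys.sleep): the overhead becomes
# negligible and the parallel version wins.
system.time(parLapply(cl, 1:40, function(i) { Sys.sleep(0.1); sqrt(i) }))
system.time(lapply(1:40, function(i) { Sys.sleep(0.1); sqrt(i) }))

stopCluster(cl)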

Note that I haven't run your example, but the situation I describe above is often the issue when parallelization takes longer than running sequentially.

Why does using the parallel computing package make my R code run slower?

Before turning to parallel processing, you should try to improve single-core performance. Without seeing your code we cannot give any concrete advice, but the first step should be to profile it. Useful resources are
http://adv-r.had.co.nz/Performance.html and
https://csgillespie.github.io/efficientR/.
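For instance, base R's Rprof will show where the time is actually spent (the profvis package gives an interactive view of the same data); slow_step() below is only a placeholder for your own code:

slow_step <- function(n) sum(sort(runif(n)))   # placeholder for your real work

Rprof("profile.out")
for (i in 1:200) slow_step(1e5)
Rprof(NULL)

summaryRprof("profile.out")$by.self   # which functions dominate the runtime?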

Once you have achieved good single-core performance, you can try parallel processing. As hinted in the comments, it is crucial to keep the communication overhead low. Again, without seeing your code we cannot be specific, but here are some general guidelines:

  • Do not use a sequence of multiple parallelized steps. A single parallelized step which does all the work in sequence will have lower communication overhead.
  • Use a reasonable chunk size. If you have 10,000 tasks, then don't send them individually but in suitable groups. The parallel package does that by default as long as you do not use "load balancing". If you need load balancing for some reason, then you should group the tasks into a smaller number of chunks to be handled by the load balancing algorithm (see the sketch below).
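A rough sketch of the chunking idea (the worker count and the chunk count of 40 are arbitrary choices for illustration):

library(parallel)

cl <- makeCluster(4)
tasks <- 1:10000

# Default parLapply already splits the 10,000 tasks into one block per worker,
# so each worker receives a single large job.
res <- parLapply(cl, tasks, function(i) i^2)

# If you do need load balancing, pre-group the tasks into a modest number of
# chunks so each dispatched job is still large enough to amortize the
# communication overhead.
chunks <- split(tasks, cut(seq_along(tasks), 40, labels = FALSE))
res_lb <- parLapplyLB(cl, chunks, function(chunk) sapply(chunk, function(i) i^2))

stopCluster(cl)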

R: Why is parallel (much) slower? What is the best strategy for using parallel to (left) join a large collection of big files?

There are two reasons why your multithreading is slow:

1) Data transfer to new threads
2) Data transfer from the new threads back to the main thread

Issue #1 is completely avoided by using mclapply on Unix systems: it forks the process, so data is not copied unless it is modified. (makeCluster by default uses sockets to transfer data.)

Issue #2 cannot be avoided with mclapply, but what you can do is minimize the amount of data you transfer back to the main thread.

Naive mclapply:

# chunk_join() is the helper from the question; requires parallel, dplyr and data.table
join3 = mclapply(1:10, function(j) {
  chunk_join(j, A, B, C)
}, mc.cores=4) %>% rbindlist

Slightly smarter mclapply:

# chunk_join2() returns only the newly joined columns (it drops X1, X2, X3),
# so far less data is copied back from the worker processes.
chunk_join2 = function(i, A, B, C)
{
  A_i = A %>% filter(X2 == i)
  B_i = B %>% filter(X2 == i) %>% select(X1, X3)
  C_i = C %>% filter(X2 == i) %>% select(X1, X3)
  join_i = A_i %>% left_join(B_i, by = c('X3')) %>% left_join(C_i, by = c('X3'))
  join_i[, c(-1, -2, -3)]
}

# Sort A by chunk id so its rows line up with the rbindlist() output,
# then bind the original columns back on in the main process.
A <- arrange(A, X2)
join5 = mclapply(1:10, function(j) {
  chunk_join2(j, A, B, C)
}, mc.cores=4) %>% rbindlist
join5 <- cbind(A, join5)

Benchmarks:

  • Single threaded: 4.014 s
  • Naive mclapply: 1.860 s
  • Slightly smarter mclapply: 1.363 s

If your data has a lot of columns, you can see how Issue #2 will completely bog down the system. You can do even better by, e.g., returning the indices of B and C instead of whole data.frame subsets.
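As a rough sketch of that idea (assuming the same A/B/C layout as above; chunk_join_idx is a hypothetical helper, and it assumes X3 is unique within each chunk of B and C): each worker sends back only two integer index columns, and the main thread pulls whatever columns it needs from B and C afterwards.

chunk_join_idx <- function(i, A, B, C) {
  A_i <- A %>% filter(X2 == i)
  B_rows <- which(B$X2 == i)   # global row numbers of B for this chunk
  C_rows <- which(C$X2 == i)
  data.frame(
    b_row = B_rows[match(A_i$X3, B$X3[B_rows])],   # NA where there is no match
    c_row = C_rows[match(A_i$X3, C$X3[C_rows])]
  )
}

# As before, A must be arranged by X2 so the rows line up after rbindlist().
idx <- mclapply(1:10, chunk_join_idx, A, B, C, mc.cores = 4) %>% rbindlist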

Why is mclapply slower than apply in this case?

It looks like mclapply compares pretty well against lapply, but lapply does not compare well against apply. The reason may be that you're iterating over the rows of q with apply but over the columns of q with lapply and mclapply, which may account for the performance difference.

If you really do want to iterate over the rows of q, you could create ql using:

ql <- lapply(seq_len(nrow(q)), function(i) q[i, ])

If you want to iterate over the columns of q, then you should set MARGIN=2 in apply, as suggested by @flodel.

Both lapply and mclapply will iterate over the columns of a data frame, so you can create ql with:

ql <- as.data.frame(q)

This makes sense, since a data frame actually is a list (of columns).
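A quick illustration with a toy data frame:

q <- data.frame(a = 1:3, b = 4:6)

is.list(q)          # TRUE: a data frame is a list of columns
lapply(q, sum)      # iterates over the columns: $a = 6, $b = 15
apply(q, 2, sum)    # MARGIN = 2 also works column-wise
apply(q, 1, sum)    # MARGIN = 1 works row-wise: 5 7 9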

R parallel package - performance very slow in my toy example

Distributing tasks to different nodes takes a lot of computational overhead and can cancel out any gains you make from parallelizing your script. In your case, you're calling parLapply 10,000 times and probably spending more resources dispatching each task than actually doing the resampling. Try something like this with a non-parallel version of ratio_sim_par:

mclapply(1:10000, ratio_sim_par, x1, x2, nrep = 1000, mc.cores = n_cores)

mclapply will split the job across as many cores as you make available and fork only once. I'm using mclapply instead of parLapply because I'm used to it and it doesn't require as much setup.

Why is this parallel code slower than its similar non parallel version?

Well, the best answer you can get is from running a profiler and measuring what is going on in your code. My educated guess, though, is that your parallel code is slower because the work is so simple that starting up threads and switching between them costs more than whatever is gained in calculation speed.

Give the workers some substantial computation to do and you will eventually see the advantage of parallel execution. Your code is simply too lightweight; such trivial workloads are not what parallel execution is meant for.


