Why is the parallel package slower than just using apply?
Running jobs in parallel incurs overhead. Parallelization only improves overall performance if the jobs you fire at the worker nodes take a significant amount of time. When the individual jobs take only milliseconds, the overhead of constantly firing off jobs degrades overall performance. The trick is to divide the work over the nodes in such a way that each job is sufficiently long, say at least a few seconds. I used this to great effect running six Fortran models simultaneously; those individual model runs took hours, so the parallelization overhead was negligible by comparison.
Note that I haven't run your example, but the situation I describe above is often the issue when parallelization takes longer than running sequentially.
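A minimal sketch of this chunking idea (the task, data, and chunk count of 4 are all illustrative, not from the question):

```r
library(parallel)

# 100,000 tiny tasks: squaring a number takes microseconds each
xs <- 1:100000

# Instead of one parallel job per element, split the work into a few
# large chunks so each worker gets a meaningful amount of work
chunks <- split(xs, cut(seq_along(xs), 4, labels = FALSE))

# Each job processes a whole chunk sequentially; forking happens once
# per chunk, not once per element (fall back to 1 core on Windows,
# where fork-based parallelism is unavailable)
res <- mclapply(chunks, function(chunk) chunk^2,
                mc.cores = if (.Platform$OS.type == "unix") 2 else 1)
res <- unlist(res, use.names = FALSE)
```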
Why using parallel computing package makes my R code run slower
Before going to parallel processing you should try to improve the single core performance. Without seeing your code we cannot give any concrete advice, but the first step should be to profile your code. Useful resources are
http://adv-r.had.co.nz/Performance.html and
https://csgillespie.github.io/efficientR/.
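For example, base R's sampling profiler can show where a slow function spends its time. A minimal sketch, using a made-up and deliberately unvectorized toy function:

```r
# slow_sum() is a toy function for illustration; a vectorized
# sum(sqrt(seq_len(n))) would be far faster
slow_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + sqrt(i)
  total
}

out <- tempfile()
Rprof(out, interval = 0.01)  # start sampling the call stack
invisible(slow_sum(5e6))
Rprof(NULL)                  # stop profiling

prof <- summaryRprof(out)
head(prof$by.self)           # where the time was actually spent
```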
Once you have achieved good single core performance, you can try parallel processing. As hinted in the comments, it is crucial to keep the communication overhead low. Again, without seeing your code we cannot give any concrete advice, but here is some general advice:
- Do not use a sequence of multiple parallelized steps. A single parallelized step which does all the work in sequence will have lower communication overhead.
- Use a reasonable chunk size. If you have 10,000 tasks, don't send them individually but in suitable groups. The parallel package does that by default as long as you do not use "load balancing". If you need load balancing for some reason, you should group the tasks into a smaller number of chunks to be handled by the load-balancing algorithm.
R: Why parallel is (much) slower? What is best strategy in using parallel for a (left) join a large collection of big files?
There are two reasons why your multithreading is slow:
1) Data transfer to the new threads
2) Data transfer from the new threads back to the main thread
Issue #1 is completely avoided by using mclapply, which on Unix systems doesn't copy data unless it is modified (makeCluster, by contrast, uses sockets to transfer data by default).
Issue #2 cannot be avoided with mclapply, but what you can do is minimize the amount of data you transfer back to the main thread.
Naive mclapply:
join3 = mclapply(1:10, function(j) {
  join_i = chunk_join(j, A, B, C)
}, mc.cores = 4) %>% rbindlist
Slightly smarter mclapply:
chunk_join2 = function(i, A, B, C) {
  A_i = A %>% filter(X2 == i)
  B_i = B %>% filter(X2 == i) %>% select(X1, X3)
  C_i = C %>% filter(X2 == i) %>% select(X1, X3)
  join_i = A_i %>% left_join(B_i, by = c('X3')) %>% left_join(C_i, by = c('X3'))
  join_i[, c(-1, -2, -3)]
}
A <- arrange(A, X2)
join5 = mclapply(1:10, function(j) {
  join_i = chunk_join2(j, A, B, C)
}, mc.cores = 4) %>% rbindlist
join5 <- cbind(A, join5)
Benchmarks:
Single threaded: 4.014 s
Naive mclapply: 1.860 s
Slightly smarter mclapply: 1.363 s
If your data has a lot of columns, you can see how Issue #2 will completely bog down the system. You can do even better by, e.g., returning the indices of B and C instead of whole data.frame subsets.
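A minimal sketch of that index-returning idea, on made-up data shaped like the answer's B (only the column names mirror the example above):

```r
library(parallel)

# Illustrative data in the shape of the answer's B (invented here)
set.seed(1)
B <- data.frame(X1 = runif(1000),
                X2 = sample(1:10, 1000, replace = TRUE),
                X3 = sample(letters, 1000, replace = TRUE))

# Return only the matching row indices from each worker; the main
# process then subsets once, so far less data crosses the fork boundary
idx <- mclapply(1:10, function(j) which(B$X2 == j),
                mc.cores = if (.Platform$OS.type == "unix") 2 else 1)

B_chunks <- lapply(idx, function(i) B[i, ])
```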
Why is mclappy slower than apply in this case?
It looks like mclapply compares pretty well against lapply, but lapply does not compare well against apply. The reason may be that you're iterating over the rows of q with apply, but over the columns of q with lapply and mclapply. That may account for the performance difference.
If you really do want to iterate over the rows of q, you could create ql using:
ql <- lapply(seq_len(nrow(q)), function(i) q[i,])
If you want to iterate over the columns of q, then you should set MARGIN=2 in apply, as suggested by @flodel.
Both lapply and mclapply will iterate over the columns of a data frame, so you can create ql with:
ql <- as.data.frame(q)
This makes sense since a data frame actually is a list.
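A small illustration of the different iteration behaviour, on toy data:

```r
# A data frame is a list of columns, so lapply() iterates over columns;
# apply() first coerces it to a matrix and iterates by MARGIN
q <- data.frame(a = 1:3, b = 4:6)

col_sums_l <- lapply(q, sum)             # over columns: a = 6, b = 15
col_sums_a <- apply(q, MARGIN = 2, sum)  # same result via apply
row_sums_a <- apply(q, MARGIN = 1, sum)  # over rows: 5, 7, 9
```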
R parallel package - performance very slow in my toy example
Distributing tasks to different nodes incurs a lot of computational overhead, which can cancel out any gains from parallelizing your script. In your case, you're calling parLapply 10,000 times and probably spending more resources forking each task than actually doing the resampling. Try something like this with a non-parallel version of ratio_sim_par:
mclapply(1:10000, ratio_sim_par, x1, x2, nrep = 1000, mc.cores = n_cores)
mclapply will split the job across however many cores you have available and fork only once. I'm using mclapply instead of parLapply because I'm used to it and it doesn't require as much setup.
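A self-contained sketch of the same pattern, with a toy bootstrap function standing in for the question's ratio_sim_par (which is not shown here; the data is invented):

```r
library(parallel)

# Toy stand-in for the question's ratio_sim_par(): one bootstrap
# estimate of the ratio of two sample means (illustrative only)
ratio_boot <- function(i, x1, x2) {
  mean(sample(x1, replace = TRUE)) / mean(sample(x2, replace = TRUE))
}

set.seed(42)
x1 <- rnorm(100, mean = 5)
x2 <- rnorm(100, mean = 2)

n_cores <- if (.Platform$OS.type == "unix") 2 else 1
# One mclapply call forks once and spreads all 10,000 replicates
# across the workers, instead of launching 10,000 separate parallel jobs
ratios <- unlist(mclapply(1:10000, ratio_boot, x1 = x1, x2 = x2,
                          mc.cores = n_cores))
```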
Why is this parallel code slower than its similar non parallel version?
Well, the best answer you can get is to run a profiling tool and measure what is going on in your code. But my educated guess is that your parallel code is slower because the work is so simple that the cost of starting up threads and switching between them outweighs any gain in calculation speed.
Try some substantial computations and you will eventually see the advantage of parallel execution. Your per-task work is simply too small to keep modern CPUs busy this way.
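A sketch contrasting trivial and substantial per-task work (toy workloads; the exact timings vary by machine, and fork-based mclapply falls back to one core on Windows):

```r
library(parallel)

n_cores <- if (.Platform$OS.type == "unix") 2 else 1

# Trivial per-task work: forking and scheduling cost more than the
# microseconds of actual computation, so sequential usually wins
t_seq <- system.time(res_seq <- lapply(1:1000, function(i) i + 1))
t_par <- system.time(res_par <- mclapply(1:1000, function(i) i + 1,
                                         mc.cores = n_cores))

# Substantial per-task work: each task runs long enough that the
# one-off startup cost is amortized and parallel execution pays off
heavy <- function(i) sum(sqrt(seq_len(5e5))) + i
res_heavy <- mclapply(1:8, heavy, mc.cores = n_cores)
```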