Understanding the differences between mclapply and parLapply in R
The beauty of mclapply is that the worker processes are all created as clones of the master right at the point that mclapply is called, so you don't have to worry about reproducing your environment on each of the cluster workers. Unfortunately, that isn't possible on Windows.
When using parLapply, you generally have to perform the following additional steps:
- Create a PSOCK cluster
- Register the cluster if desired
- Load necessary packages on the cluster workers
- Export necessary data and functions to the global environment of the cluster workers
Also, when you're done, it's good practice to shut down the PSOCK cluster using stopCluster.
Here's a translation of your example to parLapply:
library(parallel)
cl <- makePSOCKcluster(4)        # start four background R worker processes
setDefaultCluster(cl)            # register cl as the default cluster
adder <- function(a, b) a + b
clusterExport(NULL, c('adder'))  # copy adder to each worker's global environment
parLapply(NULL, 1:8, function(z) adder(z, 100))
# ...and when done: stopCluster(cl)
If your adder function requires a package, you'll have to load that package on each of the workers before calling it with parLapply. You can do that quite easily with clusterEvalQ:
clusterEvalQ(NULL, library(MASS))
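A quick way to verify that the package really is attached on every worker is to ask each one to inspect its own search path. This is a minimal sketch using a small throwaway cluster (MASS is used only because it ships with R):

```r
library(parallel)

cl <- makePSOCKcluster(2)
setDefaultCluster(cl)

clusterEvalQ(NULL, library(MASS))  # load MASS on every worker

# each worker reports whether MASS is attached in its own session
loaded <- clusterEvalQ(NULL, "package:MASS" %in% search())
unlist(loaded)

stopCluster(cl)
```

clusterEvalQ evaluates the expression in each worker's session, so it returns one result per worker, here a list of TRUE values.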
Note that the NULL first argument to clusterExport, clusterEvalQ and parLapply indicates that they should use the cluster object registered via setDefaultCluster. That can be very useful if your program uses mclapply in many different functions, so that you don't have to pass the cluster object to every function that needs it when converting your program to use parLapply.
Of course, adder may call other functions in your global environment, which call other functions, and so on. In that case, you'll have to export them as well and load any packages that they need. Also note that if any variables that you've exported change during the course of your program, you will have to export them again to update them on the cluster workers. None of this is necessary with mclapply, because it creates fresh forks of the master process every time it is called.
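A minimal sketch of that re-export step (the counter variable is purely illustrative):

```r
library(parallel)

cl <- makePSOCKcluster(2)
setDefaultCluster(cl)

counter <- 1
clusterExport(NULL, "counter")
r1 <- parLapply(NULL, 1:2, function(i) counter + i)  # workers see counter == 1

counter <- 100                  # changed on the master only
clusterExport(NULL, "counter")  # must re-export, or workers would still see 1
r2 <- parLapply(NULL, 1:2, function(i) counter + i)  # workers see counter == 100

stopCluster(cl)
```

Without the second clusterExport call, the workers' copies of counter would silently stay at 1; with mclapply this class of bug cannot arise, since each call forks a fresh copy of the master's environment.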
mclapply vs parLapply speeds
Some quick benchmarks suggest that mclapply could be slightly faster, but this probably depends on the specific system and problem. The more balanced the jobs and the slower the actual tasks, the less it matters which function you use.
library(parallel)
library(microbenchmark)
microbenchmark(
parLapply = {cl <- makeCluster(2)
parLapply(cl, rep(1:7, 3), function(x) {set.seed(1); rnorm(10^x)})
stopCluster(cl)},
mclapply = {mclapply(rep(1:7, 3), function(x) {set.seed(1); rnorm(10^x)}, mc.cores = 2)},
times = 10
)
#Unit: seconds
# expr min lq mean median uq max neval
#parLapply 1.85548 2.04397 3.332970 3.071284 4.323514 6.294364 10
#mclapply 1.62610 1.65288 2.217407 1.849594 2.243418 5.435189 10
microbenchmark(
parLapply = {cl <- makeCluster(2)
parLapply(cl, rep(6, 20), function(x) {set.seed(1); rnorm(10^x)})
stopCluster(cl)},
mclapply = {mclapply(rep(6, 20), function(x) {set.seed(1); rnorm(10^x)}, mc.cores = 2)},
times = 10
)
#Unit: milliseconds
# expr min lq mean median uq max neval
#parLapply 1150.657 1188.9750 1705.1364 1242.739 2071.276 3785.516 10
# mclapply 820.692 932.2262 994.4404 1000.402 1079.930 1117.863 10
sessionInfo()
#R version 3.3.1 (2016-06-21)
#Platform: x86_64-pc-linux-gnu (64-bit)
#Running under: Ubuntu 14.04.5 LTS
#
#locale:
# [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
# [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 LC_PAPER=de_DE.UTF-8 LC_NAME=C
# [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
#
#attached base packages:
#[1] parallel stats graphics grDevices utils datasets methods base
#
#other attached packages:
#[1] microbenchmark_1.4-2.1 doParallel_1.0.10 iterators_1.0.8 foreach_1.4.3
#
#loaded via a namespace (and not attached):
# [1] colorspace_1.2-6 scales_0.4.0 plyr_1.8.4 tools_3.3.1 gtable_0.2.0 Rcpp_0.12.4
# [7] ggplot2_2.1.0 codetools_0.2-14 grid_3.3.1 munsell_0.4.3
Is mclapply() with mc.cores = 1 the same as lapply()?
The source code of parallel::mclapply contains this bit of code:
...
if (cores < 2L)
return(lapply(X = X, FUN = FUN, ...))
...
So I believe the answer is yes, you should get the same results as using lapply directly, but there is also some additional overhead. I doubt that this will affect the runtime very significantly.
The documentation also states that:
Details
mclapply is a parallelized version of lapply, provided mc.cores > 1:
for mc.cores == 1 it simply calls lapply.
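You can confirm the equivalence directly on your own machine; this trivial sketch uses nothing beyond base R:

```r
library(parallel)

res_lapply <- lapply(1:5, function(x) x^2)
res_mc     <- mclapply(1:5, function(x) x^2, mc.cores = 1)

identical(res_lapply, res_mc)
# [1] TRUE
```

Because mc.cores = 1 falls through to plain lapply, this also works on Windows, where higher values of mc.cores are not supported.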
Versions of lapply() and mclapply() that avoid redundant processing
This actually seems to work:
lightly_parallelize_atomic <- function(X, FUN, jobs = 1, ...) {
  keys <- unique(X)        # each distinct input value, computed only once
  index <- match(X, keys)  # position of every original element within keys
  values <- mclapply(X = keys, FUN = FUN, mc.cores = jobs, ...)
  values[index]            # expand results back to the original length and order
}
And in my case, it's okay that X is atomic.
But it would be neat to find something already built into either a package or R natively.
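As a quick illustration of the helper above, the duplicated inputs below trigger only two evaluations of FUN rather than four (the call counter and slow_square are hypothetical names; with jobs = 1 the work runs via plain lapply, so the `<<-` counter is reliable):

```r
library(parallel)

lightly_parallelize_atomic <- function(X, FUN, jobs = 1, ...) {
  keys <- unique(X)
  index <- match(X, keys)
  values <- mclapply(X = keys, FUN = FUN, mc.cores = jobs, ...)
  values[index]
}

calls <- 0L
slow_square <- function(x) { calls <<- calls + 1L; x^2 }

out <- lightly_parallelize_atomic(c(1, 2, 1, 2), slow_square)
unlist(out)  # 1 4 1 4
calls        # 2 -- FUN ran once per unique value, not once per element
```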
R, the environment of mclapply and removing variables
You should call the gc function after removing the variable so that the memory associated with the object is freed by the garbage collector sooner rather than later. The rm function only removes the reference to the data; the actual object may continue to exist until the garbage collector eventually runs.
You may also want to call gc before the first mclapply to make testing easier:
gc()
opt.Models = mclapply(1:100, mc.cores=20, function(i){
res = loadResult(reg, id=i)
return(post.Process(res))
})
# presumably do something with opt.Models...
rm(opt.Models)
gc() # free up memory before forking
opt.Models = mclapply(1:100, mc.cores=20, function(i){
res = loadResult(reg, id=i)
return(post.Process(res))
})
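A minimal sketch of why the explicit gc call matters, using the summary matrix that gc returns (the 10^7-element vector is arbitrary, roughly 80 MB of doubles):

```r
x <- rnorm(1e7)  # allocate about 1e7 Vcells (~80 MB)
before <- gc()   # collect, then snapshot memory usage while x is alive

rm(x)            # drops the reference only; the memory is not yet reclaimed
after <- gc()    # forces the collector to actually free it

# Vcells "used" should drop by roughly 1e7 cells once x is collected
before["Vcells", "used"] - after["Vcells", "used"]
```

This matters before an mclapply call because each forked worker inherits the master's memory footprint at fork time, so collecting garbage first keeps every fork smaller.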