The difference between doMC and doParallel in R
The doParallel package is a merger of doSNOW and doMC, much as parallel is a merger of snow and multicore. But although doParallel has all the features of doMC, I was told by Rich Calaway of Revolution Analytics that they wanted to keep doMC around because it was more efficient in certain circumstances, even though doMC now uses parallel just like doParallel. I haven't personally run any benchmarks to determine if and when there is a significant difference.
I tend to use doMC on a Linux or Mac OS X computer, doParallel on a Windows computer, and doMPI on a Linux cluster, but doParallel does work on all of those platforms.
As for the different registration methods, if you execute:
registerDoParallel(cores=3)
on a Windows machine, it will create a cluster object implicitly for later use with clusterApplyLB, whereas on Linux and Mac OS X, no cluster object is created or used. The number of cores is simply remembered and used as the value of the mc.cores argument later when calling mclapply.
If you execute:
cl <- makeCluster(3)
registerDoParallel(cl)
then the registered cluster object will be used with clusterApplyLB regardless of the platform. You are correct that in this case, it is your responsibility to shut down the cluster object since you created it, whereas the implicit cluster object is shut down automatically.
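The two registration styles described above can be sketched as follows. This is a minimal illustration, assuming doParallel is installed and two local workers are available; the variable names squares and cubes are made up for the example. The registerDoSEQ() call after stopCluster() is a precaution of this sketch, not something the answer requires.

```r
library(doParallel)  # also attaches foreach and parallel

# Explicit cluster: we created it, so we are responsible for stopping it.
cl <- makeCluster(2)
registerDoParallel(cl)
squares <- foreach(i = 1:4, .combine = c) %dopar% i^2
stopCluster(cl)   # shut down the workers we started
registerDoSEQ()   # register the sequential backend so nothing points at the dead cluster

# Implicit registration: only a core count is given.
# On Windows this creates a cluster behind the scenes; elsewhere forking is used.
registerDoParallel(cores = 2)
cubes <- foreach(i = 1:4, .combine = c) %dopar% i^3
stopImplicitCluster()  # optional: release the implicit cluster now, if one was created
```

Calling stopImplicitCluster() is harmless when no implicit cluster exists, so the same script can run unchanged on all three platforms.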
What's the difference between using the doParallel package with type = MPI and using doMPI directly?
The "doParallel" package acts as a wrapper around the "clusterApplyLB" function, which is implemented by calling functions from the "Rmpi" package when using an MPI cluster.
The "doMPI" package uses "Rmpi" functions directly and includes some features that aren't available in "clusterApplyLB":
it supports fetching inputs and combining outputs on-the-fly to efficiently handle a large number of loop iterations;
it supports MPI broadcast to initialize workers;
it allows workers to be started either by mpirun or by the MPI spawn function.
R: Parallelization with doParallel and foreach
If you want to see output printed by the workers when running in parallel, use makeCluster(no_cores, outfile = "").
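As a rough sketch of that tip, assuming two local workers (the loop body and the name res are invented for illustration):

```r
library(doParallel)

# outfile = "" means worker output is not redirected, so it reaches the
# console of the R process that started the cluster.
cl <- makeCluster(2, outfile = "")
registerDoParallel(cl)

res <- foreach(i = 1:3, .combine = c) %dopar% {
  cat("processing iteration", i, "\n")  # printed by the worker
  i * 10
}

stopCluster(cl)
```

The messages are most reliably visible when R is run from a terminal; some GUI consoles may not display worker output.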
doParallel, cluster vs cores
The behavior of doParallel::registerDoParallel(<numeric>) depends on the operating system; see print(doParallel::registerDoParallel) for details.
On Windows machines,
doParallel::registerDoParallel(4)
effectively does
cl <- makeCluster(4)
doParallel::registerDoParallel(cl)
i.e. it sets up four ("PSOCK") workers that run in background R sessions. Then, %dopar% will basically utilize the parallel::parLapply() machinery. With this setup, you do have to worry about global variables and packages being attached on each of the workers.
However, on non-Windows machines,
doParallel::registerDoParallel(4)
the result will be that %dopar% will utilize the parallel::mclapply() machinery, which in turn relies on forked processes. Since forking is used, you don't have to worry about globals and packages.
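To make the point about globals and packages concrete, here is a minimal sketch for the PSOCK case, using foreach's .export and .packages arguments. The names scale_factor, helper, scaled and exts are hypothetical, and the tools package is used only because it ships with R but is not attached by default.

```r
library(doParallel)

cl <- makeCluster(2)   # PSOCK workers: fresh R sessions with nothing of ours loaded
registerDoParallel(cl)

scale_factor <- 10
helper <- function(x) x * scale_factor  # uses a global that the loop body never names

# foreach exports variables it sees in the loop body (here: helper), but it
# cannot see scale_factor inside helper, so we export that one explicitly.
scaled <- foreach(i = 1:3, .combine = c, .export = "scale_factor") %dopar% {
  helper(i)
}

# Packages needed on the workers are listed in .packages.
exts <- foreach(f = c("a.csv", "b.txt"), .combine = c, .packages = "tools") %dopar% {
  file_ext(f)  # tools::file_ext, available because tools was loaded on the worker
}

stopCluster(cl)
```

With a forked backend, such as registerDoParallel(cores = 2) on Linux or Mac OS X, the same loops should work without .export or .packages, but keeping them makes the code portable to Windows.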
Difference between 'foreach' and 'parallel' in R?
foreach can execute using either %do% or %dopar%; it only runs in parallel with %dopar%.
More information available here: https://cran.r-project.org/web/packages/foreach/vignettes/foreach.pdf
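A short sketch of the two operators side by side, assuming doParallel as the backend and two local workers (seq_res and par_res are illustrative names):

```r
library(doParallel)

# %do% always runs sequentially in the current R session; no backend is needed.
seq_res <- foreach(i = 1:4, .combine = c) %do% sqrt(i)

# %dopar% hands the iterations to whichever backend is registered.
cl <- makeCluster(2)
registerDoParallel(cl)
n_workers <- getDoParWorkers()  # number of workers the registered backend reports
par_res <- foreach(i = 1:4, .combine = c) %dopar% sqrt(i)
stopCluster(cl)
```

If no backend has been registered, %dopar% still runs, but sequentially and with a warning.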