Parallel Execution of Random Forest in R

parallel execution of random forest in R

Setting .multicombine to TRUE can make a significant difference:

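library(foreach)  # a parallel backend (e.g. doParallel or doMC) must already be registered

# Grow six forests of 25,000 trees each in parallel and merge them in one combine() call
rf <- foreach(ntree=rep(25000, 6), .combine=randomForest::combine,
              .multicombine=TRUE, .packages='randomForest') %dopar% {
    randomForest(x, y, ntree=ntree)
}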

This causes combine to be called once rather than five times. On my desktop machine, this runs in 8 seconds rather than 19 seconds.
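For anyone who wants to try this end to end, here is a minimal self-contained sketch. The doParallel backend, the two workers, the iris data and the small tree counts are my assumptions for illustration; they are not part of the original timing.

```r
library(foreach)
library(doParallel)
library(randomForest)

# Assumption: two local workers, chosen only for illustration
cl <- makeCluster(2)
registerDoParallel(cl)

# Assumption: iris stands in for your own predictors and response
x <- iris[, -5]
y <- iris$Species

# Four forests of 250 trees each; with .multicombine=TRUE the four
# results are passed to combine() together instead of pairwise
rf <- foreach(ntree = rep(250, 4), .combine = randomForest::combine,
              .multicombine = TRUE, .packages = "randomForest") %dopar% {
    randomForest(x, y, ntree = ntree)
}

stopCluster(cl)

rf$ntree  # total number of trees in the merged forest
```

Note that a forest produced by combine() does not carry over the out-of-bag error estimates of the individual forests, so evaluate it on held-out data if you need an error rate.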

Parallelizing random forests

There are a few answers on SO that I would take a look at, such as "parallel execution of random forest in R" and "Suggestions for speeding up Random Forests".

Those posts are helpful, but are a bit older. The ranger package is an especially fast implementation of random forest, so if you are new to this it might be the easiest way to speed up your model training. The ranger paper discusses the tradeoffs among some of the available packages; which package gives you the best performance depends on your data size and number of features.
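As a rough starting point, a ranger fit can look like the sketch below. The iris data, the tree count and the thread count are placeholders I picked for illustration, not recommendations.

```r
library(ranger)

# Assumption: iris stands in for your own training data
fit <- ranger(Species ~ ., data = iris,
              num.trees = 500,
              num.threads = 2)  # ranger parallelizes tree growing internally

fit$prediction.error  # out-of-bag prediction error
```

No foreach backend is needed here, since the threading happens inside ranger itself.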

How to run randomForest in R on multiple cores in parallel?

I use the doMC package and its registerDoMC function. Works really well.
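A minimal sketch of that setup, assuming two cores and using iris as stand-in data (both are my choices for illustration). doMC relies on forking, so this applies to Linux and macOS rather than Windows.

```r
library(doMC)
library(foreach)
library(randomForest)

registerDoMC(cores = 2)  # assumption: 2 cores; adjust to your machine

# Assumption: iris stands in for your own predictors and response
x <- iris[, -5]
y <- iris$Species

# Each forked worker grows 250 trees; the pieces are merged with combine()
rf <- foreach(ntree = rep(250, 4), .combine = randomForest::combine,
              .multicombine = TRUE) %dopar% {
    randomForest(x, y, ntree = ntree)
}

getDoParWorkers()  # number of workers foreach will use
```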

Small speed gain with parallel execution of random forest in Macbook (using R, caret)

The 'ranger' package you are using has built-in multithreading support. That is why you observe CPU usage of around 300-330% in the first case: it is already using at least 3 cores for training.

When using doParallel, you are using multiprocessing instead of multithreading, but the total amount of computing resources used in training is nearly the same, so you are not seeing much gain.
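One way to act on this is to pick a single level of parallelism. The sketch below keeps caret's resampling sequential and lets ranger do the threading. The data, the fold count and the thread count are assumptions for illustration, and it relies on train() forwarding extra arguments such as num.threads to ranger().

```r
library(caret)
library(ranger)

# Sequential resampling: no foreach workers, so ranger's threads
# are the only source of parallelism
ctrl <- trainControl(method = "cv", number = 3, allowParallel = FALSE)

# Assumption: iris stands in for your own training data
fit <- train(Species ~ ., data = iris,
             method = "ranger",
             trControl = ctrl,
             tuneLength = 2,
             num.threads = 2)  # forwarded to ranger()

fit$bestTune  # tuning parameters selected by cross-validation
```

The opposite choice also works: register a doParallel backend for the resampling loop and pass num.threads = 1 so that each worker stays on a single core.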
