Suggestions for Speeding Up Random Forests

How can I speed up the training of my random forest?

While I'm a fan of brute force techniques, such as parallelization or running code for an extremely long time, I am an even bigger fan of improving an algorithm so that brute force isn't needed in the first place.

Training your random forest with 2000 trees was starting to get prohibitively expensive, whereas training with a smaller number of trees took a more reasonable time. For starters, you can train with, say, 4, 8, 16, 32, ..., 256, 512 trees and carefully observe the metrics that tell you how robust the model is. These metrics include the comparison against the best constant model (how well your forest performs on the data set versus a model that predicts the median for all inputs), as well as the out-of-bag error. In addition, you can watch the top predictors and their importance scores, and check whether their ranking converges as you add more trees.

Ideally, you should not have to use thousands of trees to build a model. Once your model begins to converge, adding more trees won't necessarily worsen it, but it won't add any new information either. By avoiding too many trees, you may be able to cut a calculation that would have taken on the order of a week down to less than a day. If, on top of this, you leverage a dozen CPU cores, you might be looking at something on the order of hours.

To look at variable importance after each random forest run, you can try something along the lines of the following:

fit <- randomForest(...)        # your existing call, with whatever arguments you are using
round(importance(fit), 2)       # variable importance, rounded for readability

It is my understanding that the first, say, 5-10 predictors have the greatest impact on the model. If you notice that increasing the number of trees doesn't really change the relative ranking of these top predictors, and the importance metrics stay about the same, then you might want to consider not using so many trees.
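
For instance, a minimal sketch of this doubling strategy (the data frame df and response y below are made up purely for illustration; substitute your own data and formula):

library(randomForest)

set.seed(1)
df <- data.frame(y = factor(rep(c("a", "b"), each = 50)),
                 matrix(runif(100 * 10), nrow = 100))

for (nt in c(4, 8, 16, 32, 64, 128, 256, 512)) {
  fit <- randomForest(y ~ ., data = df, ntree = nt, importance = TRUE)
  oob <- fit$err.rate[nt, "OOB"]                                # OOB error after nt trees
  imp <- sort(importance(fit, type = 1)[, 1], decreasing = TRUE)
  cat(nt, "trees | OOB error:", round(oob, 3),
      "| top predictors:", paste(names(imp)[1:3], collapse = ", "), "\n")
}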

Suggestions for speeding up Random Forests

The manual of the foreach package has a section on Parallel Random Forests
(Using The foreach Package, Section 5.1):

> library("foreach")
> library("doSNOW")
> registerDoSNOW(makeCluster(4, type="SOCK"))

> x <- matrix(runif(500), 100)
> y <- gl(2, 50)

> rf <- foreach(ntree = rep(250, 4), .combine = combine, .packages = "randomForest") %dopar%
+ randomForest(x, y, ntree = ntree)
> rf
Call:
randomForest(x = x, y = y, ntree = ntree)
Type of random forest: classification
Number of trees: 1000

If we want to create a random forest model with 1000 trees, and our computer has four
cores, we can split up the problem into four pieces by executing the randomForest function four times, with the ntree argument set to 250. Of course, we have to combine the resulting randomForest objects, but the randomForest package comes with a function called combine.

How to improve randomForest performance?

(The tl;dr is that you should a) increase nodesize to >> 1 and b) exclude very low-importance feature columns, maybe even (say) 80% of your columns. Your issue is almost surely not na.roughfix, but if you suspect it is, run na.roughfix separately as a standalone step before calling randomForest, and get that red herring out of the way first.)

Now, all of the following advice only applies until you blow out your memory limits, so measure your memory usage and make sure you're not exceeding it. (Start with ridiculously small parameters, then scale them up, measure the runtime, and keep checking that it didn't increase disproportionately.)
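
If it helps, one crude way to watch peak memory from inside R (the tiny synthetic x and y here are just so there is something to fit) is to reset and reread the "max used" columns that gc() reports:

library(randomForest)

set.seed(1)
x <- matrix(runif(1000 * 20), nrow = 1000)
y <- factor(sample(c("a", "b"), 1000, replace = TRUE))

gc(reset = TRUE)                                     # reset the "max used" counters
fit <- randomForest(x, y, ntree = 50, nodesize = 10)
gc()                                                 # "max used" shows the peak since the reset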

The main parameters affecting the performance of randomForest are:

  • mtry (less is faster)
  • ntrees
  • number of features/cols in data - more is quadratically slower, or worse! See below
  • number of observations/rows in data
  • ncores (more is faster - as long as a parallel option is being used)
  • some performance boost by setting importance=FALSE and proximity=FALSE (don't compute the proximity matrix)
  • Never ever use the insane default nodesize=1 for classification! In Breiman's package you can't directly set maxdepth, but use nodesize as a proxy for it, and also read all the good advice at CrossValidated: "Practical questions on tuning Random Forests"
  • So here your data has 4.2e+5 rows; if each node shouldn't be smaller than ~0.01% of that, try nodesize=42. (First try nodesize=420 (0.1%), see how fast it is, then rerun, adjusting nodesize down. Empirically determine a good nodesize for this dataset.)
  • runtime is roughly proportional to 2^D_max; since the maximum depth grows as nodesize shrinks (D_max ≈ log2(R/nodesize)), a larger nodesize gives substantially faster runs
  • optionally, you can also speed things up by subsampling rows; see the strata and sampsize arguments (a timing sketch follows the runtime estimate below)

Then a first-order estimate of runtime, denoting mtry=M, ntrees=T, ncores=C, nfeatures=F, nrows=R, maxdepth=D_max, is:

Runtime proportional to: T * F^2 * (R^1.something) * 2^D_max / C

(Again, all bets are off if you exceed memory. Also, try running on only one core, then 2, then 4, and verify that you actually get a linear speedup rather than a slowdown.)
(The effect of large R is worse than linear, maybe quadratic, since tree partitioning has to consider all partitions of the data rows; certainly it's somewhat worse than linear. Check that by using sampling or indexing to give it only, say, 10% of the rows.)
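
As a concrete way to run these checks, here is a rough timing sketch on synthetic data (all sizes and parameter values are placeholders; vary one knob at a time and watch the elapsed time):

library(randomForest)

set.seed(1)
nr <- 20000; nf <- 30                          # rows and feature columns
x <- matrix(runif(nr * nf), nrow = nr)
y <- factor(x[, 1] + runif(nr) > 1)

# Effect of nodesize: 10%, 1%, 0.1% of the rows.
for (ns in c(2000, 200, 20)) {
  tm <- system.time(randomForest(x, y, ntree = 50, nodesize = ns))["elapsed"]
  cat("nodesize =", ns, "->", round(tm, 1), "seconds\n")
}

# Effect of R: subsample ~10% of rows per tree via sampsize.
system.time(randomForest(x, y, ntree = 50, sampsize = 0.1 * nr))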

Tip: keeping lots of crap low-importance features quadratically increases runtime, for a sublinear increase in accuracy. This is because at each node we must consider all possible feature selections (or whatever number mtry allows), and within each tree we must consider all (F-choose-mtry) possible combinations of features.
So here's my methodology for "fast-and-dirty feature-selection for performance" (a rough R sketch follows the note after the list):

  1. fit a forest normally (slow), but use a sane nodesize=42 or larger
  2. look at rf$importance or randomForest::varImpPlot(). Pick only the top-K features, where you choose K; for a silly-fast example, choose K=3. Save the entire importance list for future reference.
  3. now rerun the forest, but only give it newdata[, importantCols]
  4. confirm that it runs (roughly quadratically) faster and that the OOB error is not much worse
  5. once you know your variable importances, you can turn the computation off with importance=FALSE
  6. tweak mtry and nodesize (one at a time), rerun, and measure the speed improvement
  7. plot your performance results on logarithmic axes
  8. post us the results! Did you corroborate the above? Any comments on memory usage?

(Note that the above is not a statistically valid procedure for actual feature selection; do not rely on it for that. See the randomForest package documentation for proper methods of RF-based feature selection.)
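
A rough R sketch of steps 1-4, on made-up data and with K=3 (the response here is pure noise, so only the mechanics matter, not the error rates):

library(randomForest)

set.seed(1)
dat <- data.frame(y = factor(sample(c("a", "b"), 2000, replace = TRUE)),
                  matrix(runif(2000 * 50), nrow = 2000))

# 1. One normal (slow) run, with a sane nodesize and importance turned on.
rf_full <- randomForest(y ~ ., data = dat, nodesize = 42, importance = TRUE)

# 2. Pick the top-K features by importance (K = 3 here, silly-fast on purpose).
imp <- importance(rf_full, type = 1)[, 1]
importantCols <- names(sort(imp, decreasing = TRUE))[1:3]

# 3. Rerun, but only on those columns (importance no longer needed).
rf_small <- randomForest(x = dat[, importantCols], y = dat$y, nodesize = 42)

# 4. Wrap both calls in system.time() to compare speed, and compare OOB error:
rf_full$err.rate[rf_full$ntree, "OOB"]
rf_small$err.rate[rf_small$ntree, "OOB"]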

Get randomForest regression faster in R

Since you are using caret, you could use method = "parRF", which is a parallel implementation of random forest.

For example:

library(caret)
library(randomForest)
library(doParallel)

cores <- 3
cl <- makePSOCKcluster(cores)
registerDoParallel(cl)

dataset <- read.csv("/home/anonimo/Modelli/total_merge.csv", header=TRUE)
dati <- data.frame(dataset)
attach(dati)


trainSet <- dati[2:107570,]
testSet <- dati[107570:480343,]

# 3-fold cross-validation.
my_control <- trainControl(method = "cv", number = 3 )

my_forest <- train(Clip_pm25 ~ e_1 + Clipped_so + Clip_no2 + t2m_1 + tp_1 + Clipped_nh + Clipped_co + Clipped_o3 + ssrd_1 + Clipped_no + Clip_pm10 + sp_1,
                   data = trainSet,
                   method = "parRF",
                   ntree = 250,
                   trControl = my_control)

Here is a foreach implementation as well:

foreach_forest <- foreach(ntree = rep(250, cores),
                          .combine = combine,
                          .multicombine = TRUE,
                          .packages = "randomForest") %dopar%
  randomForest(Clip_pm25 ~ e_1 + Clipped_so + Clip_no2 + t2m_1 + tp_1 + Clipped_nh + Clipped_co + Clipped_o3 + ssrd_1 + Clipped_no + Clip_pm10 + sp_1,
               data = trainSet, ntree = ntree)

# don't forget to stop the cluster
stopCluster(cl)

Remember that I didn't set any seeds; you might want to take care of that as well. There is also a random forest package that itself runs in parallel, but I have not tested it.
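
If reproducibility matters, one option (my suggestion, not part of the original answer) is the doRNG package, which makes %dopar% loops reproducible once a seed has been registered:

library(doRNG)       # parallel-safe, reproducible RNG streams for foreach
registerDoRNG(123)   # call this after registerDoParallel(cl)
# ...then run the same foreach(...) %dopar% call as above; its result is now reproducible.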

Why does specifying sampsize not speed up randomForest?

Your brackets are slightly off. Notice the difference between the following statements. You currently have:

ifelse(nrow(mtcars<10),nrow(mtcars), 10)

This counts the rows of the boolean matrix mtcars < 10, which has TRUE for every element of mtcars that is smaller than 10 and FALSE otherwise. Since nrow(mtcars < 10) is simply nrow(mtcars), a nonzero number that ifelse() treats as TRUE, the expression always returns nrow(mtcars), so the sample size never actually gets capped at 10. You want:

ifelse(nrow(mtcars)<10,nrow(mtcars), 10)
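
With the parentheses fixed, the expression is just min(nrow(mtcars), 10), which you can also write directly; a small sketch:

library(randomForest)

# Cap the per-tree sample at 10 rows (or the whole data set, if it has fewer than 10).
rf <- randomForest(mpg ~ ., data = mtcars,
                   sampsize = min(nrow(mtcars), 10),
                   ntree = 500)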

Hope this helps.

Random Forest vs Support Vector Machine. Which is faster?

Recall how an SVM works: it applies the kernel to each pair of inputs, and this scales badly. Training an SVM has time complexity of $O(dn^2)$ to $O(dn^3)$ for libsvm, which is what e1071 uses.

Random forest uses independent decision trees. Fitting each tree is computationally cheap (that's one of the reasons we ensemble trees); fitting is slower with a larger number of trees, but the trees can be fitted in parallel. The time complexity is $O(n\log(n)dk)$, where $k$ is the number of trees.

SVM would scale worse than random forest and is generally not recommended for larger datasets.
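
If you want to see the difference on your own machine, here is a crude timing sketch on synthetic data (both fits use default settings; the exact numbers will vary):

library(e1071)
library(randomForest)

set.seed(42)
n <- 5000; d <- 20
x <- matrix(rnorm(n * d), n, d)
y <- factor(rowSums(x[, 1:3]) > 0)

system.time(svm(x, y))                         # kernel SVM: roughly O(d n^2) to O(d n^3)
system.time(randomForest(x, y, ntree = 100))   # forest: roughly O(k d n log n), parallelizable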

Speed up cross validated random forest approach by parallel computing

My response is too verbose for a comment so hopefully I can help guide you. Here is a summary of the points in the comments above.

  1. Our primary concern - Computation Time
  2. One major limitation on random forest computation - Number of Trees

The general idea is that as you increase the number of trees, your random forest model will improve (i.e. lower error). However, this increase in performance diminishes as you continue to add trees, whereas computation time keeps increasing. As such, you reach a point of diminishing returns. So, how do we determine how many trees to use?

Well, naively we could simply fit the randomForest model with the call you provided. Another option would be to cross-validate over ntree, but that isn't implemented by default in caret, and Max Kuhn (the author) really knows his stuff when it comes to predictive models. So, to get started, you are on the right track with the call you provided above:

randomForest(dependentVariable ~ ., data = dataSet, mtry = 522, ntree=3000, importance=TRUE, do.trace=100)

But let's make this reproducible and use the mlbench Sonar dataset.

library(mlbench)
library(randomForest)
data(Sonar)

We don't care about variable importance at the moment, so let's remove that. Also, your ntree is way too high to start with; I would be surprised if you need it that high in the end. Starting at a lower level, we have the following:

set.seed(825)
rf1 <- randomForest(Class ~ ., data = Sonar, mtry = 3, ntree = 200, do.trace = 25)

ntree      OOB      1      2
   25:  16.83% 11.71% 22.68%
   50:  18.27% 12.61% 24.74%
   75:  17.31% 17.12% 17.53%
  100:  15.38% 12.61% 18.56%
  125:  15.38% 10.81% 20.62%
  150:  16.35% 13.51% 19.59%
  175:  15.87% 10.81% 21.65%
  200:  14.42%  8.11% 21.65%

As you can see, the OOB error is bottoming out at approximately 100 trees. However, if I were uncertain, or if the OOB error were still dropping significantly, I could run the call again with a larger number of trees. Once you have a working number of trees, you can tune mtry with caret::train.
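
For that mtry tuning step, a sketch with caret::train on the same Sonar data (the grid values and 5-fold CV are just illustrative choices):

library(caret)

set.seed(825)
rf_tuned <- train(Class ~ ., data = Sonar,
                  method = "rf",
                  ntree = 200,                                    # passed through to randomForest
                  tuneGrid = data.frame(mtry = c(2, 3, 5, 8, 13)),
                  trControl = trainControl(method = "cv", number = 5))
rf_tuned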

If you do end up needing lots of trees (i.e. thousands), then your code is likely going to be slow. If you have access to machines with many processors and large amounts of RAM, your parallel implementation can help, but on more common machines it will be slow going.

Why does performance suffer when fitting a Random Forest model after reducing with PCA?


Difference in features

You said that originally you had 784 features and that you reduced them to 154. That may seem like a big reduction. However, if you look at the documentation:

max_features : int, float, string or None, optional (default=”auto”)

The number of features to consider when looking for the best split:

  • If “auto”, then max_features=sqrt(n_features).

That means that your original model considered sqrt(784) = 28 features at each split, while the reduced one considers sqrt(154) ≈ 12.

Yes, it is smaller now, but not as small as you originally thought.

Optimization

Your random forest is built by looking at possible splits and picking the best ones according to a certain criterion. Note the documentation:

criterion : string, optional (default=”gini”)

The function to measure the quality of a split. Supported criteria are
“gini” for the Gini impurity and “entropy” for the information gain.
Note: this parameter is tree-specific.

[...]

Note: the search for a split does not stop until at least one valid
partition of the node samples is found, even if it requires to
effectively inspect more than max_features features.

So, while fitting, the algorithm iterates over possible splits that optimize the criterion. However, by reducing the number of features you might have made good splits harder to find (there are fewer of them), so the algorithm needs more iterations to find one.


