R Random Forests Variable Importance

An explanation that uses the words 'error', 'summation', or 'permutated'
would be less helpful than a simpler explanation that doesn't involve any
discussion of how random forests work.

It's like asking someone to explain how to use a radio: I wouldn't
expect the explanation to involve how a radio converts radio waves into sound.

How would you explain what the numbers in WKRP 100.5 FM "mean" without going into the pesky technical details of wave frequencies? Frankly, the parameters and related performance issues of Random Forests are difficult to get your head around even if you understand some technical terms.

Here's my shot at some answers:

-mean raw importance score of variable x for class 0

-mean raw importance score of variable x for class 1

Simplifying from the Random Forest web page, raw importance score measures how much more helpful than random a particular predictor variable is in successfully classifying data.

-MeanDecreaseAccuracy

I think this is only in the R module, and I believe it measures how much inclusion of this predictor in the model reduces classification error.

-MeanDecreaseGini

Gini is defined as "inequity" when used to describe a society's distribution of income, or as a measure of "node impurity" in tree-based classification. A low Gini (i.e. a higher decrease in Gini) means that a particular predictor variable plays a greater role in partitioning the data into the defined classes. It's a hard one to describe without talking about the fact that data in classification trees are split at individual nodes based on values of predictors. I'm not so clear on how this translates into better performance.
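If it helps to connect these names to actual output, here is a minimal sketch (using the built-in iris data, purely for illustration) that prints the matrix in which all of these columns appear: one raw score per class plus the two overall measures.

library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
importance(rf)   # one column per class, plus MeanDecreaseAccuracy and MeanDecreaseGini
varImpPlot(rf)   # built-in plot of the two overall measures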

Variable Importance for Individual classes in R using Caret

Just add importance=TRUE in the train function; this is the same as calling importance(odFit) in the randomForest package.

Here is a reproducible example:

library(caret)
data(iris)

control  <- trainControl(method = "cv", number = 10)
# note: 2:ncol(iris)-1 is evaluated as (2:ncol(iris)) - 1, i.e. mtry candidates 1:4
tunegrid <- expand.grid(mtry = 2:ncol(iris) - 1)

odFit <- train(x = iris[, -5],
               y = iris$Species,
               ntree = 20,
               trControl = control,
               tuneGrid = tunegrid,
               importance = TRUE)
odFit

varImp(odFit)

and here is the output:

rf variable importance

  variables are sorted by maximum importance across the classes
               setosa versicolor virginica
  Petal.Width   57.21     73.747    100.00
  Petal.Length  61.90     79.981     77.49
  Sepal.Length  20.01      2.867     40.47
  Sepal.Width   20.01      0.000     15.73

You can plot the variable importance with ggplot2:

library(ggplot2)

vi     <- varImp(odFit, scale = TRUE)[[1]]
vi$var <- row.names(vi)
vi     <- reshape2::melt(vi)

ggplot(vi, aes(value, var, col = variable)) +
  geom_point() +
  facet_wrap(~ variable)

[Plot: per-class variable importance, one facet per class]

Random forest variable importance AND direction of correlation for binomial response

You could use something like an average marginal effect (or like below, an average first difference) approach.

First, I'll make some data:

set.seed(11)
n <- 200
p <- 5
X <- data.frame(matrix(runif(n * p), ncol = p))
yhat <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - .5)^2 +
        10 * -X[, 4] + 5 * -X[, 5]
y  <- as.numeric((yhat + rnorm(n)) > mean(yhat))
df <- as.data.frame(cbind(X, y))

Next, we'll estimate the RF model:

library(randomForest)
rf <- randomForest(as.factor(y) ~ ., data=df)

Next, we can loop through each variable; each time through the loop, we add one standard deviation to a single x variable for all observations (in your case, you could instead switch a categorical variable from one category to another). We then predict the probability of a positive response under both conditions, the original data and the data with a standard deviation added to the variable, take the difference, and summarize it.

nx  <- names(df)
nx  <- nx[-which(nx == "y")]
res <- NULL
for (i in 1:length(nx)) {
  p1 <- predict(rf, newdata = df, type = "prob")    # predictions on the original data
  df2 <- df
  df2[[nx[i]]] <- df2[[nx[i]]] + sd(df2[[nx[i]]])   # shift this variable by one SD
  p2 <- predict(rf, newdata = df2, type = "prob")   # predictions on the shifted data
  diff <- (p2 - p1)[, 2]                            # change in P(y = 1)
  res  <- rbind(res, c(mean(diff), sd(diff)))
}
colnames(res) <- c("effect", "sd")
rownames(res) <- nx
res
# effect sd
# X1 0.11079 0.18491252
# X2 0.10265 0.16552070
# X3 0.02015 0.07951409
# X4 -0.11687 0.16671916
# X5 -0.04704 0.10274836

R - Interpreting Random Forest Importance

In order to interpret my results in a research paper, I need to understand whether the variables have a positive or negative impact on the response variable.

You need to perform a "feature impact" analysis, not a "feature importance" analysis.

Algorithmically, it's about traversing the decision tree data structures and observing what the impact of each split was on the prediction outcome. For example, consider the split "age <= 40". Does the left branch (condition evaluates to true) carry a lower likelihood than the right branch (condition evaluates to false)?

Feature importances may give you a hint about which features to look at, but they cannot be "transformed" into feature impacts.

You might find the following articles helpful: WHY did your model predict THAT? (Part 1 of 2) and WHY did your model predict THAT? (Part 2 of 2).
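To make the idea of traversing the tree structures a bit more concrete, here is a rough sketch (the data and variable names are made up for illustration) showing how randomForest::getTree() exposes each tree as a table of split variables, split points and daughter nodes, which is the raw material such an analysis would aggregate over all trees:

library(randomForest)
set.seed(1)
# toy data: a binary response and two hypothetical predictors
d <- data.frame(y      = factor(rbinom(200, 1, 0.5)),
                age    = runif(200, 20, 60),
                income = rnorm(200))
rf <- randomForest(y ~ ., data = d, ntree = 50)
# split variable, split point and prediction for each node of tree 1
head(getTree(rf, k = 1, labelVar = TRUE))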

Variable importance plot using randomforest package in R

It should be importance=TRUE instead of Importance=TRUE. Please see below for a reproducible example:

rf = randomForest(Species ~ .,data=iris,Importance=TRUE)
importance(rf,type=1)

Sepal.Length
Sepal.Width
Petal.Length
Petal.Width

rf = randomForest(Species ~ .,data=iris,importance=TRUE)
importance(rf,type=1)
MeanDecreaseAccuracy
Sepal.Length 10.035280
Sepal.Width 4.849584
Petal.Length 32.512948
Petal.Width 34.386394

Measures of variable importance in random forests

The first one can be 'interpreted' as follows: if a predictor is important in your current model, then assigning other values to that predictor randomly but 'realistically' (i.e. permuting this predictor's values over your dataset) should have a negative influence on prediction. In other words, using the same model to predict from data that are identical except for that one variable should give worse predictions.

So you take a predictive measure (MSE) with the original dataset and then with the 'permuted' dataset, and you compare them somehow. One way, particularly since we expect the original MSE to always be smaller, is to take the difference. Finally, to make the values comparable across variables, these are scaled.
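As a sketch of that logic (not the exact implementation: randomForest permutes out-of-bag cases within each tree, and the data here are invented), you can permute one predictor at a time and look at how much the MSE increases:

library(randomForest)
set.seed(1)
n   <- 300
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- 3 * dat$x1 + rnorm(n)          # only x1 actually matters
rf  <- randomForest(y ~ ., data = dat)

mse  <- function(truth, pred) mean((truth - pred)^2)
base <- mse(dat$y, predict(rf, dat))

sapply(c("x1", "x2"), function(v) {
  d      <- dat
  d[[v]] <- sample(d[[v]])              # permute this predictor only
  mse(dat$y, predict(rf, d)) - base     # increase in MSE ~ importance
})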

For the second one: at each split, you can calculate how much that split reduces node impurity (for regression trees, this is indeed the difference between the RSS before and after the split). This is summed over all splits for that variable, over all trees.
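A toy numerical example of the impurity decrease at a single split (the class counts are invented):

gini <- function(p) 1 - sum(p^2)        # Gini impurity of a node

parent <- gini(c(50, 50) / 100)         # 50 of class A, 50 of class B -> 0.5

# the split sends 40 A + 10 B left, and 10 A + 40 B right
left  <- gini(c(40, 10) / 50)
right <- gini(c(10, 40) / 50)
child <- 0.5 * left + 0.5 * right       # children weighted by node size

parent - child                          # decrease credited to the splitting variable (0.18)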

Note: a good read is Elements of Statistical Learning by Hastie, Tibshirani and Friedman...

Customizing Importance Plot - R

Here are a couple of options:

library(randomForest)
library(tidyverse)

# Random forest model
iris.rf <- randomForest(Species ~ ., data=iris, importance=TRUE)

# Get importance values as a data frame
imp = as.data.frame(importance(iris.rf))
imp = cbind(vars=rownames(imp), imp)
imp = imp[order(imp$MeanDecreaseAccuracy),]
imp$vars = factor(imp$vars, levels=unique(imp$vars))

barplot(imp$MeanDecreaseAccuracy, names.arg=imp$vars)

[Plot: barplot of MeanDecreaseAccuracy by variable]

imp %>%
  pivot_longer(cols = matches("Mean")) %>%
  ggplot(aes(value, vars)) +
  geom_col() +
  geom_text(aes(label = round(value), x = 0.5 * value), size = 3, colour = "white") +
  facet_grid(. ~ name, scales = "free_x") +
  scale_x_continuous(expand = expansion(c(0, 0.04))) +
  theme_bw() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        axis.title = element_blank())

[Plot: faceted bar chart of MeanDecreaseAccuracy and MeanDecreaseGini]

I also wouldn't give up on the dotchart, which (IMHO) is a cleaner visualization. Here are options that are more customized than the built-in output in your question:

dotchart(imp$MeanDecreaseAccuracy, imp$vars,
         xlim = c(0, max(imp$MeanDecreaseAccuracy)), pch = 16)

[Plot: dot chart of MeanDecreaseAccuracy by variable]

imp %>%
  pivot_longer(cols = matches("Mean")) %>%
  ggplot(aes(value, vars)) +
  geom_point() +
  facet_grid(. ~ name) +
  scale_x_continuous(limits = c(0, NA), expand = expansion(c(0, 0.04))) +
  theme_bw() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(),
        axis.title = element_blank())

[Plot: faceted dot plot of the importance measures]

You could also plot the values themselves instead of point markers. For example:

imp %>%
  pivot_longer(cols = matches("Mean")) %>%
  ggplot(aes(value, vars)) +
  geom_text(aes(label = round(value, 1)), size = 3) +
  facet_grid(. ~ name, scales = "free_x") +
  scale_x_continuous(limits = c(0, NA), expand = expansion(c(0, 0.06))) +
  theme_bw() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(),
        axis.title = element_blank())

[Plot: importance values displayed as text labels]


