How to Prune a Tree in R

How to prune a tree in R?

You have selected the tree with the minimum cross-validated error. An alternative is to use the smallest tree that is within 1 standard error of that best tree. The rationale is that, given the CV estimates of the error, the smallest tree within 1 standard error predicts essentially as well as the best (lowest-CV-error) tree, yet it does so with fewer "terms" (terminal nodes).

Plot the cost-complexity vs tree size for the un-pruned tree via:

plotcp(tree)

In that plot, find the leftmost tree whose cross-validated error lies within one standard error (the error bar) of the tree with minimum error, and prune at that tree's cp value.

There could be many reasons why pruning is not affecting the fitted tree. For example, the best tree could be the one at which the algorithm already stopped according to the stopping rules specified in ?rpart.control.
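The 1-SE selection can also be done programmatically from the cptable rather than read off the plot. Below is a minimal sketch with rpart, using the kyphosis data that ships with the package (your own fitted rpart object works the same way):

```r
library(rpart)

# Fit a deliberately large tree; xval = 10 gives 10-fold CV estimates
set.seed(42)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             control = rpart.control(cp = 0.001, xval = 10))

cp_tab <- fit$cptable
best   <- which.min(cp_tab[, "xerror"])                  # row with minimum CV error
thresh <- cp_tab[best, "xerror"] + cp_tab[best, "xstd"]  # 1-SE threshold

# Smallest tree (earliest row) whose CV error is within 1 SE of the best
se1    <- min(which(cp_tab[, "xerror"] <= thresh))
pruned <- prune(fit, cp = cp_tab[se1, "CP"])
```

The rows of cptable are ordered from the smallest tree downward, so the first row meeting the threshold is the 1-SE tree; pruning at its CP value gives the simpler model.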

Error in prune.tree: "can not prune singlenode tree" (R tree package)

This error is generated by cv.tree when the tree is completely pruned and only the root node remains. I can reproduce your error by generating a set of X variables not associated with Y.

library(tree)

# Data-generating process:
# Y is NOT associated with any of the X variables
set.seed(1234)
X <- matrix(rnorm(7499 * 18), ncol = 18)
Y <- rbinom(7499, 1, 0.5)
data <- data.frame(Y = factor(Y, labels = c("No", "Yes")), X)
idx <- sample(1:nrow(data), 6000)
data.train <- data[idx, ]

# Train the tree (nobs should match the training set)
tree.data <- tree(Y ~ ., data.train,
                  control = tree.control(nrow(data.train), mincut = 10,
                                         minsize = 20, mindev = 0.001))
plot(tree.data)
text(tree.data, pretty = 0, cex = 0.6)

# Pruning by cv.tree
cv.data <- cv.tree(tree.data, FUN = prune.misclass)

And the error message is:

Error in prune.tree(tree = list(frame = list(var = 1L, n = 4842, dev =
6712.03745626047, : can not prune singlenode tree

Suppose now that X1 is associated with Y.

# Data-generating process: Y depends on X1
set.seed(1234)
X <- matrix(rnorm(7499 * 18), ncol = 18)
# Note: `>` has lower precedence than `+`, so this is parsed as
# X[, 1] > (0 + rbinom(...)), a logical vector associated with X1
Y <- X[, 1] > 0 + rbinom(7499, 1, 0.2)
data <- data.frame(Y = factor(Y, labels = c("No", "Yes")), X)
idx <- sample(1:nrow(data), 6000)
data.train <- data[idx, ]

Once the tree is re-trained on the new data, the cv.tree command no longer throws an error:

# Re-train the tree on the new data
tree.data <- tree(Y ~ ., data.train,
                  control = tree.control(nrow(data.train), mincut = 10,
                                         minsize = 20, mindev = 0.001))
# Pruning by cv.tree
cv.data <- cv.tree(tree.data, FUN = prune.misclass)
pruned.tree <- prune.tree(tree.data, k = cv.data$k[3])
plot(pruned.tree)
text(pruned.tree, pretty = 0, cex = 0.6)

[Plot of the pruned tree]

R: Pruning a data.tree without altering the original tree

After some fiddling around I finally got the following to work. There are no good examples out there, so I thought I would leave one here.

print(acme, "cost", pruneFun = function(node) Pruner(node))
        levelName    cost
1       Acme Inc. 4950000
2  ¦--Accounting  1500000
3    ¦--Research  2750000
4          °--IT   700000
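For a fully self-contained variant (the Pruner function above is the asker's own and is not shown), here is a sketch using the acme example tree that ships with data.tree. Costs live on the leaves, so they are aggregated upward first; printing with pruneFun then prunes only the printout, not the tree itself:

```r
library(data.tree)

data(acme)

# Roll leaf costs up to the internal nodes (post-order: children first)
acme$Do(function(node) node$cost <- Aggregate(node, "cost", sum),
        traversal = "post-order")

# pruneFun returns TRUE for the nodes to keep; only the display is
# pruned -- acme itself is left unchanged
print(acme, "cost", pruneFun = function(node) node$level <= 2)
```

Because pruneFun only filters the view, the deeper nodes remain in the tree and can still be reached afterwards.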

Tree cross-validation with the tree package in R

The eight values displayed in the output are not the folds from the cross-validation. The documentation for cv.tree says of the output:

Value

A copy of FUN applied to object, with component dev replaced by the cross-validated
results from the sum of the dev components of each fit.

Since you did not specify the FUN argument to cv.tree, you get the default prune.tree. What is the output of prune.tree? The documentation says:

Determines a nested sequence of subtrees of the supplied tree by
recursively "snipping" off the least important splits, based upon the
cost-complexity measure. prune.misclass is an abbreviation for
prune.tree(method = "misclass") for use with cv.tree.

Notice that your tree has exactly 8 leaves.

plot(tree.boston)
text(tree.boston)

Plot of tree

prune.tree is showing you the deviance of the eight trees, snipping off the leaves one by one. cv.tree is showing you a cross-validated version of this. Instead of computing the deviance on the full training data, it uses cross-validated values for each of the eight successive prunings.

Compare the deviance in the outputs of just using prune.tree with the cross validated deviance.

prune.tree(tree.boston)

$dev
[1] 3098.610 3354.268 3806.195 4574.704 5393.592 6952.719 11229.299
[8] 20894.657

cv.tree(tree.boston, K=5)

$dev
[1] 4768.281 4783.625 5718.441 6309.655 6329.011 7078.719 12907.505
[8] 20974.393

Notice that the cross-validated values are rather higher at every step. prune.tree alone evaluates on the training data and so under-reports the deviance; the cross-validated values are more realistic.
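The same training-vs-CV gap is easy to reproduce with rpart (used here instead of the tree package, since tree.boston and its data are not shown above). In the cptable, "rel error" is measured on the training data while "xerror" is cross-validated:

```r
library(rpart)

# car.test.frame ships with rpart; grow a deliberately deep tree
set.seed(1)
fit <- rpart(Mileage ~ Weight, data = car.test.frame,
             control = rpart.control(cp = 0.001, xval = 10))

tab <- fit$cptable
# Training error vs cross-validated error, by tree size
cbind(size  = tab[, "nsplit"] + 1,
      train = tab[, "rel error"],
      cv    = tab[, "xerror"])
```

As with cv.tree, the cross-validated column sits above the training column, and the gap is widest for the largest tree, which overfits the training data the most.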


