The Result of rpart Is Just One Root Node

The result of rpart is just a root node, but the data shows information gain

The data you provided does not reflect the ratio of the two target classes, so I've tweaked the data to better reflect that (see Data section):

> prop.table(table(train$Target))

         0          1 
0.96707581 0.03292419

> 700/27700
[1] 0.02527076

The ratios are now relatively close...

library(rpart)
tree <- rpart(Target ~ ., data=train, method="class")
printcp(tree)

Results in:

Classification tree:
rpart(formula = Target ~ ., data = train, method = "class")

Variables actually used in tree construction:
character(0)

Root node error: 912/27700 = 0.032924

n= 27700

CP nsplit rel error xerror xstd
1 0 0 1 0 0

Now, the reason you are seeing only the root node for your first model is probably that you have extremely imbalanced target classes, so your independent variables cannot provide enough information to grow the tree. My sample data has a 3.3% event rate, but yours has only about 2.5%!

As you have mentioned, there is a way to force rpart to grow the tree: override the default complexity parameter (cp). The complexity measure is a combination of the size of the tree and how well the tree separates the target classes. From ?rpart.control: "Any split that does not decrease the overall lack of fit by a factor of cp is not attempted". This means that, at this point, your model has no split beyond the root node that decreases the lack of fit enough for rpart to consider it. We can relax this threshold of what counts as "enough" by setting either a low or a negative cp (a negative cp basically forces the tree to grow to its full size).

tree <- rpart(Target ~ ., data = train, method = "class",
              parms = list(split = "information"),
              control = rpart.control(minsplit = 1, minbucket = 2, cp = 0.00002))
printcp(tree)

Results in:

Classification tree:
rpart(formula = Target ~ ., data = train, method = "class", parms = list(split = "information"),
control = rpart.control(minsplit = 1, minbucket = 2, cp = 2e-05))

Variables actually used in tree construction:
[1] ID V1 V2 V3 V5 V6

Root node error: 912/27700 = 0.032924

n= 27700

CP nsplit rel error xerror xstd
1 4.1118e-04 0 1.00000 1.0000 0.032564
2 3.6550e-04 30 0.98355 1.0285 0.033009
3 3.2489e-04 45 0.97807 1.0702 0.033647
4 3.1328e-04 106 0.95504 1.0877 0.033911
5 2.7412e-04 116 0.95175 1.1031 0.034141
6 2.5304e-04 132 0.94737 1.1217 0.034417
7 2.1930e-04 149 0.94298 1.1458 0.034771
8 1.9936e-04 159 0.94079 1.1502 0.034835
9 1.8275e-04 181 0.93640 1.1645 0.035041
10 1.6447e-04 193 0.93421 1.1864 0.035356
11 1.5664e-04 233 0.92654 1.1853 0.035341
12 1.3706e-04 320 0.91228 1.2083 0.035668
13 1.2183e-04 344 0.90899 1.2127 0.035730
14 9.9681e-05 353 0.90789 1.2237 0.035885
15 2.0000e-05 364 0.90680 1.2259 0.035915

As you can see, the tree has now grown, keeping every split that improves the fit by at least the (very small) cp. Two things to note:

  1. At zero nsplit, CP is already as low as 0.0004, whereas the default cp in rpart is set to 0.01.
  2. Starting from nsplit == 0, the cross validation error (xerror) increases as you increase the number of splits.

Both of these indicate that your model starts overfitting the data as soon as you go beyond nsplit == 0: adding more splits (and therefore more independent variables) does not add enough information (the reduction in CP is insufficient) to lower the cross-validation error. With this being said, your root node model is the best model in this case, which explains why your initial model has only the root node.

pruned.tree <- prune(tree, cp = tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"])
printcp(pruned.tree)

Results in:

Classification tree:
rpart(formula = Target ~ ., data = train, method = "class", parms = list(split = "information"),
control = rpart.control(minsplit = 1, minbucket = 2, cp = 2e-05))

Variables actually used in tree construction:
character(0)

Root node error: 912/27700 = 0.032924

n= 27700

CP nsplit rel error xerror xstd
1 0.00041118 0 1 1 0.032564

As for the pruning part, it is now clearer why your pruned tree is the root node tree, since any tree with more than 0 splits has a higher cross-validation error. Taking the tree with the minimum xerror leaves you with the root node tree, as expected.

Information gain basically tells you how much "information" is added for each split. So technically, every split has some degree of information gain since you are adding more variables into your model (information gain is always non-negative). What you should think about is whether that additional gain (or no gain) reduces the errors enough for you to warrant a more complex model. Hence, the tradeoff between bias and variance.
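
To make this concrete, here is a minimal sketch of how an entropy-based information gain could be computed by hand for a single candidate split. The helper functions entropy() and info_gain() are my own illustrative names, not part of rpart:

# Shannon entropy (base 2) of a vector of class labels
entropy <- function(y) {
  p <- prop.table(table(y))
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain from partitioning y by a grouping vector g
info_gain <- function(y, g) {
  w     <- prop.table(table(g))   # share of observations in each child
  child <- tapply(y, g, entropy)  # entropy within each child node
  entropy(y) - sum(w * child)     # always >= 0
}

# Example: gain from splitting the simulated data on V6 == 0
info_gain(train$Target, train$V6 == 0)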

In this case, it doesn't really make sense for you to reduce cp and later prune the resulting tree, since by setting a low cp you are telling rpart to make splits even if they overfit, while pruning "cuts" back all the nodes that overfit.

Data:

Note that I am resampling the values within each column (with replacement) instead of sampling row indices. This is because the data you provided is probably not a random sample of your original dataset (it is likely biased), so I am essentially creating new observations from combinations of your existing rows, which should hopefully reduce that bias.

init_train = structure(list(ID = structure(c(16L, 24L, 29L, 30L, 31L, 1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L,
17L, 18L, 19L, 20L, 21L, 22L, 23L, 25L, 26L, 27L, 28L), .Label = c("SDataID10",
"SDataID11", "SDataID13", "SDataID14", "SDataID15", "SDataID16",
"SDataID17", "SDataID18", "SDataID19", "SDataID20", "SDataID21",
"SDataID24", "SDataID25", "SDataID28", "SDataID29", "SDataID3",
"SDataID31", "SDataID32", "SDataID34", "SDataID35", "SDataID37",
"SDataID38", "SDataID39", "SDataID4", "SDataID43", "SDataID44",
"SDataID45", "SDataID46", "SDataID5", "SDataID7", "SDataID8"), class = "factor"),
V1 = c(161L, 11L, 32L, 13L, 194L, 63L, 89L, 78L, 87L, 81L,
63L, 198L, 9L, 196L, 189L, 116L, 104L, 5L, 173L, 5L, 87L,
5L, 45L, 19L, 133L, 8L, 42L, 45L, 45L, 176L, 63L), V2 = structure(c(1L,
3L, 3L, 1L, 3L, 2L, 1L, 3L, 3L, 1L, 1L, 1L, 3L, 1L, 3L, 2L,
1L, 1L, 3L, 1L, 1L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("ONE", "THREE", "TWO"), class = "factor"),
V3 = c(1L, 2L, 2L, 1L, 2L, 3L, 1L, 2L, 2L, 1L, 3L, 3L, 3L,
2L, 2L, 3L, 1L, 2L, 3L, 3L, 3L, 2L, 1L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L), V5 = structure(c(1L, 3L, 1L, 3L, 1L, 1L, 1L,
1L, 3L, 3L, 1L, 3L, 3L, 3L, 2L, 4L, 1L, 2L, 1L, 2L, 1L, 3L,
1L, 3L, 1L, 3L, 3L, 3L, 1L, 1L, 3L), .Label = c("FOUR", "ONE",
"THREE", "TWO"), class = "factor"), V6 = c(0L, 2L, 2L, 2L,
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 2L, 1L, 0L, 0L, 3L, 0L,
3L, 3L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 3L), Target = c(0L,
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L
)), .Names = c("ID", "V1", "V2", "V3", "V5", "V6", "Target"
), class = "data.frame", row.names = c(NA, -31L))

set.seed(1000)
train = as.data.frame(lapply(init_train, function(x) sample(x, 27700, replace = TRUE)))

rpart stops at the root node and does not split further even when there is an obvious information gain

As I said in my comment, this is meant to avoid overfitting. Formally, there is the argument minsplit, which is preset to 20 but can be adjusted to give the result you seek:

> library(rpart)
> df <- data.frame(x=rep(c(FALSE,TRUE), each=5), y=c(rep(FALSE,7), rep(TRUE,3)))
> rpart(y ~ x, data=df, minsplit=2)
n= 10

node), split, n, deviance, yval
* denotes terminal node

1) root 10 2.1 0.3
2) x< 0.5 5 0.0 0.0 *
3) x>=0.5 5 1.2 0.6 *

Find more arguments to avoid overfitting (e.g. cp and maxdepth) in

help(rpart.control)
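
For example, a short sketch combining these with minsplit on the toy data frame df from above; the specific values are arbitrary and only meant to illustrate the control arguments:

library(rpart)

# Tighten the growth criteria explicitly: require a larger per-split
# improvement (cp) and cap the tree depth (maxdepth); values are arbitrary
fit <- rpart(y ~ x, data = df,
             control = rpart.control(minsplit = 2, cp = 0.05, maxdepth = 3))
fit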

EDIT: With method="class" the output changes to

> rpart(y ~ x, data=df, minsplit=2, method="class")
n= 10

node), split, n, loss, yval, (yprob)
* denotes terminal node

1) root 10 3 FALSE (0.7000000 0.3000000)
2) x< 0.5 5 0 FALSE (1.0000000 0.0000000) *
3) x>=0.5 5 2 TRUE (0.4000000 0.6000000) *

The prp() function from rpart.plot in R only plots a single leaf node. Why?

You are getting a tree with a single node because you are using the default settings for rpart. The documentation is a little indirect: it tells you that there is a parameter called control and says "See rpart.control." If you click through to the documentation for rpart.control, you will see a parameter called minsplit, described as "the minimum number of observations that must exist in a node in order for a split to be attempted." The default value is 20, and you only have 14 data points altogether, so the root node will never be split. Instead, use rpart.control to set minsplit to a lower value (try 2), as in the sketch below.
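
A minimal sketch of what that could look like, using a made-up 14-row data set (the variable names below are purely illustrative, not the asker's actual data):

library(rpart)
library(rpart.plot)   # provides prp()

# A made-up data set with 14 rows, mirroring the situation described above
toy <- data.frame(x = factor(rep(c("a", "b"), each = 7)),
                  y = factor(c(rep(0, 7), rep(1, 5), 0, 0)))

# The default minsplit = 20 exceeds the 14 rows, so only the root node appears
prp(rpart(y ~ x, data = toy, method = "class"))

# Lowering minsplit allows the root node to be split
prp(rpart(y ~ x, data = toy, method = "class",
          control = rpart.control(minsplit = 2)))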

How to get root node error value from rpart printcp function?

You can get the root node error value from the frame component of your fit via:

 fit$frame[1, 'dev']/fit$frame[1, 'n']

or from the yval2.V5 entry in the first row of fit$frame.
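
For instance, a quick sketch using the small df from the earlier answer (the object name fit is arbitrary); the value matches the "Root node error" line in the printcp() header:

library(rpart)

df  <- data.frame(x = rep(c(FALSE, TRUE), each = 5),
                  y = c(rep(FALSE, 7), rep(TRUE, 3)))
fit <- rpart(y ~ x, data = df, method = "class", minsplit = 2)

# misclassified count at the root divided by the number of observations
fit$frame[1, "dev"] / fit$frame[1, "n"]   # 3/10 = 0.3
printcp(fit)                              # header shows: Root node error: 3/10 = 0.3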


