result of rpart is a root, but data shows Information Gain
The data you provided does not reflect the true ratio of the two target classes, so I've tweaked the data to reflect it better (see the Data section below):
> prop.table(table(train$Target))
0 1
0.96707581 0.03292419
> 700/27700
[1] 0.02527076
The ratios are now relatively close...
library(rpart)
tree <- rpart(Target ~ ., data=train, method="class")
printcp(tree)
Results in:
Classification tree:
rpart(formula = Target ~ ., data = train, method = "class")
Variables actually used in tree construction:
character(0)
Root node error: 912/27700 = 0.032924
n= 27700
CP nsplit rel error xerror xstd
1 0 0 1 0 0
Now, the reason you are seeing only the root node for your first model is probably that you have extremely imbalanced target classes, so your independent variables could not provide enough information to grow the tree. My sample data has a 3.3% event rate, but yours has only about 2.5%!
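Class imbalance can also be addressed directly through the prior element of parms, which re-weights the classes so that a rare "1" counts as much as a common "0" when candidate splits are scored. A minimal sketch on simulated data (the variable names and the roughly 3% event rate here are my own assumptions, not taken from your dataset):

```r
library(rpart)
set.seed(1)

# Toy imbalanced data: roughly a 3% event rate, with x weakly predictive of y.
n <- 2000
x <- rnorm(n)
y <- factor(rbinom(n, 1, ifelse(x > 1, 0.15, 0.01)))

# Default fit: with this much imbalance, it may well stay at the root.
default.fit <- rpart(y ~ x, method = "class")

# Equal priors make rpart value both classes equally when evaluating splits.
prior.fit <- rpart(y ~ x, method = "class",
                   parms = list(prior = c(0.5, 0.5)))
```

Whether the prior-weighted tree actually grows still depends on how informative the predictors are; the priors only change how candidate splits are scored, not the information in the data.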
As you have mentioned, there is a way to force rpart to grow the tree: override the default complexity parameter (cp). The complexity measure is a combination of the size of the tree and how well the tree separates the target classes. From ?rpart.control: "Any split that does not decrease the overall lack of fit by a factor of cp is not attempted". This means that your model at this point does not have a split beyond the root node that decreases the complexity level enough for rpart to take it into consideration. We can relax this threshold of what counts as "enough" by setting either a low or a negative cp (a negative cp basically forces the tree to grow to its full size).
tree <- rpart(Target ~ ., data = train, method = "class",
              parms = list(split = "information"),
              control = rpart.control(minsplit = 1, minbucket = 2, cp = 0.00002))
printcp(tree)
Results in:
Classification tree:
rpart(formula = Target ~ ., data = train, method = "class", parms = list(split = "information"),
control = rpart.control(minsplit = 1, minbucket = 2, cp = 2e-05))
Variables actually used in tree construction:
[1] ID V1 V2 V3 V5 V6
Root node error: 912/27700 = 0.032924
n= 27700
CP nsplit rel error xerror xstd
1 4.1118e-04 0 1.00000 1.0000 0.032564
2 3.6550e-04 30 0.98355 1.0285 0.033009
3 3.2489e-04 45 0.97807 1.0702 0.033647
4 3.1328e-04 106 0.95504 1.0877 0.033911
5 2.7412e-04 116 0.95175 1.1031 0.034141
6 2.5304e-04 132 0.94737 1.1217 0.034417
7 2.1930e-04 149 0.94298 1.1458 0.034771
8 1.9936e-04 159 0.94079 1.1502 0.034835
9 1.8275e-04 181 0.93640 1.1645 0.035041
10 1.6447e-04 193 0.93421 1.1864 0.035356
11 1.5664e-04 233 0.92654 1.1853 0.035341
12 1.3706e-04 320 0.91228 1.2083 0.035668
13 1.2183e-04 344 0.90899 1.2127 0.035730
14 9.9681e-05 353 0.90789 1.2237 0.035885
15 2.0000e-05 364 0.90680 1.2259 0.035915
As you can see, the tree has grown to a size that reduces the complexity level by at least cp. Two things to note:

- At zero nsplit, CP is already as low as 0.0004, whereas the default cp in rpart is set to 0.01.
- Starting from nsplit == 0, the cross-validation error (xerror) increases as you increase the number of splits.
Both of these indicate that your model is overfitting the data beyond nsplit == 0: each additional split does not add enough information (the reduction in CP is insufficient) to lower the cross-validation error. With this being said, your root node model is the best model in this case, which explains why your initial model has only the root node.
pruned.tree <- prune(tree, cp = tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"])
printcp(pruned.tree)
Results in:
Classification tree:
rpart(formula = Target ~ ., data = train, method = "class", parms = list(split = "information"),
control = rpart.control(minsplit = 1, minbucket = 2, cp = 2e-05))
Variables actually used in tree construction:
character(0)
Root node error: 912/27700 = 0.032924
n= 27700
CP nsplit rel error xerror xstd
1 0.00041118 0 1 1 0.032564
As for the pruning part, it is now clearer why your pruned tree is the root node tree: any tree with more than 0 splits has a higher cross-validation error. Taking the tree with the minimum xerror therefore leaves you with the root node tree, as expected.
Information gain basically tells you how much "information" each split adds. Technically, every split has some degree of information gain, since splitting on an additional variable never removes information (information gain is always non-negative). What you should think about is whether that additional gain (or lack of gain) reduces the error enough to warrant a more complex model. This is the tradeoff between bias and variance.
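The arithmetic behind that statement is easy to verify by hand. As a sketch (my own toy numbers: a parent node with 7 negatives and 3 positives, split into a pure child of 5 negatives and a mixed child of 2 negatives and 3 positives):

```r
# Entropy of a node given its class counts, in bits.
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]             # treat 0 * log(0) as 0
  -sum(p * log2(p))
}

parent   <- entropy(c(7, 3))                    # ~0.881 bits
children <- (5 / 10) * entropy(c(5, 0)) +       # pure node: 0 bits
            (5 / 10) * entropy(c(2, 3))         # ~0.971 bits
gain <- parent - children                       # ~0.396, strictly positive
```

The gain is positive even though the tree may still not be worth growing: rpart's cp criterion asks whether that gain translates into a large enough drop in the fitted error, not merely whether the gain exists.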
In this case, it doesn't really make sense to reduce cp and later prune the resulting tree, since by setting a low cp you are telling rpart to make splits even if they overfit, while pruning then "cuts" all the nodes that overfit.
Data:
Note that I am shuffling the rows within each column and sampling from them, instead of sampling row indices. This is because the data you provided is probably not a random sample of your original dataset (and is likely biased), so I am randomly creating new observations from combinations of your existing rows, which will hopefully reduce that bias.
init_train = structure(list(ID = structure(c(16L, 24L, 29L, 30L, 31L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L,
17L, 18L, 19L, 20L, 21L, 22L, 23L, 25L, 26L, 27L, 28L), .Label = c("SDataID10",
"SDataID11", "SDataID13", "SDataID14", "SDataID15", "SDataID16",
"SDataID17", "SDataID18", "SDataID19", "SDataID20", "SDataID21",
"SDataID24", "SDataID25", "SDataID28", "SDataID29", "SDataID3",
"SDataID31", "SDataID32", "SDataID34", "SDataID35", "SDataID37",
"SDataID38", "SDataID39", "SDataID4", "SDataID43", "SDataID44",
"SDataID45", "SDataID46", "SDataID5", "SDataID7", "SDataID8"), class = "factor"),
V1 = c(161L, 11L, 32L, 13L, 194L, 63L, 89L, 78L, 87L, 81L,
63L, 198L, 9L, 196L, 189L, 116L, 104L, 5L, 173L, 5L, 87L,
5L, 45L, 19L, 133L, 8L, 42L, 45L, 45L, 176L, 63L), V2 = structure(c(1L,
3L, 3L, 1L, 3L, 2L, 1L, 3L, 3L, 1L, 1L, 1L, 3L, 1L, 3L, 2L,
1L, 1L, 3L, 1L, 1L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("ONE", "THREE", "TWO"), class = "factor"),
V3 = c(1L, 2L, 2L, 1L, 2L, 3L, 1L, 2L, 2L, 1L, 3L, 3L, 3L,
2L, 2L, 3L, 1L, 2L, 3L, 3L, 3L, 2L, 1L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L), V5 = structure(c(1L, 3L, 1L, 3L, 1L, 1L, 1L,
1L, 3L, 3L, 1L, 3L, 3L, 3L, 2L, 4L, 1L, 2L, 1L, 2L, 1L, 3L,
1L, 3L, 1L, 3L, 3L, 3L, 1L, 1L, 3L), .Label = c("FOUR", "ONE",
"THREE", "TWO"), class = "factor"), V6 = c(0L, 2L, 2L, 2L,
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 2L, 1L, 0L, 0L, 3L, 0L,
3L, 3L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 3L), Target = c(0L,
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L
)), .Names = c("ID", "V1", "V2", "V3", "V5", "V6", "Target"
), class = "data.frame", row.names = c(NA, -31L))
set.seed(1000)
train = as.data.frame(lapply(init_train, function(x) sample(x, 27700, replace = TRUE)))
rpart stops at root node and does not split further when there is an obvious information gain
As I said in my comment, this is meant to avoid overfitting. Formally, there is the argument minsplit, which is preset to 20 but can be lowered to give the result you seek:
> library(rpart)
> df <- data.frame(x=rep(c(FALSE,TRUE), each=5), y=c(rep(FALSE,7), rep(TRUE,3)))
> rpart(y ~ x, data=df, minsplit=2)
n= 10
node), split, n, deviance, yval
* denotes terminal node
1) root 10 2.1 0.3
2) x< 0.5 5 0.0 0.0 *
3) x>=0.5 5 1.2 0.6 *
You can find more arguments for avoiding overfitting (e.g. cp and maxdepth) in help(rpart.control).
EDIT: With method="class", the output changes to:
> rpart(y ~ x, data=df, minsplit=2, method="class")
n= 10
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 10 3 FALSE (0.7000000 0.3000000)
2) x< 0.5 5 0 FALSE (1.0000000 0.0000000) *
3) x>=0.5 5 2 TRUE (0.4000000 0.6000000) *
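The same toy data also shows the other two controls in action. A sketch (the parameter values are arbitrary choices for illustration): cp sets the minimum improvement a split must achieve, and maxdepth caps how deep the tree may grow.

```r
library(rpart)

df <- data.frame(x = rep(c(FALSE, TRUE), each = 5),
                 y = c(rep(FALSE, 7), rep(TRUE, 3)))

# minsplit = 2 allows tiny nodes to be split; cp = 0.001 accepts even
# small improvements; maxdepth = 3 still caps the depth of the tree.
fit <- rpart(y ~ x, data = df, method = "class",
             control = rpart.control(minsplit = 2, cp = 0.001, maxdepth = 3))
```

With a single binary predictor the tree can make at most one split, so fit$frame has three rows (root plus two leaves) no matter how permissive cp is.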
The prp() function from rpart in R only plots a single leaf node. Why?
You are getting a tree with a single node because you are using the default settings for rpart. The documentation is a little indirect: it tells you that there is a parameter called control and says "See rpart.control." If you click through to the documentation for rpart.control, you will see a parameter called minsplit, described as "the minimum number of observations that must exist in a node in order for a split to be attempted." The default value is 20, and you only have 14 data points altogether, so the root node will never be split. Instead, use rpart.control to set minsplit to a lower value (try 2).
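A minimal sketch of that fix, using 14 rows of the built-in iris data to mimic a dataset smaller than the default minsplit (the subset is my own construction):

```r
library(rpart)

small <- iris[c(1:7, 101:107), ]   # 14 rows: 7 setosa, 7 virginica

# Default minsplit = 20 exceeds the 14 observations, so no split is attempted.
root.only <- rpart(Species ~ ., data = small, method = "class")

# Lowering minsplit lets the root node be split.
fit <- rpart(Species ~ ., data = small, method = "class",
             control = rpart.control(minsplit = 2))
```

nrow(root.only$frame) is 1 (just the root), while nrow(fit$frame) is greater than 1 once the split is allowed.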
How to get root node error value from rpart printcp function?
You can get the root node error value from the frame component of your fit via:

fit$frame[1, 'dev'] / fit$frame[1, 'n']

or from the yval2.V5 entry in the first row of fit$frame.
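To see that identity concretely, here is a self-contained check on the kyphosis data that ships with rpart (my example; your fit object would come from your own data):

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

# At the root, dev is the misclassification count and n the total number of
# observations, so dev/n is exactly the "Root node error" printcp() reports.
root.error <- fit$frame[1, "dev"] / fit$frame[1, "n"]   # 17/81 = 0.20988
```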