C5.0 Decision Tree - C50 Code Called Exit with Value 1

C5.0 decision tree - c50 code called exit with value 1

For anyone interested, the data can be found here: http://www.kaggle.com/c/titanic-gettingStarted/data. I think you need to be registered in order to download it.

Regarding your problem, first off, I think you meant to write

new_model <- C5.0(train[,-2],train$Survived)

Next, notice the structure of the Cabin and Embarked columns. These two factors have an empty string as a level name (check with levels(train$Embarked)). This is the point where C50 falls over. If you modify your data so that

levels(train$Cabin)[1] <- "missing"
levels(train$Embarked)[1] <- "missing"

your algorithm will now run without an error.
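
More generally, if other factor columns share the same empty level, you can rename it everywhere in one pass. A minimal sketch, assuming the data frame is called train as above:

# Rename an empty-string level to "missing" in every factor column.
for (col in names(train)) {
  if (is.factor(train[[col]])) {
    lev <- levels(train[[col]])
    lev[lev == ""] <- "missing"   # only touches the empty level, if present
    levels(train[[col]]) <- lev
  }
}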

C50 failed in R with c50 code called exit with value 1

The problem is the region variable -- I think C5.0 doesn't like the colons in its values. I recreated your dataset with:

region <- c("AL","AR","AZ","CA","CO","CT","DC","FL")

And then it worked with no errors:

treeModel <- C5.0(x=test_set[,-15],y=test_set[,15])
treeModel

...
Evaluation on training data (1000 cases):

        Decision Tree
      ----------------
      Size      Errors

       103  220(22.0%)   <<

       (a)   (b)    <-classified as
      ----  ----
       358   122    (a): class 1
        98   422    (b): class 2

    Attribute usage:

    100.00%  user_hour
     28.30%  region
     27.30%  dma
     24.30%  city
     17.60%  user_day
     15.40%  size
     12.70%  placement
      9.10%  user_group
      7.90%  browser
      6.50%  os_extended
      4.70%  publisher
      4.40%  position
      3.70%  domain
      3.00%  seller_memeber_id

I also recoded the dependent variable as 1 and 2 just in case the string with the ranges was giving it a problem, but that didn't seem to matter at all (however, that is why the output above reports class 1 and class 2).
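
If you would rather keep the original values, an alternative sketch (assuming the offending column is test_set$region and its values contain colons) is to strip the colons from the factor levels instead of recreating the column:

# Remove colons from the level names; the underlying data stays the same.
levels(test_set$region) <- gsub(":", "", levels(test_set$region))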

C50 code called exit with value 1 (using a factor decision variable with non-empty values)

You need to clean your data in a few ways.

  • Remove the unnecessary columns with only one level. They contain no information and lead to problems.
  • Convert the class of the target variable rorIn$Readmit into a factor.
  • Separate the target variable from the data set that you supply for the training.

This should work:

rorIn <- read.csv("RoRdataInputData_v1.6.csv", header = TRUE)
rorIn$Readmit <- as.factor(rorIn$Readmit)   # make the target a factor
library(Hmisc)
# Find columns with only one level; they carry no information.
singleLevelVars <- names(rorIn)[contents(rorIn)$contents$Levels == 1]
# Keep everything except the target and the single-level columns.
trainvars <- setdiff(colnames(rorIn), c("Readmit", singleLevelVars))
library(C50)
RoRmodel <- C5.0(rorIn[, trainvars], rorIn$Readmit, trials = 10)
predict(RoRmodel, rorIn[, trainvars])
#[1] 1 0 1 0 0 0 1 0
#Levels: 0 1

You can then evaluate accuracy, recall, and other statistics by comparing this predicted result with the actual value of the target variable:

rorIn$Readmit
#[1] 1 0 1 0 1 0 1 0
#Levels: 0 1

The usual way is to set up a confusion matrix to compare actual and predicted values in binary classification problems. In the case of this small data set one can easily see that there is only one false negative result. So the code seems to work pretty well, but this encouraging result can be deceptive due to the very small number of observations.

library(gmodels)
actual <- rorIn$Readmit                         # true labels
predicted <- predict(RoRmodel, rorIn[, trainvars])
CrossTable(actual, predicted, prop.chisq = FALSE, prop.r = FALSE)
# Total Observations in Table: 8
#
#
# | predicted
# actual | 0 | 1 | Row Total |
#--------------|-----------|-----------|-----------|
# 0 | 4 | 0 | 4 |
# | 0.800 | 0.000 | |
# | 0.500 | 0.000 | |
#--------------|-----------|-----------|-----------|
# 1 | 1 | 3 | 4 |
# | 0.200 | 1.000 | |
# | 0.125 | 0.375 | |
#--------------|-----------|-----------|-----------|
# Column Total | 5 | 3 | 8 |
# | 0.625 | 0.375 | |
#--------------|-----------|-----------|-----------|
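
From this table you can also compute the statistics directly; a minimal sketch using the actual and predicted vectors defined above:

accuracy <- mean(predicted == actual)                                  # (4 + 3) / 8 = 0.875
recall   <- sum(predicted == "1" & actual == "1") / sum(actual == "1") # 3 / 4 = 0.75
accuracy
recall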

On a larger data set it would be useful, if not necessary, to separate the set into training data and test data. There is a lot of good literature on machine learning that will help you in fine-tuning the model and its predictions.
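
For illustration, a minimal sketch of such a split on this data set (the 70/30 ratio and the seed are arbitrary choices, not part of the original answer):

set.seed(42)                                   # for reproducibility
trainIdx <- sample(nrow(rorIn), size = round(0.7 * nrow(rorIn)))
trainSet <- rorIn[trainIdx, ]
testSet  <- rorIn[-trainIdx, ]
model <- C5.0(trainSet[, trainvars], trainSet$Readmit, trials = 10)
pred  <- predict(model, testSet[, trainvars])
mean(pred == testSet$Readmit)                  # accuracy on held-out data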

C50 code called exit with value 1 on Mushroom data set

f <- file("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data", open = "r")
# Read the attributes as factors; C5.0 needs a factor target.
data <- read.table(f, sep = ",", header = FALSE, stringsAsFactors = TRUE)
str(data)

pacman::p_load(C50)   # load (installing if necessary) the C50 package
# Column 1 is the class label; column 17 is excluded from the predictors.
C5.model <- C5.0(data[1:10000, c(2:16, 18:23)], data[1:10000, 1], trials = 3, na.action = na.pass)

Column 17 was the cause of this problem, as it has no identifying variation: it is the veil-type attribute, which takes the same value for every record in this data set.
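
Rather than hard-coding the column index, a short sketch that finds constant columns programmatically (the V1..V23 names come from read.table's defaults; per the note above this should flag V17):

# Flag columns with fewer than two distinct values.
constantCols <- names(data)[sapply(data, function(x) length(unique(x)) < 2)]
constantCols                                            # expected: "V17"
predictors <- setdiff(names(data)[-1], constantCols)    # drop class label and constants
C5.model <- C5.0(data[, predictors], data[, 1], trials = 3, na.action = na.pass)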


