C5.0 decision tree - c50 code called exit with value 1
For anyone interested, the data can be found here: http://www.kaggle.com/c/titanic-gettingStarted/data. I think you need to be registered in order to download it.
Regarding your problem, first off, I think you meant to write
new_model <- C5.0(train[,-2],train$Survived)
Next, notice the structure of the Cabin and Embarked columns. These two factors have an empty string as a level name (check with levels(train$Embarked)). This is the point where C50 falls over. If you modify your data such that
levels(train$Cabin)[1] = "missing"
levels(train$Embarked)[1] = "missing"
your algorithm will now run without an error.
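A minimal, self-contained sketch of this fix (toy vector, not the actual Titanic column): renaming an empty-string factor level in base R. Matching the level by value is a bit safer than assuming it is the first level.

```r
# Toy factor with an empty-string level, as in the Cabin/Embarked columns
embarked <- factor(c("", "S", "C", "Q", "S"))
levels(embarked)
# [1] ""  "C" "Q" "S"   -- the empty level sorts first

# Rename the empty level by value rather than by position
levels(embarked)[levels(embarked) == ""] <- "missing"
levels(embarked)
# [1] "missing" "C" "Q" "S"
```

The original observations keep their (renamed) level, so no data is lost.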
C50 failed in R with c50 code called exit with value 1
The problem is the variable region -- I think C5.0 doesn't like the colons in there. I recreated your dataset with:
region <- c("AL","AR","AZ","CA","CO","CT","DC","FL")
And then it worked with no errors:
treeModel <- C5.0(x=test_set[,-15],y=test_set[,15])
treeModel
...
Evaluation on training data (1000 cases):
    Decision Tree
  ----------------
  Size      Errors

   103  220(22.0%)   <<

   (a)   (b)    <-classified as
  ----  ----
   358   122    (a): class 1
    98   422    (b): class 2
Attribute usage:
100.00% user_hour
28.30% region
27.30% dma
24.30% city
17.60% user_day
15.40% size
12.70% placement
9.10% user_group
7.90% browser
6.50% os_extended
4.70% publisher
4.40% position
3.70% domain
3.00% seller_memeber_id
I also recoded the dependent variable as 1 and 2, just in case the string with the ranges was giving it a problem, but that didn't seem to matter at all (it is, however, why the output above shows predictions for class 1 and class 2).
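If you would rather keep the original values than retype them, one workaround (a sketch with invented levels, since the original region values aren't shown) is to strip the colons out of the factor levels with gsub():

```r
# Invented example levels containing colons, which C5.0 reportedly
# chokes on; replace them with underscores in the levels themselves
region <- factor(c("US:AL", "US:AR", "US:AZ"))
levels(region) <- gsub(":", "_", levels(region))
levels(region)
# [1] "US_AL" "US_AR" "US_AZ"
```

Editing the levels rewrites every occurrence at once, so the column's values stay consistent.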
C50 code called exit with value 1 (using a factor decision variable with non-empty values)
You need to clean your data in a few ways.
- Remove the unnecessary columns with only one level. They contain no information and lead to problems.
- Convert the target variable rorIn$Readmit into a factor.
- Separate the target variable from the data set that you supply for the training.
This should work:
rorIn <- read.csv("RoRdataInputData_v1.6.csv", header=TRUE)
rorIn$Readmit <- as.factor(rorIn$Readmit)
library(Hmisc)
singleLevelVars <- names(rorIn)[contents(rorIn)$contents$Levels == 1]
trainvars <- setdiff(colnames(rorIn), c("Readmit", singleLevelVars))
library(C50)
RoRmodel <- C5.0(rorIn[,trainvars], rorIn$Readmit,trials = 10)
predict(RoRmodel, rorIn[,trainvars])
#[1] 1 0 1 0 0 0 1 0
#Levels: 0 1
You can then evaluate accuracy, recall, and other statistics by comparing this predicted result with the actual value of the target variable:
rorIn$Readmit
#[1] 1 0 1 0 1 0 1 0
#Levels: 0 1
The usual way is to set up a confusion matrix to compare actual and predicted values in binary classification problems. In the case of this small data set one can easily see that there is only one false negative result. So the code seems to work pretty well, but this encouraging result can be deceptive due to the very small number of observations.
library(gmodels)
actual <- rorIn$Readmit
predicted <- predict(RoRmodel,rorIn[,trainvars])
CrossTable(actual,predicted, prop.chisq=FALSE,prop.r=FALSE)
# Total Observations in Table: 8
#
#
# | predicted
# actual | 0 | 1 | Row Total |
#--------------|-----------|-----------|-----------|
# 0 | 4 | 0 | 4 |
# | 0.800 | 0.000 | |
# | 0.500 | 0.000 | |
#--------------|-----------|-----------|-----------|
# 1 | 1 | 3 | 4 |
# | 0.200 | 1.000 | |
# | 0.125 | 0.375 | |
#--------------|-----------|-----------|-----------|
# Column Total | 5 | 3 | 8 |
# | 0.625 | 0.375 | |
#--------------|-----------|-----------|-----------|
On a larger data set it would be useful, if not necessary, to separate the set into training data and test data. There is a lot of good literature on machine learning that will help you in fine-tuning the model and its predictions.
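A simple hold-out split can be done in base R. This is a sketch with a toy data frame (the proportions and column names are illustrative, not from the question):

```r
set.seed(42)  # make the split reproducible

# Toy data standing in for a larger data set
df <- data.frame(x = rnorm(100),
                 y = factor(sample(0:1, 100, replace = TRUE)))

# Draw 70% of row indices at random for training
train_idx <- sample(nrow(df), size = floor(0.7 * nrow(df)))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

nrow(train)  # 70
nrow(test)   # 30
```

The model is then fit on train only, and the confusion matrix is computed on test, which gives a far less optimistic (and more honest) error estimate than evaluating on the training data.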
c50 code called exit with value 1 on Mushroom Data set
f <-file("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data", open="r")
data <- read.table(f, sep=",", header=F)
str(data)
pacman::p_load(C50)
C5.model <- C5.0(data[1:10000,c(2:16,18:23)],data[1:10000,1],trials = 3,na.action = na.pass)
Column 17 was the cause of this problem, as it had only a single level and therefore no variation to split on.
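Rather than hard-coding which columns to skip, constant columns can be detected programmatically. A base-R sketch (toy data frame, not the mushroom file):

```r
# Toy data: column b is constant, like column 17 in the mushroom data
df <- data.frame(a = c("x", "y", "z"),
                 b = c("p", "p", "p"),
                 c = 1:3)

# TRUE for columns with a single unique value
constant <- vapply(df, function(col) length(unique(col)) == 1, logical(1))
names(df)[constant]
# [1] "b"

# Drop the constant columns before fitting
df_clean <- df[ , !constant, drop = FALSE]
```

This generalizes the fix: any column flagged here would make C5.0 fail the same way, so filtering them all at once is safer than excluding one index by hand.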