Error in R gbm function when cv.folds 0
I second this solution: the data frame passed to the R function gbm() cannot include variables (columns) that are not used in your model.
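A minimal base-R sketch of that fix (the column names here are hypothetical): subset the data frame down to the response and the predictors named in the formula before calling gbm().

```r
# Toy data frame with an extra column (id) that the model won't use
df <- data.frame(id = 1:6,
                 y  = c(0, 1, 0, 1, 1, 0),
                 x1 = c(1.2, 0.5, 3.1, 2.2, 0.9, 1.7))

# Keep only the columns the model will actually use,
# then pass df_model (not df) as the data argument of gbm()
df_model <- df[, c("y", "x1")]
names(df_model)   # "y" "x1"
```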
subscript out of bounds in gbm function
Just a hunch since I can't see your data, but I believe that error occurs when the test set contains factor levels that don't exist in the training set.
This can easily happen when a factor variable has a high number of levels, or when one level has only a few instances.
Since you're using CV folds, it's possible that the holdout set on one of the loops contains levels foreign to the training data.
I'd suggest either:
A) use model.matrix() to one-hot encode your factor variables
B) keep setting different seeds until you get a CV split that doesn't trigger the error.
EDIT: yep, with that traceback, the 3rd CV holdout has a factor level in its test set that doesn't exist in the training data, so the predict function sees a foreign value and doesn't know what to do.
EDIT 2: here's a quick example of what I mean by "factor levels in the test set that aren't in the training set":
#Example data with low occurrences of a factor level:
set.seed(222)
data <- data.frame(cbind(y  = sample(0:1, 10, replace = TRUE),
                         x1 = rnorm(10),
                         x2 = as.factor(sample(0:10, 10, replace = TRUE))))
#cbind() coerces the factor to its integer codes, so re-factor x2:
data$x2 <- as.factor(data$x2)
data
y x1 x2
[1,] 1 -0.2468959 2
[2,] 0 -1.2155609 6
[3,] 0 1.5614051 1
[4,] 0 0.4273102 5
[5,] 1 -1.2010235 5
[6,] 1 1.0524585 8
[7,] 0 -1.3050636 6
[8,] 0 -0.6926076 4
[9,] 1 0.6026489 3
[10,] 0 -0.1977531 7
#CV fold: train a model on 80% of the data, then test on the remaining 20%.
#This is a simpler version of what happens inside gbm's CV loop.
CV_train_rows <- sample(1:10, 8, replace = FALSE)
CV_test_rows  <- setdiff(1:10, CV_train_rows)
CV_train <- data[CV_train_rows, ]
CV_test  <- data[CV_test_rows, ]
#build a model on the training data...
CV_model = lm(y ~ ., data = CV_train)
summary(CV_model)
#note here: while being built, the model only saw factor levels (3, 4, 5, 6, 7, 8) for variable x2
CV_test$x2
#in the test set, there are only levels 1 and 2.
#attempt to predict on the test set
predict(CV_model, CV_test)
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor x2 has new levels 1, 2
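As option A suggests, one-hot encoding the factor with model.matrix() sidesteps this: every level becomes its own 0/1 column, so no split can produce an unseen level. A minimal sketch on a small hypothetical factor:

```r
d <- data.frame(x2 = factor(c(1, 2, 5, 5, 6)))

# One dummy column per level; "+ 0" drops the intercept so all levels appear
mm <- model.matrix(~ x2 + 0, data = d)
colnames(mm)   # "x21" "x22" "x25" "x26"
```

These numeric columns can then be cbind-ed to the rest of the predictors in place of the original factor.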
Object p not found when running gbm()
gbm isn't meant to handle the situation where someone sets cv.folds = 1. By definition, k-fold CV splits the data into k parts, training on k - 1 parts and testing on the remaining one, so "1-fold cross-validation" is not well defined. If you look at the code for gbm, at line 437:
if(cv.folds > 1) {
cv.results <- gbmCrossVal(cv.folds = cv.folds, nTrain = nTrain,
....
p <- cv.results$predictions
}
That is where the predictions are made. Then, when gbm collects the results into the returned object, at line 471:
if (cv.folds > 0) {
gbm.obj$cv.fitted <- p
}
So if cv.folds == 1, p is never created, but since 1 > 0 the second branch still runs, hence the error.
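The bug reduces to this pattern in base R (k is a stand-in for cv.folds; the tryCatch is only there to capture the error message):

```r
k <- 1                        # stand-in for cv.folds
if (k > 1) p <- "cv preds"    # the branch that creates p never runs
out <- tryCatch(if (k > 0) p, # ...but this branch still tries to use p
                error = function(e) conditionMessage(e))
out   # "object 'p' not found"
```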
Below is a reproducible example:
library(MASS)
test = Pima.tr
test$type = as.numeric(test$type)-1
model_output <- gbm(type~ . ,
distribution = "bernoulli",
var.monotone = rep(0,7),
data = test,
train.fraction = 0.5,
n.cores = 1,
n.trees = 30,
cv.folds = 1,
keep.data = TRUE,
verbose=TRUE)
gives me the error: object 'p' not found.
Set cv.folds = 2 and it runs smoothly:
model_output <- gbm(type~ . ,
distribution = "bernoulli",
var.monotone = rep(0,7),
data = test,
train.fraction = 0.5,
n.cores = 1,
n.trees = 30,
cv.folds = 2,
keep.data = TRUE,
verbose=TRUE)
Caret and GBM Errors
There were two issues: passing cv.folds caused a problem. Also, you don't need to convert the outcome to a binary number; doing so makes train think it is a regression problem. The idea behind the train function is to smooth out inconsistencies among the modeling functions, so use factors for classification and numbers for regression.
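The rule of thumb can be shown without caret itself: train() decides the task from the class of the outcome, so give it a factor for classification and a numeric vector for regression. A minimal sketch (the variable names are hypothetical):

```r
y_class <- factor(c("yes", "no", "yes", "no"))  # classification outcome
y_reg   <- c(0.3, 1.8, 2.2, 0.9)                # regression outcome

is.factor(y_class)   # TRUE  -> caret treats this as classification
is.numeric(y_reg)    # TRUE  -> caret treats this as regression
```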
GBM error in classification bernoulli distribution
If you check out the vignette of gbm:
distribution: Either a character string specifying the name of the
distribution to use or a list with a component ‘name’
specifying the distribution and any additional parameters
needed. If not specified, ‘gbm’ will try to guess: if the
response has only 2 unique values, bernoulli is assumed;
otherwise, if the response is a factor, multinomial is
assumed
If you only have two classes, you don't need to convert the response into a factor. We can explore this with the iris example, where I create a 0/1 group label:
library(gbm)
df = iris
df$Group = factor(as.numeric(df$Species=="versicolor"))
df$Species = NULL
mod_gbm <- gbm(Group~.,distribution ="bernoulli", data=df,cv.folds=5)
Error in res[flag, ] <- predictions : replacement has length zero
I get the same error. So we convert it to numeric 0/1, and you can see it works correctly.
When the variable is a factor, as.numeric() converts it to 1, 2, with 1 corresponding to the first level. So in this case, since Group is 0/1 to start with:
df$Group = as.numeric(df$Group)-1
mod_gbm <- gbm(Group~.,distribution ="bernoulli", data=df,cv.folds=5)
And we get the predictions:
pred = ifelse(predict(mod_gbm,type="response")>0.5,1,0)
table(pred,df$Group)
pred 0 1
0 98 3
1 2 47
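The as.numeric() behavior mentioned above is easy to verify in base R: a factor's internal codes always start at 1, so subtracting 1 recovers a 0/1 label.

```r
g <- factor(c(0, 1, 1, 0))
as.numeric(g)        # 1 2 2 1  (level codes, not the original labels)
as.numeric(g) - 1    # 0 1 1 0  (back to the original 0/1 labels)
```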