Error in R gbm function when cv.folds > 0

I second this solution: the data frame passed to the R function gbm() must not include variables (columns) that are not used in your model.
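
For instance, a minimal sketch of that fix (the data and column names here are made up):

#keep only the response and the predictors that the formula actually uses
library(gbm)
set.seed(1)
df = data.frame(y  = sample(0:1, 100, replace = TRUE),
                x1 = rnorm(100),
                x2 = rnorm(100),
                id = 1:100)                       #extraneous column
fit = gbm(y ~ x1 + x2, distribution = "bernoulli",
          data = df[, c("y", "x1", "x2")],        #drop columns not in the model
          n.trees = 50)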

subscript out of bounds in gbm function

Just a hunch, since I can't see your data, but I believe that error occurs when factor levels that exist in the test set don't exist in the training set.

This can easily happen when a factor variable has a high number of levels, or when one level has very few instances.

Since you're using CV folds, it's possible that on one of the loops the holdout set contains levels that are foreign to the training data.

I'd suggest either:

A) use model.matrix() to one-hot encode your factor variables (see the sketch after this list), or

B) keep trying different seeds until you get a CV split where this error doesn't occur.
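
Here's a minimal sketch of option A with model.matrix() (the toy data is illustrative):

#one-hot encode the factor x2; the "- 1" drops the intercept so every
#level gets its own indicator column
dat = data.frame(x1 = rnorm(6), x2 = factor(c("a", "b", "a", "c", "b", "c")))
X = model.matrix(~ x1 + x2 - 1, data = dat)
head(X)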

EDIT: Yep, with that traceback, your 3rd CV holdout has a factor level in its test set that doesn't exist in the training data. So the predict function sees a foreign value and doesn't know what to do with it.

EDIT 2: Here's a quick example to show what I mean by "factor levels not in the training set":

#Example data with low occurrences of a factor level:

set.seed(222)
data = data.frame(cbind(y = sample(0:1, 10, replace = TRUE),
                        x1 = rnorm(10),
                        x2 = as.factor(sample(0:10, 10, replace = TRUE))))
#cbind() coerces everything to one matrix, so x2 has to be re-declared as a factor
data$x2 = as.factor(data$x2)
data

       y         x1 x2
 [1,]  1 -0.2468959  2
 [2,]  0 -1.2155609  6
 [3,]  0  1.5614051  1
 [4,]  0  0.4273102  5
 [5,]  1 -1.2010235  5
 [6,]  1  1.0524585  8
 [7,]  0 -1.3050636  6
 [8,]  0 -0.6926076  4
 [9,]  1  0.6026489  3
[10,]  0 -0.1977531  7

#CV fold: train a model on 80% of the data, then test it against the remaining 20%.
#This is a simpler version of what happens when gbm runs its CV folds.

CV_train_rows = sample(1:10, 8, replace = FALSE)
CV_test_rows = setdiff(1:10, CV_train_rows)
CV_train = data[CV_train_rows, ]
CV_test = data[CV_test_rows, ]

#build a model on the training set...

CV_model = lm(y ~ ., data = CV_train)
summary(CV_model)
#note: while being fit, the model only saw factor levels (3, 4, 5, 6, 7, 8) for x2

CV_test$x2
#in the test set, there are only levels 1 and 2.

#attempt to predict on the test set
predict(CV_model, CV_test)

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor x2 has new levels 1, 2
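
One possible workaround sketch, continuing the toy data above: pool the rare levels of x2 into a single "other" level before splitting, so no level is too rare to show up in every fold (the cutoff of 2 is arbitrary):

#collapse levels of x2 that occur fewer than 2 times
counts = table(data$x2)
rare = names(counts)[counts < 2]
data$x2 = factor(ifelse(as.character(data$x2) %in% rare,
                        "other", as.character(data$x2)))
table(data$x2)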

Object p not found when running gbm()

gbm is not meant to deal with the situation where someone sets cv.folds = 1. By definition, k-fold cross validation means splitting the data into k parts, training on k-1 of them and testing on the remaining one, so it's not clear what 1-fold cross validation would even mean. If you look at the code for gbm, at line 437:

if (cv.folds > 1) {
  cv.results <- gbmCrossVal(cv.folds = cv.folds, nTrain = nTrain,
  ....
  p <- cv.results$predictions
}

This makes the predictions, and then when it collects the results into the gbm object, at line 471:

if (cv.folds > 0) {
  gbm.obj$cv.fitted <- p
}

So if cv.folds == 1, p is never calculated, but since cv.folds is still > 0 the assignment is attempted, hence the error.
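
The mismatch is easy to mimic outside of gbm; this stripped-down snippet reproduces the same failure pattern:

#p is assigned only when cv.folds > 1 but read whenever cv.folds > 0,
#so cv.folds == 1 slips through the first guard and the second one fails
cv.folds = 1
if (cv.folds > 1) p = "cv predictions"
if (cv.folds > 0) print(p)
#Error in print(p) : object 'p' not found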

Below is a reproducible example:

library(gbm)
library(MASS)
test = Pima.tr
test$type = as.numeric(test$type) - 1

model_output <- gbm(type ~ .,
                    distribution = "bernoulli",
                    var.monotone = rep(0, 7),
                    data = test,
                    train.fraction = 0.5,
                    n.cores = 1,
                    n.trees = 30,
                    cv.folds = 1,
                    keep.data = TRUE,
                    verbose = TRUE)

This gives me the error object 'p' not found.

Set it to cv.folds = 2 and it runs smoothly:

model_output <- gbm(type ~ .,
                    distribution = "bernoulli",
                    var.monotone = rep(0, 7),
                    data = test,
                    train.fraction = 0.5,
                    n.cores = 1,
                    n.trees = 30,
                    cv.folds = 2,
                    keep.data = TRUE,
                    verbose = TRUE)

Caret and GBM Errors

There were two issues: passing cv.folds to train caused a problem, and you don't need to convert the outcome to a binary number, since doing so makes train think it is a regression problem. The idea behind the train function is to smooth out the inconsistencies between modeling functions, so use factors for classification and numbers for regression.
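
A sketch of the corrected call, reusing the Pima.tr data from above (the trainControl settings are just an example): the outcome stays a factor and cv.folds is not passed, since train does its own resampling.

library(caret)
library(MASS)
set.seed(1)
fit = train(type ~ ., data = Pima.tr,
            method = "gbm",
            trControl = trainControl(method = "cv", number = 5),
            verbose = FALSE)   #forwarded to gbm; no cv.folds here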

GBM error in classification bernoulli distribution

If you check the gbm documentation for the distribution argument:

distribution: Either a character string specifying the name of the
distribution to use or a list with a component ‘name’ specifying the
distribution and any additional parameters needed. If not specified,
‘gbm’ will try to guess: if the response has only 2 unique values,
bernoulli is assumed; otherwise, if the response is a factor,
multinomial is assumed.

If you only have two classes, you don't need to convert the response into a factor; in fact, with distribution = "bernoulli", gbm wants a numeric 0/1 response. We can explore this with the iris data, where I create a 0/1 group label:

library(gbm)
df = iris
df$Group = factor(as.numeric(df$Species == "versicolor"))
df$Species = NULL

mod_gbm <- gbm(Group ~ ., distribution = "bernoulli", data = df, cv.folds = 5)
Error in res[flag, ] <- predictions : replacement has length zero

I get the same error. So we convert it to numeric 0/1, and you can see that it works correctly.

When the variable is a factor, as.numeric() converts it to the level codes 1, 2, ..., with 1 corresponding to the first level. So in this case, since Group's levels are 0 and 1 to start with, subtracting 1 restores the original 0/1 coding:

df$Group = as.numeric(df$Group) - 1
mod_gbm <- gbm(Group ~ ., distribution = "bernoulli", data = df, cv.folds = 5)

And we get the predictions:

pred = ifelse(predict(mod_gbm, type = "response") > 0.5, 1, 0)
table(pred, df$Group)


pred  0  1
   0 98  3
   1  2 47
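
As a quick follow-up, the overall accuracy implied by that table is (98 + 47) / 150, about 97%, which you can confirm with mean(pred == df$Group).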

