Xgboost in R: How Does Xgb.Cv Pass the Optimal Parameters into Xgb.Train

xgboost in R: how does xgb.cv pass the optimal parameters into xgb.train

Looks like you misunderstood xgb.cv, it is not a parameter searching function. It does k-folds cross validation, nothing more.

In your code, it does not change the value of param.

To find best parameters in R's XGBoost, there are some methods. These are 2 methods,

(1) Use mlr package, http://mlr-org.github.io/mlr-tutorial/release/html/

There is a XGBoost + mlr example code in the Kaggle's Prudential challenge,

But that code is for regression, not classification. As far as I know, there is no mlogloss metric yet in mlr package, so you must code the mlogloss measurement from scratch by yourself. CMIIW.

(2) Second method, by manually setting the parameters then repeat, example,

param <- list(objective = "multi:softprob",
      eval_metric = "mlogloss",
      num_class = 12,
      max_depth = 8,
      eta = 0.05,
      gamma = 0.01, 
      subsample = 0.9,
      colsample_bytree = 0.8, 
      min_child_weight = 4,
      max_delta_step = 1
      )
cv.nround = 1000
cv.nfold = 5
mdcv <- xgb.cv(data=dtrain, params = param, nthread=6, 
                nfold=cv.nfold, nrounds=cv.nround,
                verbose = T)

Then, you find the best (minimum) mlogloss,

min_logloss = min(mdcv[, test.mlogloss.mean])
min_logloss_index = which.min(mdcv[, test.mlogloss.mean])

min_logloss is the minimum value of mlogloss, while min_logloss_index is the index (round).

You must repeat the process above several times, each time change the parameters manually (mlr does the repeat for you). Until finally you get best global minimum min_logloss.

Note: You can do it in a loop of 100 or 200 iterations, in which for each iteration you set the parameters value randomly. This way, you must save the best [parameters_list, min_logloss, min_logloss_index] in variables or in a file.

Note: better to set random seed by set.seed() for reproducible result. Different random seed yields different result. So, you must save [parameters_list, min_logloss, min_logloss_index, seednumber] in the variables or file.

Say that finally you get 3 results in 3 iterations/repeats:

min_logloss = 2.1457, min_logloss_index = 840
min_logloss = 2.2293, min_logloss_index = 920
min_logloss = 1.9745, min_logloss_index = 780

Then you must use the third parameters (it has global minimum min_logloss of 1.9745). Your best index (nrounds) is 780.

Once you get best parameters, use it in the training,

# best_param is global best param with minimum min_logloss
# best_min_logloss_index is the global minimum logloss index
nround = 780
md <- xgb.train(data=dtrain, params=best_param, nrounds=nround, nthread=6)

I don't think you need watchlist in the training, because you have done the cross validation. But if you still want to use watchlist, it is just okay.

Even better you can use early stopping in xgb.cv.

mdcv <- xgb.cv(data=dtrain, params=param, nthread=6, 
                nfold=cv.nfold, nrounds=cv.nround,
                verbose = T, early.stop.round=8, maximize=FALSE)

With this code, when mlogloss value is not decreasing in 8 steps, the xgb.cv will stop. You can save time. You must set maximize to FALSE, because you expect minimum mlogloss.

Here is an example code, with 100 iterations loop, and random chosen parameters.

best_param = list()
best_seednumber = 1234
best_logloss = Inf
best_logloss_index = 0

for (iter in 1:100) {
    param <- list(objective = "multi:softprob",
          eval_metric = "mlogloss",
          num_class = 12,
          max_depth = sample(6:10, 1),
          eta = runif(1, .01, .3),
          gamma = runif(1, 0.0, 0.2), 
          subsample = runif(1, .6, .9),
          colsample_bytree = runif(1, .5, .8), 
          min_child_weight = sample(1:40, 1),
          max_delta_step = sample(1:10, 1)
          )
    cv.nround = 1000
    cv.nfold = 5
    seed.number = sample.int(10000, 1)[[1]]
    set.seed(seed.number)
    mdcv <- xgb.cv(data=dtrain, params = param, nthread=6, 
                    nfold=cv.nfold, nrounds=cv.nround,
                    verbose = T, early.stop.round=8, maximize=FALSE)

    min_logloss = min(mdcv[, test.mlogloss.mean])
    min_logloss_index = which.min(mdcv[, test.mlogloss.mean])

    if (min_logloss < best_logloss) {
        best_logloss = min_logloss
        best_logloss_index = min_logloss_index
        best_seednumber = seed.number
        best_param = param
    }
}

nround = best_logloss_index
set.seed(best_seednumber)
md <- xgb.train(data=dtrain, params=best_param, nrounds=nround, nthread=6)

With this code, you run cross validation 100 times, each time with random parameters. Then you get best parameter set, that is in the iteration with minimum min_logloss.

Increase the value of early.stop.round in case you find out that it's too small (too early stopping). You need also to change the random parameter values' limit based on your data characteristics.

And, for 100 or 200 iterations, I think you want to change verbose to FALSE.

Side note: That is example of random method, you can adjust it e.g. by Bayesian optimization for better method. If you have Python version of XGBoost, there is a good hyperparameter script for XGBoost, https://github.com/mpearmain/BayesBoost to search for best parameters set using Bayesian optimization.

Edit: I want to add 3rd manual method, posted by "Davut Polat" a Kaggle master, in the Kaggle forum.

Edit: If you know Python and sklearn, you can also use GridSearchCV along with xgboost.XGBClassifier or xgboost.XGBRegressor

understanding python xgboost cv

Cross-validation is used for estimating the performance of one set of parameters on unseen data.

Grid-search evaluates a model with varying parameters to find the best possible combination of these.

The sklearn docs talks a lot about CV, and they can be used in combination, but they each have very different purposes.

You might be able to fit xgboost into sklearn's gridsearch functionality. Check out the sklearn interface to xgboost for the most smooth application.

How to parallelize an xgboost fit?

As noted in the comments by HenrikB xgb.DMatrix objects can't be used in parallelization. To get around this we can make the object inside of foreach:

#### Load packages
library(xgboost)
library(parallel)
library(foreach)
library(doParallel)
#> Loading required package: iterators

data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')

#### Init parallel & run
numCores = detectCores()
cl = parallel::makeCluster(numCores, setup_strategy = "sequential")
doParallel::registerDoParallel(cl)
  
  
  
  
pred = foreach(i = 1:10, .packages = c("xgboost")) %dopar% {
    # BRING CREATION OF XGB MATRIX INSIDE OF foreach
    dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
    dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
    
    watchlist = list(dtrain = dtrain, dtest = dtest)
    
    param <- list(max_depth = i, eta = 0.01, verbose = 0,
                  objective = "binary:logistic", eval_metric = "auc")
    bst <- xgb.train(param, dtrain, nrounds = 100, watchlist, early_stopping_rounds = 10)
    bst$best_score
    } 

stopCluster(cl) 
pred
#> [[1]]
#> dtest-auc 
#>  0.892138 
#> 
#> [[2]]
#> dtest-auc 
#>  0.987974 
#> 
#> [[3]]
#> dtest-auc 
#>  0.986255 
#> 
#> [[4]]
#> dtest-auc 
#>         1 
#>  ...

Benchmarking:

Since xgboost.train is already parellalized, it might be interesting to see the difference in speeds between when threads are used for xgboost vs when used for the parallel running of tuning rounds.

To do this I wrapped in a function and benchmarked the different combinations:


tune_par <- function(xgbthread, doparthread) {
  
  data(agaricus.train, package='xgboost')
  data(agaricus.test, package='xgboost')
  
  #### Init parallel & run
  cl = parallel::makeCluster(doparthread, setup_strategy = "sequential")
  doParallel::registerDoParallel(cl)
  
  clusterEvalQ(cl, {
    data(agaricus.train, package='xgboost')
    data(agaricus.test, package='xgboost')
  })
  
  
  
  pred = foreach(i = 1:10, .packages = c("xgboost")) %dopar% {
    dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
    dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
    
    watchlist = list(dtrain = dtrain, dtest = dtest)
    
    param <- list(max_depth = i, eta = 0.01, verbose = 0, nthread = xgbthread,
                  objective = "binary:logistic", eval_metric = "auc")
    bst <- xgb.train(param, dtrain, nrounds = 100, watchlist, early_stopping_rounds = 10)
    bst$best_score
  } 
  
  stopCluster(cl) 
  
  pred
  
}

In my testing evaluation was faster when using more threads for xgboost and less for the parallel running of tuning rounds. What works best probably depends on system specs and the amount of data.

# 16 logical cores split between xgb threads and threads in dopar cluster:
microbenchmark::microbenchmark(
  xgb16par1 = tune_par(xgbthread = 16, doparthread = 1),
  xgb8par2 = tune_par(xgbthread = 8, doparthread = 2),
  xgb4par4 = tune_par(xgbthread = 4,doparthread = 4),
  xgb2par8 = tune_par(xgbthread = 2, doparthread = 8),
  xgb1par16 = tune_par(xgbthread = 1,doparthread = 16),
  times = 5
)
#> Unit: seconds
#>       expr      min       lq     mean   median       uq      max neval  cld
#>  xgb16par1 2.295529 2.431110 2.500170 2.519277 2.527914 2.727021     5 a   
#>   xgb8par2 2.301189 2.308377 2.407767 2.363422 2.465446 2.600402     5 a   
#>   xgb4par4 2.632711 2.778304 2.875816 2.825471 2.849003 3.293593     5  b  
#>   xgb2par8 4.508485 4.682284 4.752776 4.810461 4.822566 4.940085     5   c 
#>  xgb1par16 8.493378 8.550609 8.679931 8.768008 8.779718 8.807943     5    d

Xgboost in R: How Does Xgb.Cv Pass the Optimal Parameters into Xgb.Train