How to Use Custom Cross Validation Folds with Xgboost

How to apply predict to xgboost cross validation

K-fold CV doesn't make the model more accurate per se. In your example with xgb, there are many hyperparameters (e.g. subsample, eta) to specify. To get a sense of how the chosen values perform on unseen data, we use k-fold CV to partition the data into several training and test samples and measure out-of-sample accuracy.

We usually repeat this for several candidate values of a parameter and keep whichever gives the lowest average error. After that, you refit your model on the training data with the chosen parameters. This post and its answers discuss it.

For example, below we run something like what you did, and we get the train / test error for only one set of values:

import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, class_sep=0.7)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33, random_state=42)

data_dmatrix = xgb.DMatrix(data=X_train, label=y_train)
params = {'objective': 'binary:logistic', 'eval_metric': 'logloss',
          'eta': 0.01,
          'subsample': 0.1}
xgb_cv = xgb.cv(dtrain=data_dmatrix, params=params, nfold=5, metrics='logloss', seed=42)

train-logloss-mean train-logloss-std test-logloss-mean test-logloss-std
0 0.689600 0.000517 0.689820 0.001009
1 0.686462 0.001612 0.687151 0.002089
2 0.683626 0.001438 0.684667 0.003009
3 0.680450 0.001100 0.681929 0.003604
4 0.678269 0.001399 0.680310 0.002781
5 0.675170 0.001867 0.677254 0.003086
6 0.672349 0.002483 0.674432 0.004349
7 0.668964 0.002484 0.671493 0.004579
8 0.666361 0.002831 0.668978 0.004200
9 0.663682 0.003881 0.666744 0.003598

The last row is the result from the final boosting round, which is what we use for evaluation.

If we test over multiple values of eta and subsample, for example:

grid = pd.DataFrame({'eta': [0.01, 0.05, 0.1] * 2,
                     'subsample': np.repeat([0.1, 0.3], 3)})

eta subsample
0 0.01 0.1
1 0.05 0.1
2 0.10 0.1
3 0.01 0.3
4 0.05 0.3
5 0.10 0.3

Normally we can use GridSearchCV for this, but below is something that uses xgb.cv:

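def fit(x):
    # x is one row of the grid: an (eta, subsample) pair
    params = {'objective': 'binary:logistic',
              'eval_metric': 'logloss',
              'eta': x['eta'],
              'subsample': x['subsample']}
    xgb_cv = xgb.cv(dtrain=data_dmatrix, params=params,
                    nfold=5, metrics='logloss', seed=42)
    # keep only the final boosting round (mean/std of train and test logloss)
    return xgb_cv.iloc[-1].values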

grid[['train-logloss-mean','train-logloss-std',
'test-logloss-mean','test-logloss-std']] = grid.apply(fit,axis=1,result_type='expand')

eta subsample train-logloss-mean train-logloss-std test-logloss-mean test-logloss-std
0 0.01 0.1 0.663682 0.003881 0.666744 0.003598
1 0.05 0.1 0.570629 0.012555 0.580309 0.023561
2 0.10 0.1 0.503440 0.017761 0.526891 0.031659
3 0.01 0.3 0.646587 0.002063 0.653741 0.004201
4 0.05 0.3 0.512229 0.008013 0.545113 0.018700
5 0.10 0.3 0.414103 0.012427 0.472379 0.032606
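
Rather than reading the winner off the table by eye, the best row can be picked programmatically. A minimal sketch, assuming the grid looks like the table above; the values are re-typed here so the snippet runs on its own, and best_params is just a name introduced for illustration:

```python
import pandas as pd

# Cross-validation summary copied from the table above
# (final-round mean test logloss for each parameter pair).
grid = pd.DataFrame({
    'eta':               [0.01, 0.05, 0.10, 0.01, 0.05, 0.10],
    'subsample':         [0.1, 0.1, 0.1, 0.3, 0.3, 0.3],
    'test-logloss-mean': [0.666744, 0.580309, 0.526891,
                          0.653741, 0.545113, 0.472379],
})

# idxmin returns the index label of the smallest value in the column.
best = grid.loc[grid['test-logloss-mean'].idxmin()]
best_params = {'eta': float(best['eta']),
               'subsample': float(best['subsample'])}
print(best_params)
```

Selecting on the test column rather than the train column matters here: train logloss keeps dropping as the model fits harder, so it says little about unseen data.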

We can see that eta = 0.10 and subsample = 0.3 give the lowest test logloss, so next you just need to refit the model with these parameters:

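# binary:logistic is a classification objective, so use the classifier wrapper
xgb_clf = xgb.XGBClassifier(objective='binary:logistic',
                            eval_metric='logloss',
                            learning_rate=0.1,  # sklearn-wrapper name for eta
                            subsample=0.3,
                            n_estimators=10)    # same 10 rounds xgb.cv ran above

xgb_clf.fit(X_train, y_train)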

Get out-of-fold predictions from xgboost.cv in python

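I'm not sure if this is what you want, but you can accomplish this with the sklearn wrapper for xgboost. (I know I'm treating the iris dataset as a regression problem, which it isn't, but this is for illustration.) Note that cv=3 with a regressor uses consecutive, unshuffled folds, and iris is stored sorted by class, so each held-out fold below is a class the model never trained on; pass a shuffled KFold if you need meaningful predictions.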

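import xgboost as xgb
# sklearn.cross_validation was removed; cross_val_predict lives in model_selection
from sklearn.model_selection import cross_val_predict as cvp
from sklearn import datasets

X = datasets.load_iris().data[:, :2]
y = datasets.load_iris().target
xgb_model = xgb.XGBRegressor()
# each sample is predicted by a model trained on the other folds
y_pred = cvp(xgb_model, X, y, cv=3, n_jobs=1)
y_pred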

array([ 9.07209516e-01, 1.84738374e+00, 1.78878939e+00,
1.83672094e+00, 9.07209516e-01, 9.07209516e-01,
1.77482617e+00, 9.07209516e-01, 1.75681138e+00,
1.83672094e+00, 9.07209516e-01, 1.77482617e+00,
1.84738374e+00, 1.84738374e+00, 1.12216723e+00,
9.96944368e-01, 9.07209516e-01, 9.07209516e-01,
9.96944368e-01, 9.07209516e-01, 9.07209516e-01,
9.07209516e-01, 1.77482617e+00, 8.35850239e-01,
1.77482617e+00, 9.87186074e-01, 9.07209516e-01,
9.07209516e-01, 9.07209516e-01, 1.78878939e+00,
1.83672094e+00, 9.07209516e-01, 9.07209516e-01,
8.91427517e-01, 1.83672094e+00, 9.09049034e-01,
8.91427517e-01, 1.83672094e+00, 1.84738374e+00,
9.07209516e-01, 9.07209516e-01, 1.01038718e+00,
1.78878939e+00, 9.07209516e-01, 9.07209516e-01,
1.84738374e+00, 9.07209516e-01, 1.78878939e+00,
9.07209516e-01, 8.35850239e-01, 1.99947178e+00,
1.99947178e+00, 1.99947178e+00, 1.94922602e+00,
1.99975276e+00, 1.91500926e+00, 1.99947178e+00,
1.97454870e+00, 1.99947178e+00, 1.56287444e+00,
1.96453893e+00, 1.99947178e+00, 1.99715066e+00,
1.99947178e+00, 2.84575284e-01, 1.99947178e+00,
2.84575284e-01, 2.00303388e+00, 1.99715066e+00,
2.04597521e+00, 1.99947178e+00, 1.99975276e+00,
2.00527954e+00, 1.99975276e+00, 1.99947178e+00,
1.99947178e+00, 1.99975276e+00, 1.99947178e+00,
1.99947178e+00, 1.91500926e+00, 1.95735490e+00,
1.95735490e+00, 2.00303388e+00, 1.99975276e+00,
5.92201948e-04, 1.99947178e+00, 1.99947178e+00,
1.99715066e+00, 2.84575284e-01, 1.95735490e+00,
1.89267385e+00, 1.99947178e+00, 2.00303388e+00,
1.96453893e+00, 1.98232651e+00, 2.39597082e-01,
2.39597082e-01, 1.99947178e+00, 1.97454870e+00,
1.91500926e+00, 9.99531507e-01, 1.00023842e+00,
1.00023842e+00, 1.00023842e+00, 1.00023842e+00,
1.00023842e+00, 9.22234297e-01, 1.00023842e+00,
1.00100708e+00, 1.16144836e-01, 1.00077248e+00,
1.00023842e+00, 1.00023842e+00, 1.00100708e+00,
1.00023842e+00, 1.00077248e+00, 1.00023842e+00,
1.13711983e-01, 1.00023842e+00, 1.00135887e+00,
1.00077248e+00, 1.00023842e+00, 1.00023842e+00,
1.00023842e+00, 9.99531507e-01, 1.00077248e+00,
1.00023842e+00, 1.00023842e+00, 1.00023842e+00,
1.00023842e+00, 1.00023842e+00, 1.13711983e-01,
1.00023842e+00, 1.00023842e+00, 1.00023842e+00,
1.00023842e+00, 9.78098869e-01, 1.00023842e+00,
1.00023842e+00, 1.00023842e+00, 1.00023842e+00,
1.00023842e+00, 1.00023842e+00, 1.00077248e+00,
9.99531507e-01, 1.00023842e+00, 1.00100708e+00,
1.00023842e+00, 9.78098869e-01, 1.00023842e+00], dtype=float32)

xgb.cv only seems to use training data for xfold validation?

The xgboost package allows you to choose whether you want to use the inbuilt cross-validation method or to specify your own cross-validation.

Of course you can do both and see the difference!

If you scroll down the xgb.cv page that you linked to the "Details" section, you will see some brief notes on how to extract information from the completed model.

With 10-fold cross-validation, xgb.cv internally splits your data ten times, each time holding out a different 10% for testing and training on the remaining 90%, so that every row is used for testing in turn.
In effect this builds and evaluates ten different models and presents you with the results.
You can adjust various hyperparameters to improve your model, either manually or through, say, a grid search.
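
To make the 90/10 pattern concrete, here is a small Python illustration. It uses sklearn's KFold as a stand-in to show the splitting pattern, not xgb.cv's own internals, and the 200-row dataset is made up.

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 200                                  # hypothetical dataset size
X = np.arange(n_rows).reshape(-1, 1)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
times_tested = np.zeros(n_rows, dtype=int)
fold_sizes = []

for train_idx, test_idx in kf.split(X):
    # 90% of the rows train the model, the other 10% evaluate it
    fold_sizes.append((len(train_idx), len(test_idx)))
    times_tested[test_idx] += 1

print(fold_sizes[0])                          # (180, 20)
print(times_tested.min(), times_tested.max()) # 1 1: each row tested exactly once
```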

If you want to do your own data split rather than use the inbuilt cross-validation method then use the "vanilla" form of the algorithm:

model <- xgboost(data = ......etc) # in R

One advantage of the xgb.cv formulation, I think, is that it gives you access to many more hyperparameters to tweak.

The plain xgboost(....) model with your own train/test split, rather than the inbuilt CV version, may be better or even essential in some cases, for example where your data have a time-sensitive structure.
Say you were interested in sales data over the past 10 years: it may be better to take the first nine years of data for training and use the last year as your test set.

What I did was to start with the "vanilla" formulation and build a model with default parameters. This became my baseline model for comparison purposes. Successive models of more complexity could be built and their performances compared to this baseline.


