How to apply predict to xgboost cross validation
K-fold CV doesn't make the model more accurate per se. In your example with xgb, there are many hyperparameters (e.g. subsample, eta) to be specified. To get a sense of how the chosen parameters perform on unseen data, we use k-fold CV to partition the data into several training and test samples and measure out-of-sample accuracy.
We usually do this for several candidate values of a parameter and pick whichever gives the lowest average error. After this you would refit your model with those parameters. This post and its answers discuss it.
For example, below we run something like what you did, and we get the train / test error for only one set of values:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=500,class_sep=0.7)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33, random_state=42)
data_dmatrix = xgb.DMatrix(data=X_train,label=y_train)
params = {'objective':'binary:logistic','eval_metric':'logloss',
          'eta':0.01,
          'subsample':0.1}
xgb_cv = xgb.cv(dtrain=data_dmatrix, params=params, nfold=5, metrics = 'logloss',seed=42)
train-logloss-mean train-logloss-std test-logloss-mean test-logloss-std
0 0.689600 0.000517 0.689820 0.001009
1 0.686462 0.001612 0.687151 0.002089
2 0.683626 0.001438 0.684667 0.003009
3 0.680450 0.001100 0.681929 0.003604
4 0.678269 0.001399 0.680310 0.002781
5 0.675170 0.001867 0.677254 0.003086
6 0.672349 0.002483 0.674432 0.004349
7 0.668964 0.002484 0.671493 0.004579
8 0.666361 0.002831 0.668978 0.004200
9 0.663682 0.003881 0.666744 0.003598
The last row is the result from the last boosting round, which is what we use for evaluation.
If we test over multiple values of eta and subsample, for example:
grid = pd.DataFrame({'eta':[0.01,0.05,0.1]*2,
'subsample':np.repeat([0.1,0.3],3)})
eta subsample
0 0.01 0.1
1 0.05 0.1
2 0.10 0.1
3 0.01 0.3
4 0.05 0.3
5 0.10 0.3
Normally we can use GridSearchCV for this, but below is something that uses xgb.cv:
def fit(x):
    # x is one row of grid; run 5-fold CV with that row's eta and subsample
    params = {'objective':'binary:logistic',
              'eval_metric':'logloss',
              'eta':x['eta'],
              'subsample':x['subsample']}
    xgb_cv = xgb.cv(dtrain=data_dmatrix, params=params,
                    nfold=5, metrics = 'logloss',seed=42)
    # keep only the last boosting round: train/test logloss mean and std
    return xgb_cv.iloc[-1].values
grid[['train-logloss-mean','train-logloss-std',
      'test-logloss-mean','test-logloss-std']] = grid.apply(fit,axis=1,result_type='expand')
eta subsample train-logloss-mean train-logloss-std test-logloss-mean test-logloss-std
0 0.01 0.1 0.663682 0.003881 0.666744 0.003598
1 0.05 0.1 0.570629 0.012555 0.580309 0.023561
2 0.10 0.1 0.503440 0.017761 0.526891 0.031659
3 0.01 0.3 0.646587 0.002063 0.653741 0.004201
4 0.05 0.3 0.512229 0.008013 0.545113 0.018700
5 0.10 0.3 0.414103 0.012427 0.472379 0.032606
We can see that eta = 0.10 and subsample = 0.3 give the lowest test log-loss (0.472), so next you just need to refit the model with these parameters:
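# binary:logistic is a classification objective, so use the classifier wrapper
xgb_clf = xgb.XGBClassifier(objective='binary:logistic',
                            eval_metric = 'logloss',
                            learning_rate = 0.1,   # eta
                            subsample = 0.3,
                            n_estimators = 10)     # same number of rounds as xgb.cv used above
xgb_clf.fit(X_train, y_train)
y_pred = xgb_clf.predict(X_test)   # predicted class labels for the held-out data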
Get out-of-fold predictions from xgboost.cv in python
I'm not sure if this is what you want, but you can accomplish this by using the sklearn wrapper for xgboost together with cross_val_predict (I know I'm treating the iris dataset as a regression problem, which it isn't, but this is just for illustration):
import xgboost as xgb
from sklearn.model_selection import cross_val_predict as cvp  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn import datasets
X = datasets.load_iris().data[:, :2]
y = datasets.load_iris().target
xgb_model = xgb.XGBRegressor()
y_pred = cvp(xgb_model, X, y, cv=3, n_jobs = 1)
y_pred
array([ 9.07209516e-01, 1.84738374e+00, 1.78878939e+00,
1.83672094e+00, 9.07209516e-01, 9.07209516e-01,
1.77482617e+00, 9.07209516e-01, 1.75681138e+00,
1.83672094e+00, 9.07209516e-01, 1.77482617e+00,
1.84738374e+00, 1.84738374e+00, 1.12216723e+00,
9.96944368e-01, 9.07209516e-01, 9.07209516e-01,
9.96944368e-01, 9.07209516e-01, 9.07209516e-01,
9.07209516e-01, 1.77482617e+00, 8.35850239e-01,
1.77482617e+00, 9.87186074e-01, 9.07209516e-01,
9.07209516e-01, 9.07209516e-01, 1.78878939e+00,
1.83672094e+00, 9.07209516e-01, 9.07209516e-01,
8.91427517e-01, 1.83672094e+00, 9.09049034e-01,
8.91427517e-01, 1.83672094e+00, 1.84738374e+00,
9.07209516e-01, 9.07209516e-01, 1.01038718e+00,
1.78878939e+00, 9.07209516e-01, 9.07209516e-01,
1.84738374e+00, 9.07209516e-01, 1.78878939e+00,
9.07209516e-01, 8.35850239e-01, 1.99947178e+00,
1.99947178e+00, 1.99947178e+00, 1.94922602e+00,
1.99975276e+00, 1.91500926e+00, 1.99947178e+00,
1.97454870e+00, 1.99947178e+00, 1.56287444e+00,
1.96453893e+00, 1.99947178e+00, 1.99715066e+00,
1.99947178e+00, 2.84575284e-01, 1.99947178e+00,
2.84575284e-01, 2.00303388e+00, 1.99715066e+00,
2.04597521e+00, 1.99947178e+00, 1.99975276e+00,
2.00527954e+00, 1.99975276e+00, 1.99947178e+00,
1.99947178e+00, 1.99975276e+00, 1.99947178e+00,
1.99947178e+00, 1.91500926e+00, 1.95735490e+00,
1.95735490e+00, 2.00303388e+00, 1.99975276e+00,
5.92201948e-04, 1.99947178e+00, 1.99947178e+00,
1.99715066e+00, 2.84575284e-01, 1.95735490e+00,
1.89267385e+00, 1.99947178e+00, 2.00303388e+00,
1.96453893e+00, 1.98232651e+00, 2.39597082e-01,
2.39597082e-01, 1.99947178e+00, 1.97454870e+00,
1.91500926e+00, 9.99531507e-01, 1.00023842e+00,
1.00023842e+00, 1.00023842e+00, 1.00023842e+00,
1.00023842e+00, 9.22234297e-01, 1.00023842e+00,
1.00100708e+00, 1.16144836e-01, 1.00077248e+00,
1.00023842e+00, 1.00023842e+00, 1.00100708e+00,
1.00023842e+00, 1.00077248e+00, 1.00023842e+00,
1.13711983e-01, 1.00023842e+00, 1.00135887e+00,
1.00077248e+00, 1.00023842e+00, 1.00023842e+00,
1.00023842e+00, 9.99531507e-01, 1.00077248e+00,
1.00023842e+00, 1.00023842e+00, 1.00023842e+00,
1.00023842e+00, 1.00023842e+00, 1.13711983e-01,
1.00023842e+00, 1.00023842e+00, 1.00023842e+00,
1.00023842e+00, 9.78098869e-01, 1.00023842e+00,
1.00023842e+00, 1.00023842e+00, 1.00023842e+00,
1.00023842e+00, 1.00023842e+00, 1.00077248e+00,
9.99531507e-01, 1.00023842e+00, 1.00100708e+00,
1.00023842e+00, 9.78098869e-01, 1.00023842e+00], dtype=float32)
xgb.cv only seems to use training data for xfold validation?
The xgboost package allows you to choose whether you want to use the inbuilt cross-validation method or to specify your own cross-validation.
Of course you can do both and see the difference!
If you scroll down the xgb.cv page that you linked to the "Details" section, you will see some brief notes on how to extract information from the completed model.
10-fold cross-validation means that internally the xgboost cv algorithm makes successive splits of your data, 10% for testing and 90% for training, so that every observation is used for testing in turn.
In effect, this builds and evaluates ten different models and presents you with the results.
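To make the 10%/90% splitting concrete, here is a small illustration in Python using scikit-learn's KFold rather than xgb.cv's internal splitter (the dataset size of 200 and the placeholder feature matrix are hypothetical):

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 200                       # hypothetical dataset size
X = np.zeros((n_samples, 1))          # placeholder features, only the row count matters
times_tested = np.zeros(n_samples, dtype=int)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    times_tested[test_idx] += 1
    print(fold, len(train_idx), len(test_idx))   # 180 train rows, 20 test rows per fold

# Every row lands in the test part of exactly one fold
print(times_tested.min(), times_tested.max())    # 1 1
```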
You can adjust various hyperparameters to improve your model, either manually or through, say, a grid search.
If you want to do your own data split rather than use the inbuilt cross-validation method then use the "vanilla" form of the algorithm:
model <- xgboost(data = ......etc) # in R
An advantage, I think, of the xgb.cv formulation is that it gives you access to many more hyperparameters to tweak.
The plain xgboost(....) model with your own train/test split, rather than the inbuilt cv version, may be better or even essential in some cases, for example where your data have a time-sensitive structure.
Say you were interested in sales data over the past 10 years: it may be better to take the first nine years of data for training and use the last year as your test set.
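A chronological split like that could be sketched in Python with pandas as follows. The monthly frequency, the date range and the random sales figures are all made-up placeholders:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly sales for 10 years (values are random placeholders)
dates = pd.date_range('2010-01-01', periods=120, freq='MS')
rng = np.random.default_rng(0)
sales = pd.DataFrame({'date': dates, 'sales': rng.normal(100, 10, size=120)})

cutoff = pd.Timestamp('2019-01-01')          # start of the tenth year
train = sales[sales['date'] < cutoff]        # first nine years
test = sales[sales['date'] >= cutoff]        # last year

print(len(train), len(test))                 # 108 12
```

The point of splitting by date rather than at random is that no future observation leaks into the training set.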
What I did was to start with the "vanilla" formulation and build a model with default parameters. This became my baseline model for comparison purposes. Successive models of more complexity could be built and their performances compared to this baseline.