Extreme gradient boosting – classification

As mentioned previously, we'll be using the xgboost package in this section. Given the method's well-earned reputation, let's try it on the Santander data.

As stated in the boosting overview, you can tune a number of parameters:

  • nrounds: This is the maximum number of iterations (number of trees in the final model).
  • colsample_bytree: This is the number of features, expressed as a ratio, to sample when building a tree. The default is 1 (100% of the features).
  • min_child_weight: This is the minimum sum of instance weights required in a child node; a split that would leave a child below this value isn't made. The default is 1.
  • eta: This is the learning rate, which is the contribution of each tree to the solution. The default is 0.3.
  • gamma: This is the minimum loss reduction required to make another leaf partition in a tree. The default is 0.
  • subsample: This is the ratio of data observations to sample when building each tree. The default is 1 (100%).
  • max_depth: This is the maximum depth of the individual trees. The default is 6.

Using the expand.grid() function, we'll build our experimental grid to run through the training process of the caret package. If you don't specify a value for every one of the preceding parameters, even if it's just the default, you'll receive an error message when you execute the function. The following values are based on a number of training iterations I've done previously. I encourage you to try your own tuning values.

Tuning all of these parameters can be a computationally daunting task. For our example, we'll focus on tuning just eta and gamma. Let's build the grid as follows:

> grid = expand.grid(
nrounds = 100,
colsample_bytree = 1,
min_child_weight = 1,
eta = c(0.1, 0.3, 0.5), # 0.3 is the default
gamma = c(0.25, 0.5),
subsample = 1,
max_depth = c(3)
)

This creates a grid of six different models (three values of eta times two values of gamma) that the caret package will run to determine the best tuning parameters. A note of caution is in order. On a dataset of the size that we'll be working with, this process takes only a few minutes. However, with large datasets, or when tuning more parameters with more values per parameter, this can take hours. As such, you must apply your judgment and possibly experiment with smaller samples of the data to identify the tuning parameters if time is of the essence or you're constrained by your hardware.
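
To see how quickly the grid grows, here is a minimal base R sketch. The first grid repeats the values from the text; the second grid, bigger_grid, is a hypothetical example with three values for each of four parameters and is included only to illustrate the growth in the number of candidate models.

```r
# The tuning grid from the text: only eta and gamma vary
grid <- expand.grid(
  nrounds = 100,
  colsample_bytree = 1,
  min_child_weight = 1,
  eta = c(0.1, 0.3, 0.5),
  gamma = c(0.25, 0.5),
  subsample = 1,
  max_depth = 3
)

# One row per candidate model: 3 eta values x 2 gamma values
n_models <- nrow(grid)

# Hypothetical larger grid (values invented for illustration only)
bigger_grid <- expand.grid(
  nrounds = 100,
  colsample_bytree = c(0.5, 0.75, 1),
  min_child_weight = 1,
  eta = c(0.1, 0.3, 0.5),
  gamma = c(0, 0.25, 0.5),
  subsample = 1,
  max_depth = c(2, 3, 4)
)
n_models_bigger <- nrow(bigger_grid)

cat("candidate models in the text's grid:", n_models, "\n")
cat("candidate models in the larger grid:", n_models_bigger, "\n")
```

Each candidate is fit once per resampling fold, so the total number of model fits is the number of grid rows multiplied by the number of folds.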

Before using the train() function from the caret package, I would like to specify the trControl argument by creating an object called cntrl with trainControl(). This object stores the resampling method that we want to use to evaluate the tuning parameters. We'll use five-fold cross-validation, as follows:

> cntrl = caret::trainControl(
+ method = "cv",
+ number = 5,
+ verboseIter = TRUE,
+ returnData = FALSE,
+ returnResamp = "final"
+ )
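
As a rough illustration of what five-fold cross-validation does with the rows, the following base R sketch assigns each row to one of five folds and reports the split sizes. The row count n is a hypothetical value chosen for illustration, and this isn't caret's own fold-creation code; caret's version differs in detail (for example, for a factor outcome it samples within the class levels to keep folds balanced).

```r
# Hypothetical training-set size, chosen only for illustration
n <- 1000
k <- 5

set.seed(123)
# Randomly assign each row to one of k folds of near-equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

for (i in seq_len(k)) {
  holdout <- which(fold_id == i)   # rows held out for evaluation
  training <- which(fold_id != i)  # rows used to fit the model
  cat("fold", i, ": train on", length(training),
      "rows, evaluate on", length(holdout), "rows\n")
}
```

Every row is held out exactly once, so each candidate model is scored on data it didn't see during fitting.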

To use the train() function, specify the training inputs much as we did with the other models: the input feature values, the labels, the train control object, the experimental grid, the method, and the metric. The result is stored in an object named train.xgb. Remember to set the random seed:

> set.seed(123)

> train.xgb = caret::train(
x = x_reduced,
y = y,
trControl = cntrl,
tuneGrid = grid,
method = "xgbTree",
metric = "Kappa"
)

Because I set verboseIter to TRUE in trainControl(), you should have seen each training iteration within each of the k folds.

Calling the object gives us the optimal parameters and the results of each of the parameter settings, as follows (this is abbreviated for simplicity):

> train.xgb
eXtreme Gradient Boosting
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 48653, 48653, 48653, 48652, 48653
Resampling results across tuning parameters:

  eta  gamma  Accuracy   Kappa
  0.1  0.25   0.9604545  0.001525813
  0.1  0.50   0.9604709  0.002323003
  0.3  0.25   0.9604216  0.014214973
  0.3  0.50   0.9604052  0.014215605
  0.5  0.25   0.9600434  0.015513354
  0.5  0.50   0.9599776  0.013964451

Tuning parameter 'nrounds' was held constant at a value of 100
Tuning parameter 'min_child_weight' was held constant at a value of 1
Tuning parameter 'subsample' was held constant at a value of 1
Kappa was used to select the optimal model using the largest value.
The final values used for the model were nrounds = 100, max_depth = 3, eta
= 0.5, gamma = 0.25, colsample_bytree = 1, min_child_weight = 1
and subsample = 1.
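
The table above pairs accuracy near 0.96 with Kappa near zero. The following base R sketch shows how that can happen when one class dominates, since Kappa measures agreement beyond what chance alone would produce. The confusion matrix counts are hypothetical, invented for illustration, and aren't taken from the Santander data.

```r
# Hypothetical confusion matrix (counts invented for illustration):
# rows are actual classes, columns are predicted classes
cm <- matrix(c(958, 39, 2, 1), nrow = 2,
             dimnames = list(actual = c("0", "1"),
                             predicted = c("0", "1")))

n <- sum(cm)

# Observed agreement: the share of correct predictions (accuracy)
p_observed <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
p_expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's Kappa: agreement achieved beyond chance,
# scaled by the maximum possible agreement beyond chance
kappa <- (p_observed - p_expected) / (1 - p_expected)

cat("accuracy:", round(p_observed, 4), "\n")
cat("kappa:   ", round(kappa, 4), "\n")
```

A model that labels nearly everything as the majority class scores high accuracy almost by default, which is why Kappa is the more demanding metric for imbalanced labels.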

The best results are with eta = 0.5 and gamma = 0.25. Now it gets a little tricky, but this is what I've seen as best practice. First, create a list of parameters that will be used by the xgboost training function, xgb.train(). Then, turn the dataframe into a matrix of input features and a vector of numeric labels (0s and 1s). Finally, turn the features and labels into the required input, an xgb.DMatrix. Try this:

> param <- list( objective = "binary:logistic",
booster = "gbtree",
eval_metric = "error",
eta = 0.5,
max_depth = 3,
subsample = 1,
colsample_bytree = 1,
gamma = 0.25
)
> train.mat <- xgboost::xgb.DMatrix(data = x_reduced, label = ynum)

With all of that prepared, just create the model:

> set.seed(1232)

> xgb.fit <- xgboost::xgb.train(params = param, data = train.mat, nrounds = 100)

Before seeing how it does on the test set, let's check the variable importance and plot it. You can examine three items: gain, cover, and frequency. Gain is the improvement in accuracy that a feature brings to the branches it's on. Cover is the relative number of total observations related to the feature. Frequency is the percentage of times that the feature occurs in all of the trees. The following code produces the desired output:

> impMatrix <- xgboost::xgb.importance(feature_names = dimnames(x_reduced)[[2]],
model = xgb.fit)

> xgboost::xgb.plot.importance(impMatrix, main = "Gain by Feature")

The output of the preceding command is as follows:

How does the feature importance compare to random forest? Feature V2 remains the most important, and roughly the top ten are the same. Note that it does very well on the training data:

> pred <- predict(xgb.fit, x_reduced)

> MLmetrics::AUC(pred, y) #.88
[1] 0.8839242

> MLmetrics::LogLoss(pred, ynum) #.12
[1] 0.1209341

Impressed? Well, here is how it performs on the test set, which, like the training data, must be in a matrix:

> test_xgb <- as.matrix(test)

> test_xgb <- test_xgb[, my_forest_vars]

> xgb_test_matrix <- xgboost::xgb.DMatrix(data = test_xgb, label = ytest)

> xgb_pred <- predict(xgb.fit, xgb_test_matrix)

> Metrics::auc(ytest, xgb_pred) #.83
[1] 0.8282241

> MLmetrics::LogLoss(xgb_pred, ytest) #.138
[1] 0.1380904
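
For reference, both metrics are simple to compute by hand. The sketch below uses a small hypothetical set of labels and predicted probabilities (not the model's output). It computes log-loss as the mean negative log-likelihood and AUC from the rank-sum (Mann-Whitney) identity. It's meant as an illustration, not as the exact code inside the MLmetrics or Metrics packages.

```r
# Hypothetical labels and predicted probabilities, for illustration only
y_true <- c(0, 0, 0, 0, 1, 0, 1, 1)
p_hat  <- c(0.05, 0.10, 0.20, 0.40, 0.35, 0.15, 0.80, 0.60)

# Log-loss: mean negative log-likelihood of the true labels
log_loss <- -mean(y_true * log(p_hat) + (1 - y_true) * log(1 - p_hat))

# AUC: the probability that a randomly chosen positive case is scored
# higher than a randomly chosen negative case, computed from ranks
r <- rank(p_hat)  # ties, if any, receive average ranks
n_pos <- sum(y_true == 1)
n_neg <- sum(y_true == 0)
auc <- (sum(r[y_true == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

cat("log-loss:", round(log_loss, 4), "\n")
cat("AUC:     ", round(auc, 4), "\n")
```

Log-loss punishes confident wrong probabilities, while AUC depends only on how the cases are ordered, so the two metrics can disagree about which model is better.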

What happened here is that the model had the lowest bias on the training data, but its performance falls off on the test data. Even so, it still has the highest AUC and lowest log-loss of the models we've tried. As we did with random forest, let's compare the ROC plot with xgboost added:

> ROCR::plot(perf.rf, main = "ROC", col = "black")

> ROCR::plot(perf.earth, col = "red", add = TRUE)

> pred.xgb <- ROCR::prediction(xgb_pred, test$y)

> perf.xgb <- ROCR::performance(pred.xgb, "tpr", "fpr")

> ROCR::plot(perf.xgb, col = "green", add = TRUE)

> legend(x = .75, y = .5,
legend = c("RF", "MARS", "XGB"),
fill = c("black", "red", "green"),
col = c(1,2,3))

The output of the preceding code is as follows:

The xgboost model more or less combines the best of random forest and MARS in performance, and it does so with minimal tuning of hyperparameters. This clearly shows the power of the method and why it has become so popular.

Before we bring this chapter to a close, I want to introduce the powerful method of feature elimination using random forest techniques.