Elastic net

For our purposes here, we want to focus on finding the optimal mix of lambda and our elastic net mixing parameter, alpha. This is done using the following simple three-step process:

  1. Use the expand.grid() function in base R to create a data frame of all of the possible combinations of alpha and lambda that we want to investigate.
  2. Use the trainControl() function from the caret package to determine the resampling method; we'll use 5-fold cross-validation again.
  3. Train a model to select our alpha and lambda parameters by specifying the glmnet method in caret's train() function.

Once we've selected our parameters, we'll apply them to the test data in the same way as we did with ridge regression and LASSO.
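For reference, it helps to write out what the two parameters control. The following is the penalized objective as glmnet parameterizes it for a binomial model. The notation is introduced here for illustration rather than taken from the chapter: N is the number of observations, \ell is the binomial log-likelihood, \beta_0 is the intercept, and \beta is the vector of feature coefficients.

```latex
\min_{\beta_0,\,\beta}\;
  -\frac{1}{N}\,\ell(\beta_0, \beta)
  \;+\; \lambda \left[ \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2
  \;+\; \alpha\,\lVert \beta \rVert_1 \right]
```

With alpha at 0 only the ridge term remains, with alpha at 1 only the LASSO term remains, and lambda scales the penalty as a whole.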

Our grid of combinations should be large enough to capture the best model but not so large that it becomes computationally infeasible. That won't be a problem with a dataset of this size, but keep this in mind for future reference.

The following are the hyperparameter values we'll try:

  • Alpha from 0 to 1 in increments of 0.2; remember that alpha is bounded by 0 and 1
  • Lambda from 0.01 to 0.03 in steps of 0.002
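As a quick check on how large this search is, the two sequences above can be counted in base R. This is a minimal sketch; the variable names are made up for illustration, and the number of folds assumes the 5-fold cross-validation mentioned earlier.

```r
# Candidate values for each hyperparameter (the same sequences listed above)
alpha_values  <- seq(0, 1, by = 0.2)
lambda_values <- seq(0.01, 0.03, by = 0.002)

n_alpha  <- length(alpha_values)   # number of alpha candidates
n_lambda <- length(lambda_values)  # number of lambda candidates

# Every alpha is paired with every lambda
n_models <- n_alpha * n_lambda

# Each combination is scored once per fold under 5-fold cross-validation
n_folds       <- 5
n_evaluations <- n_models * n_folds

cat("alpha candidates: ", n_alpha, "\n")
cat("lambda candidates:", n_lambda, "\n")
cat("combinations:     ", n_models, "\n")
cat("fold evaluations: ", n_evaluations, "\n")
```

The number of fold evaluations is an upper bound on the work involved; caret may need fewer actual model fits, because glmnet can compute a whole sequence of lambda values in one pass.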

You can create this grid of values with the expand.grid() function, which returns a data frame with one row per combination. Naming the columns .alpha and .lambda lets the caret package match them to the corresponding tuning parameters:

> grid <-
    expand.grid(.alpha = seq(0, 1, by = .2),
                .lambda = seq(0.01, 0.03, by = 0.002))

> head(grid)
  .alpha .lambda
1    0.0    0.01
2    0.2    0.01
3    0.4    0.01
4    0.6    0.01
5    0.8    0.01
6    1.0    0.01

There are 66 different models to be built, compared, and selected. The preceding output shows the first six rows of the grid: every alpha value paired with a lambda of 0.01. Now, we set up an object to specify that we want 5-fold cross-validation:

> control <- caret::trainControl(method = 'cv', number = 5)

Training a classification model with caret requires y to be a factor, which we've already taken care of. It also requires the resampling scheme, supplied either inline or, as we just did, through a trainControl object. There are a couple of selection metrics to choose from for a classification problem: Accuracy or Kappa. As we covered in the previous chapter, I think Kappa is preferable in a class imbalance situation. Refer to the previous chapter if you need to refresh your understanding of Kappa. The following is the relevant code:

> set.seed(2222)
> enet <- caret::train(x,
                       y,
                       method = "glmnet",
                       trControl = control,
                       tuneGrid = grid,
                       metric = "Kappa")
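To make the case for Kappa over accuracy concrete, here is a minimal base R sketch. The confusion matrices are invented for illustration (100 observations with a 90/10 class split) and have nothing to do with the chapter's data; accuracy() and cohen_kappa() are hypothetical helper names, not caret functions.

```r
# Rows are predicted classes, columns are actual classes; counts are made up
cm_model <- matrix(c(85, 5,
                      5, 5),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(predicted = c("0", "1"),
                                   actual    = c("0", "1")))

# A "model" that always predicts the majority class
cm_majority <- matrix(c(90, 10,
                         0,  0),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(predicted = c("0", "1"),
                                      actual    = c("0", "1")))

# Share of observations on the diagonal
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

# Cohen's Kappa: agreement beyond what the marginal totals predict by chance
cohen_kappa <- function(cm) {
  n        <- sum(cm)
  observed <- sum(diag(cm)) / n
  expected <- sum(rowSums(cm) * colSums(cm)) / n^2
  (observed - expected) / (1 - expected)
}

cat("model:    accuracy =", accuracy(cm_model),
    " kappa =", round(cohen_kappa(cm_model), 3), "\n")
cat("majority: accuracy =", accuracy(cm_majority),
    " kappa =", round(cohen_kappa(cm_majority), 3), "\n")
```

Both matrices have an accuracy of 0.90, but only the first earns a Kappa above zero (roughly 0.44). The majority-class predictor scores a Kappa of 0 because its agreement is no better than chance.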

To find the best overall model according to Kappa, we call the best-tuned version:

> enet$bestTune
   alpha lambda
23   0.4   0.01

The best model uses an alpha of 0.4 and a lambda of 0.01. To see how these values affect the coefficients (logits), we'll refit the model with glmnet directly, without cross-validation:

> best_enet <- glmnet::glmnet(x,
                              y,
                              alpha = 0.4,
                              lambda = 0.01,
                              family = "binomial")

> coef(best_enet)
17 x 1 sparse Matrix of class "dgCMatrix"
                      s0
(Intercept)  1.310419410
TwoFactor1  -0.933300729
TwoFactor2   0.917877320
Linear1      .
Linear2     -0.689547039
Linear3      0.619432149
Linear4     -0.416603510
Linear5      0.315207408
Linear6      0.002005802
Nonlinear1   0.454620511
Nonlinear2   0.224564104
Nonlinear3   0.343687158
Noise1      -0.009290811
Noise2       .
Noise3       .
Noise4       0.014674805
random1     -0.261039240
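Since these coefficients are on the logit scale, one way to read them is to exponentiate them into odds ratios. The following is a minimal sketch in base R using a few values copied by hand from the preceding output; the vector name is made up, and penalized coefficients are shrunk toward zero, so treat the resulting odds ratios as indicative only.

```r
# A handful of the logit coefficients printed above, copied by hand
logits <- c(TwoFactor1 = -0.933300729,
            TwoFactor2 =  0.917877320,
            Linear2    = -0.689547039,
            Noise2     =  0)  # printed as "." in the sparse output, i.e. exactly zero

# exp() of a logit coefficient is the multiplicative change in the odds of
# class 1 for a one-unit increase in that feature, other features held fixed
odds_ratios <- exp(logits)

print(round(odds_ratios, 3))
```

A negative logit gives an odds ratio below 1, a positive logit gives one above 1, and a feature forced to zero has an odds ratio of exactly 1, meaning it has no effect on the prediction.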

With alpha at 0.4, three features (Linear1, Noise2, and Noise3) are forced to zero. Next, we examine the metrics on the training data:

> enet_pred <- predict(enet, train, type = "prob")

> Metrics::auc(y, enet_pred$`1`)
[1] 0.8684076

> classifierplots::density_plot(y, enet_pred$`1`)

The preceding code produces a density plot of the predicted probabilities for each class label.

The skew in the predicted probabilities seems more pronounced than in the previous models, both for labels of 1 and for labels of 0. The AUC is in line with the other models as well. The proof will lie in predicting the test data:

> enet_test <-
    predict(enet, test, type = "prob")

> Metrics::auc(test$y, enet_test$`1`)
[1] 0.8748963

> Metrics::logLoss(test$y, enet_test$`1`)
[1] 0.3977438
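If you'd like to see what these two numbers measure, both can be computed by hand. The following is a minimal base R sketch on made-up labels and probabilities (not the chapter's data); auc_by_rank() and log_loss() are hypothetical helper names, and the results should agree with Metrics::auc() and Metrics::logLoss() when given the same numeric 0/1 labels.

```r
# Made-up labels and predicted probabilities of class 1
actual    <- c(1, 0, 1, 1, 0, 0)
predicted <- c(0.9, 0.2, 0.6, 0.4, 0.5, 0.1)

# AUC via the rank-sum identity: the share of (positive, negative) pairs
# in which the positive case receives the higher predicted probability
auc_by_rank <- function(actual, predicted) {
  r  <- rank(predicted)        # ties receive average ranks
  n1 <- sum(actual == 1)
  n0 <- sum(actual == 0)
  (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Log-loss: the mean negative log-likelihood of the true labels
log_loss <- function(actual, predicted) {
  -mean(actual * log(predicted) + (1 - actual) * log(1 - predicted))
}

cat("AUC:     ", auc_by_rank(actual, predicted), "\n")
cat("Log-loss:", log_loss(actual, predicted), "\n")
```

In this toy example, 8 of the 9 positive/negative pairs are ordered correctly, so the AUC is 8/9. AUC only looks at the ordering of the predictions, while log-loss also penalizes probabilities that are confidently wrong, which is why it's worth reporting both.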

> classifierplots::density_plot(test$y, enet_test$`1`)

The preceding code produces the same kind of density plot for the test data.

There's a consistent skew in the distributions and a superior AUC and log-loss versus the other two models, so it seems our elastic net version is the best. We can confirm this by looking at the ROC plots of all three models, using a similar technique to evaluate the classifiers visually, as in the previous chapter:

> pred.ridge <- ROCR::prediction(ridge_test$X1, test$y)
> perf.ridge <- ROCR::performance(pred.ridge, "tpr", "fpr")
> ROCR::plot(perf.ridge, main = "ROC", col = 1)

> pred.lasso <- ROCR::prediction(lasso_test$X1, test$y)
> perf.lasso <- ROCR::performance(pred.lasso, "tpr", "fpr")
> ROCR::plot(perf.lasso, col = 2, add = TRUE)

> pred.enet <- ROCR::prediction(enet_test$`1`, test$y)
> perf.enet <- ROCR::performance(pred.enet, "tpr", "fpr")
> ROCR::plot(perf.enet, col = 3, add = TRUE)

> legend(0.6, 0.6, c("Ridge", "LASSO", "ENET"), 1:3)

The preceding code overlays the ROC curves of all three models in a single plot.

I think, as we would expect, the elastic net is just ever so slightly better than the other two. Which model goes into production is a matter for you and your business partners to decide as you balance complexity and performance.