Random forest

The customer satisfaction data was covered in Chapter 3, Logistic Regression. The GitHub links to the CSV and an RData file are as follows:

I'll show you how to load the RData file:

> santander <- readRDS("santander_prepd.RData")

The data has an unbalanced response:

> table(santander$y)

0 1
73012 3008

We'll split the train and test sets using the same random seed as in Chapter 3, Logistic Regression:

> set.seed(1966)

> trainIndex <- caret::createDataPartition(santander$y, p = 0.8, list = FALSE)

> train <- santander[trainIndex, ]

> test <- santander[-trainIndex, ]

With this split, we end up with one zero variance features, which we'll find and remove:

> train_zero <- caret::nearZeroVar(train, saveMetrics = TRUE)

> table(train_zero$zeroVar)

FALSE TRUE
142 1

> train <- train[, train_zero$zeroVar == 'FALSE']

I like to put the predictors in a matrix, and we'll need the response as a factor:

> x <- as.matrix(train[, -142])

> y <- as.factor(train$y)

We're now ready to start training a model. Recall that we have a highly unbalanced response. One of the things I highly recommend in such an instance is to structure your sample size. In fact, this can actually become a key parameter to tune. In the training data, there are only 2,405 ones alongside 58,411 zeros. I'll show an example of forcing the sample at each bagged sample in the algorithm. Again, this will take some trial and error on your part to identify the right ratio of downsampling the majority class to minority class. In the following example, I'm sampling 1,200 of the minority class and 3,600 of the majority class. This was some simple trial and error on my part, so see if you can do better. What this does to your predicted probabilities is skew them towards the minority class—in other words, you have a relative probability. This might not be what the business desires, so you can apply a correction to produce the corrected probability:

The population proportion is the actual, or the estimated proportion of the minority class, and the sample proportion is from your oversample. Predicted probability equates to the model's probability for a given observation.

The other thing I specify here is the number of trees, and 200 for starters is sufficient. In some instances, you may need a thousand or more:

> set.seed(1999)

> forest_fit <- randomForest::randomForest(x = x, y = y,
ntree = 200,
sampsize = c(3600, 1200))

Calling the fitted object gives us the following results:

> forest_fit

Call:
randomForest(x = x, y = y, ntree = 200, sampsize = c(3600, 1200))
Type of random forest: classification
Number of trees: 200
No. of variables tried at each split: 11

OOB estimate of error rate: 9.51%
Confusion matrix:
0 1 class.error
0 53946 4465 0.07644108
1 1321 1084 0.54927235

You can notice that the out-of-bag error rate is under 10%, and it gives us a confusion matrix. My advice is to not pay that much attention to this and just make a note of it. We'll use our relative probabilities to find the best split. Besides, as we've talked about in other chapters, error/accuracy isn't the best metric to judge a model. One thing that's good to look at is the number of trees that minimized the error. That way, you can limit the number of trees in an attempt to avoid overfitting:

> which.min(forest_fit$err.rate[, 1])
[1] 105

> forest_fit$err.rate[105]
[1] 0.0934458

There you have it: only 105 trees are needed to minimize the error versus all 200. In this model, we used all 142 features. That's just computationally inefficient and prone to cause overfitting.

I'll show you what has worked quite well for me on a number of projects to reduce features. Before we go there, here's the standard feature importance plot:

> randomForest::varImpPlot(forest_fit)

The output of the preceding code is as follows:

The feature importance is based on the average decrease in Gini. We can see that there're roughly a dozen or so features driving predictions, and the V2 feature is quite suspicious. I talk about the notorious V2 feature in Chapter 3Logistic Regression so I won't belabor the point here. 

What I'm about to present you may find controversial or unscientific. Well, I have to agree. However, it works. What I do is find the descriptive statistics for feature importance and decide, based on some experimental value or business expertise, where to filter the features. It's best to demonstrate how in our example:

> ff <- data.frame(unlist(forest_fit$importance))

> ff$var <- row.names(ff)

> summary(ff)
MeanDecreaseGini var
Min. : 0.00000 Length:141
1st Qu. : 0.02172 Class :character
Median : 0.36412 Mode :character
Mean : 6.12824
3rd Qu. : 4.02137
Max. :155.86878

In the lack of subject matter expertise or otherwise, we can cut the features based on mean Gini decrease above the third quantile:

> my_forest_vars <- dplyr::filter(ff, MeanDecreaseGini > 4.02)

> my_forest_vars <- my_forest_vars$var

> x_reduced <- x[, my_forest_vars]

> dim(x_reduced)
[1] 60816 36

That gave us just 36 input features. We could reduce it even further, but I'll let you experiment with different results. Now, build the new model with reduced features:

> set.seed(567)

> forest_fit2 <- randomForest::randomForest(x = x_reduced, y = y,
ntree = 110,
sampsize = c(3600, 1200))

> which.min(forest_fit2$err.rate[, 1])
[1] 98

Examine how it does on the training data first:

> rf_prob <- predict(forest_fit, type = "prob")

> y_prob <- rf_prob[, 2]

> classifierplots::density_plot(y, y_prob)

The output of the preceding code is as follows:

Now, pursue identifying the metrics:

> ynum <- as.numeric(ifelse(y == "1", 1, 0))

> MLmetrics::AUC(y_prob, ynum)
[1] 0.8154905

> MLmetrics::LogLoss(y_prob, ynum)
[1] 0.2652151

The AUC seems about what came about what we would expect, but the log-loss is quite a bit worse. Why? Ah, yes, our problem is the relative probabilities. We need to adjust the predicted probabilities, then recalculate log-loss. I'll make it straightforward to do this by taking the formula I discussed previously and putting it into a function:

> corrected_prob <- function(result, population_fraction, sample_fraction){
value <- 1/(1+(1/population_fraction-1) / (1/sample_fraction-1)*(1/result-1))
return(value)
}

Then, we apply the function to the predicted results:

> yprob_corrected <- corrected_prob(result = y_prob,
population_fraction = 0.04,
sample_fraction = .33

We can see that the AUC hasn't changed, but the log-loss has improved:

> MLmetrics::AUC(yprob_corrected, ynum) 
[1] 0.8154905

> MLmetrics::LogLoss(yprob_corrected, ynum)
[1] 0.188308

In fact, it's more in line now with what we saw in Chapter 3, Logistic Regression. It's worth exploring whether it can be close to the 0.14 log-loss and 0.81 AUC values we achieved with the MARS model on the test set:

> rf_test <- predict(forest_fit, type = "prob", newdata = test)

> rf_test <- rf_test[, 2]

> ytest <- as.numeric(ifelse(test$y == "1", 1, 0))

> MLmetrics::AUC(rf_test, ytest)
[1] 0.8149009

Nicely done on the AUC! Correct those probabilities and get the log-loss:

> rftest_corrected <- corrected_probability(result = rf_test,
population_fraction = 0.04,
sample_fraction = 0.33)

> MLmetrics::LogLoss(rftest_corrected, ytest)
[1] 0.1787402

We actually improved the log-loss versus the training data, but didn't win the battle versus MARS. What to do? Well, we're going to give XGboost a try next. We could go back and and tune the number of trees, oversampling fraction, or the number of features sampled per tree or even just say that, on this dataset, MARS did its job. It's been my experience that, on unbalanced labels, random forest will outperform MARS.

However, this case does demonstrate the power of MARS as your go-to baseline model. Let's do one further drill-down and plot the AUC curves of random forest versus MARS. Please note that this last step requires you to have executed the code in Chapter 3, Logistic Regression. If you haven't saved the results, go back and run the MARS example before proceeding with the following:

pred.rf <- ROCR::prediction(rftest_corrected, test$y)

perf.rf <- ROCR::performance(pred.rf, "tpr", "fpr")

ROCR::plot(perf.rf, main = "ROC", col = 1)

pred.earth <- ROCR::prediction(test_pred, test$y)

perf.earth <- ROCR::performance(pred.earth, "tpr", "fpr")

ROCR::plot(perf.earth, col = 2, add = TRUE)
legend(0.6, 0.6, c("RF", "MARS"), 1:2)

The output of the preceding code is as follows:

This is quite revealing. Here, we to very distinct curves where the AUC values are almost identical. Indeed, from a true positive rate of 0.4 to 0.8, random forest outperforms MARS. The key learning here is that a model performance value in and of itself isn't enough to guide model selection. 

While random forest once again proved itself a capable tool for classification, let's see how gradient boosting trees can perform.