Support vector machine

Recall from a previous section that the first thing we did was run recursive feature elimination (RFE) to reduce the number of input features. We'll repeat that step here, starting by redefining our control function:

> ctrl <- caret::rfeControl(
    functions = caret::lrFuncs,
    method = "cv",
    number = 10,
    verbose = TRUE
  )

I say we shoot for around 20 to 30 total features and set our random seed:

> subsets <- c(20:30)

> set.seed(54321)

When selecting the features, you can use either a linear SVM or one of the kernel functions. Let's proceed with linear, which means the method we specify in the following call will be svmLinear. If you wanted to switch to a polynomial kernel, you would specify svmPoly instead, or svmRadial for the radial basis function:

> svmProfile <- caret::rfe(
    train_df,
    train_y,
    sizes = subsets,
    rfeControl = ctrl,
    method = "svmLinear",
    metric = "Kappa"
  )

> svmProfile
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold)
Resampling performance over subset size:
 Variables Accuracy  Kappa AccuracySD KappaSD Selected
        20   0.8357 0.5206   0.008253 0.02915
        21   0.8350 0.5178   0.008624 0.03091
        22   0.8359 0.5204   0.008277 0.02948
        23   0.8361 0.5220   0.009435 0.02979
        24   0.8383 0.5292   0.008560 0.02572        *
        25   0.8375 0.5261   0.008067 0.02323
        26   0.8379 0.5290   0.010193 0.02905
        27   0.8375 0.5276   0.009205 0.02667
        28   0.8372 0.5259   0.008770 0.02437
        29   0.8361 0.5231   0.008074 0.02319
        30   0.8368 0.5252   0.008069 0.02401
        39   0.8377 0.5290   0.009290 0.02711

The top 5 variables (out of 24):
V74, V35, V22, V78, V20

Both Kappa and accuracy peak at 24 features. Notice that the top five features are the same as when we ran this with KNN. Here's how to plot the Kappa score against the number of features:

> svm_results <- svmProfile$results

> ggplot2::ggplot(svm_results, ggplot2::aes(Variables, Kappa)) +
    ggplot2::geom_line(color = 'steelblue', size = 2) +
    ggthemes::theme_fivethirtyeight()

The output of the preceding code is as follows:

Let's select a dataframe with only the optimal features:

> svm_vars <- svmProfile$optVariables

> x_selected <-
train_df[, (colnames(train_df) %in% svm_vars)]
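
One aside on the RFE step before we move on to tuning. caret's documentation pairs a model method argument in rfe() with caretFuncs, which wraps train(), whereas lrFuncs is built around logistic regression. If you want the elimination itself to be driven by a linear SVM, a minimal sketch might look like the following. It runs on simulated data from caret::twoClassSim() rather than on our train_df, and the object names (sim, ctrl_svm, svm_rfe_sketch), the subset sizes, and the 5-fold setting are all illustrative assumptions, so expect different numbers from the output shown earlier:

```r
# Sketch only: simulated data stands in for train_df / train_y.
set.seed(54321)
sim <- caret::twoClassSim(n = 200)      # predictors plus a factor column, Class
sim_x <- sim[, names(sim) != "Class"]
sim_y <- sim$Class

# caretFuncs hands the model fitting to caret::train(),
# so the `method` argument given to rfe() is passed through to it.
ctrl_svm <- caret::rfeControl(
  functions = caret::caretFuncs,
  method = "cv",
  number = 5,
  verbose = FALSE
)

svm_rfe_sketch <- caret::rfe(
  sim_x,
  sim_y,
  sizes = c(5, 10),        # assumed subset sizes for the toy data
  rfeControl = ctrl_svm,
  method = "svmLinear",    # forwarded to train()
  metric = "Kappa"
)

# Names of the predictors kept at the best subset size
svm_rfe_sketch$optVariables
```

Because every resample now fits an SVM through train(), this is noticeably slower than the lrFuncs version, which is worth keeping in mind on a dataset the size of ours.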

With our features selected, we can train a model with cross-validation and, in the process, tune the hyperparameter C. Recall that this is the regularization parameter. We'll go forward with caret's train() function:

> grid <- expand.grid(.C = c(1, 2, 3))

> svm_control <- caret::trainControl(method = 'cv', number = 10)

> set.seed(1918)

> svm <- caret::train(x_selected,
                      train_y,
                      method = "svmLinear",
                      trControl = svm_control,
                      tuneGrid = grid,
                      metric = "Kappa")

> svm
Support Vector Machines with Linear Kernel

4491 samples
24 predictor
2 classes: '0', '1'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 4041, 4042, 4042, 4041, 4042, 4043, ...
Resampling results across tuning parameters:

  C  Accuracy   Kappa
  1  0.8372287  0.5223355
  2  0.8367833  0.5210972
  3  0.8374514  0.5229846

Kappa was used to select the optimal model using the
largest value.
The final value used for the model was C = 3.
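
To get a feel for what C does, here is a small sketch on made-up data, not on our training set. It fits a linear SVM with kernlab at three costs and counts the support vectors. The toy data, the cost values, and the names toy_x, toy_y, costs, and n_sv are assumptions for illustration only. A small C tolerates more margin violations, which typically leaves more points as support vectors, while a large C penalizes violations more heavily:

```r
# Sketch only: two overlapping Gaussian clouds stand in for real data.
set.seed(1918)
toy_x <- rbind(
  matrix(rnorm(100, mean = -1), ncol = 2),  # 50 points for class "0"
  matrix(rnorm(100, mean = 1), ncol = 2)    # 50 points for class "1"
)
toy_y <- factor(rep(c("0", "1"), each = 50))

costs <- c(0.01, 1, 100)

# Fit a linear SVM at each cost and record the number of support vectors.
n_sv <- sapply(costs, function(cost) {
  fit <- kernlab::ksvm(toy_x, toy_y, kernel = "vanilladot", C = cost)
  kernlab::nSV(fit)
})

data.frame(C = costs, support_vectors = n_sv)
```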

Excellent! The optimal value is C = 3, so let's build that model. By the way, be sure to specify that we want a probability model with prob.model = TRUE. The linear kernel is specified with vanilladot:

> svm_fit <-
    kernlab::ksvm(
      as.matrix(x_selected),
      train_y,
      kernel = "vanilladot",
      prob.model = TRUE,
      kpar = "automatic",
      C = 3
    )

Do we want a dataframe of predicted probabilities on the train data? I'm glad you asked:

> svm_pred_train <-
kernlab::predict(svm_fit, x_selected, type = "probabilities")

> svm_pred_train <- data.frame(svm_pred_train)

The density plot that follows looks about as good as what we saw with KNN:

> classifierplots::density_plot(train_y, svm_pred_train$X1)

The output of the preceding code is as follows:

Before moving on to the test data, let's check two things, the AUC and the optimal score cutoff:

> Metrics::auc(train_y, svm_pred_train$X1)
[1] 0.8940114

> InformationValue::optimalCutoff(train_y, svm_pred_train$X1)
[1] 0.3879227
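
If you're curious what those two numbers measure, the following base R sketch computes both by hand on six made-up observations. The helper auc_rank(), the toy vectors, and the grid of cutoffs are assumptions for illustration and not what the Metrics or InformationValue packages run internally. AUC is the share of positive/negative pairs in which the positive case gets the higher score, and the cutoff search simply looks for the threshold with the lowest misclassification rate:

```r
# Sketch only: hypothetical labels and scores, not our model output.
actual <- c(0, 0, 1, 0, 1, 1)
score  <- c(0.10, 0.40, 0.35, 0.20, 0.80, 0.70)

# AUC via the rank-sum (Mann-Whitney) identity.
auc_rank <- function(actual, score) {
  r <- rank(score)                 # ties receive average ranks
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_rank(actual, score)            # 8 of the 9 pairs are ordered correctly

# Misclassification rate at each candidate cutoff.
cutoffs <- seq(0.05, 0.95, by = 0.05)
misclass <- sapply(cutoffs, function(cut) {
  mean(as.numeric(score >= cut) != actual)
})

# First cutoff that reaches the lowest error rate
best_cutoff <- cutoffs[which.min(misclass)]
best_cutoff
```

On real data several cutoffs often tie for the lowest error, so the one reported depends on how ties are broken.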

OK, the AUC is lower than what KNN achieved on the training data, but the real proof is in our test data:

> test_svm <- test[, (colnames(test) %in% svm_vars)]

> svm_pred_test <-
kernlab::predict(svm_fit, test_svm, type = "probabilities")

> svm_pred_test <- as.data.frame(svm_pred_test)
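
You may have noticed that the positive-class column is X1 in the training predictions but `1` in the test predictions. That comes down to data.frame() versus as.data.frame(), as this small base R sketch shows. The matrix probs is a made-up stand-in for a probability matrix whose columns are named after the class levels:

```r
# Sketch only: a hypothetical 2 x 2 probability matrix with class-level column names.
probs <- matrix(
  c(0.9, 0.1,
    0.3, 0.7),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("0", "1"))
)

names(data.frame(probs))       # "X0" "X1": data.frame() makes the names syntactic
names(as.data.frame(probs))    # "0" "1": as.data.frame() keeps them as they are

# The same column, reached two ways.
data.frame(probs)$X1
as.data.frame(probs)$`1`
```

Either form works, as long as the column name you use later matches the conversion you chose.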

I insist we take a look at the density plot:

> classifierplots::density_plot(test_y, svm_pred_test$`1`)

The output of the preceding code is as follows:

I would put forward that we have a good overall fit here:

> Metrics::auc(test_y, svm_pred_test$`1`)
[1] 0.8951011

That's more like it: the test AUC (0.895) is essentially the same as the training AUC (0.894), which points to a good bias/variance tradeoff. We can start the overall comparison with KNN by moving forward with the confusion matrix and relevant stats:

> svm_pred_class <- as.factor(ifelse(svm_pred_test$`1` >= 0.275, "1", "0"))

> caret::confusionMatrix(data = svm_pred_class, reference = test_y, positive = "1")
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1206  104
         1  247  366
Accuracy : 0.8175
95% CI : (0.7995, 0.8345)
No Information Rate : 0.7556
P-Value [Acc > NIR] : 0.00000000004314737

Kappa : 0.5519
Mcnemar's Test P-Value : 0.00000000000003472

Sensitivity : 0.7787
Specificity : 0.8300
Pos Pred Value : 0.5971
Neg Pred Value : 0.9206
Prevalence : 0.2444
Detection Rate : 0.1903
Detection Prevalence : 0.3188
Balanced Accuracy : 0.8044

'Positive' Class : 1
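
Since Kappa and balanced accuracy carry the comparison that follows, here is a short base R sketch that recomputes them from the four counts in the confusion matrix above. The object names (cm, cohen_kappa, and so on) are my own. Kappa compares the observed agreement with the agreement you would expect by chance given the row and column totals:

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference).
cm <- matrix(
  c(1206, 104,
     247, 366),
  nrow = 2, byrow = TRUE,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

n <- sum(cm)
observed <- sum(diag(cm)) / n                      # accuracy
expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
cohen_kappa <- (observed - expected) / (1 - expected)

sensitivity <- cm["1", "1"] / sum(cm[, "1"])       # true positives / actual positives
specificity <- cm["0", "0"] / sum(cm[, "0"])       # true negatives / actual negatives
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(Accuracy = observed,
        Kappa = cohen_kappa,
        BalancedAccuracy = balanced_accuracy), 4)
```

The rounded values come out to 0.8175, 0.5519, and 0.8044, matching what caret reported.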

Comparing the results across methods, we see better values for the SVM almost across the board, notably a better Kappa and better balanced accuracy. In the past couple of chapters, we've produced ROC plots where the various models were overlaid on the same plot. We can recreate that plot here as follows:

> pred.knn <- ROCR::prediction(knn_pred_test$X1, test_y)

> perf.knn <- ROCR::performance(pred.knn, "tpr", "fpr")

> ROCR::plot(perf.knn, main = "ROC", col = 1)

> pred.svm <- ROCR::prediction(svm_pred_test$`1`, test_y)

> perf.svm <- ROCR::performance(pred.svm, "tpr", "fpr")

> ROCR::plot(perf.svm, col = 2, add = TRUE)

> legend(0.6, 0.6, c("KNN", "SVM"), 1:2)

The output of the preceding code is as follows:

The plot shows a clear separation between the curves of the two models. Therefore, given what we've done here, the SVM algorithm performed better than KNN. That said, we could try a number of different approaches to improve either algorithm, including a different feature selection, a different weighting for KNN, or different kernels for SVM.