Data preparation

We now create our training and test data using a 70/30 split, then subject the training data to the standard feature exploration we started discussing in Chapter 1, Preparing and Understanding Data, with these tasks in mind:

  • Eliminate low variance features
  • Identify and remove linear dependencies
  • Explore highly correlated features

The first step is to turn the numeric outcome into a factor, which we then use to create a stratified data index, like so:

> y_factor <- as.factor(y)

> set.seed(1492)

> index <- caret::createDataPartition(y_factor, p = 0.7, list = FALSE)

Using the index, we create train/test input features and labels:

> train <- x[index, ]

> train_y <- y_factor[index]

> test <- x[-index, ]

> test_y <- y_factor[-index]
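
To see what stratification buys us, here is a minimal base R sketch of a stratified 70/30 split on toy labels. It is not caret's implementation, and the names toy_y, rows_by_class, and toy_index are hypothetical stand-ins. The idea is to sample 70% of the rows within each class, so the class proportions carry over to both partitions:

```r
# Toy labels: a hypothetical stand-in for y_factor, imbalanced on purpose.
set.seed(1492)
toy_y <- factor(rep(c("0", "1"), times = c(700, 300)))

# Sample 70% of the row indices within each class, then pool them.
rows_by_class <- split(seq_along(toy_y), toy_y)
sampled <- lapply(rows_by_class, function(rows) {
  sample(rows, size = round(0.7 * length(rows)))
})
toy_index <- sort(unlist(sampled, use.names = FALSE))

# Class proportions should match across the full, train, and test labels.
full_prop  <- prop.table(table(toy_y))
train_prop <- prop.table(table(toy_y[toy_index]))
test_prop  <- prop.table(table(toy_y[-toy_index]))
print(rbind(full = full_prop, train = train_prop, test = test_prop))
```

Running the same prop.table(table(...)) check on train_y and test_y is a quick way to confirm that the real split kept the outcome balance intact.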

With our training data in hand, let's find and eliminate the low variance features. I can state in advance that there are quite a few:

> train_NZV <- caret::nearZeroVar(train, saveMetrics = TRUE)

> table(train_NZV$nzv)

FALSE  TRUE
   48    74

> table(train_NZV$zeroVar)

FALSE  TRUE
  121     1

We see that 74 features are low variance, and one of those is zero variance. Let's rid ourselves of these pesky features:

> train_r <- train[train_NZV$nzv == FALSE]
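
For intuition about what gets flagged, the following base R sketch computes the two metrics that nearZeroVar reports: the frequency ratio (most common value's count over the runner-up's count) and the percentage of unique values. It is a simplified re-implementation under stated assumptions, not caret's code. The function nzv_metrics and the toy columns are hypothetical, the cutoffs mirror caret's documented defaults (freqCut = 95/5, uniqueCut = 10), and caret's exact handling of boundary cases may differ:

```r
# Simplified version of the near-zero-variance metrics; not caret's actual code.
nzv_metrics <- function(df, freq_cut = 95 / 5, unique_cut = 10) {
  one_col <- function(v) {
    counts <- sort(table(v), decreasing = TRUE)
    # Ratio of the most common value's count to the runner-up's count.
    freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else 0
    pct_unique <- 100 * length(counts) / length(v)
    zero_var <- length(counts) == 1
    data.frame(
      freqRatio = freq_ratio,
      percentUnique = pct_unique,
      zeroVar = zero_var,
      nzv = zero_var || (freq_ratio > freq_cut && pct_unique < unique_cut)
    )
  }
  do.call(rbind, lapply(df, one_col))
}

toy <- data.frame(
  balanced = rep(c(0, 1), each = 50),  # 50/50 split: kept
  rare     = c(rep(0, 99), 1),         # 99 to 1: near-zero variance
  constant = rep(1, 100)               # single value: zero variance
)
toy_metrics <- nzv_metrics(toy)
print(toy_metrics)

# Keep only the columns that were not flagged.
toy_reduced <- toy[, !toy_metrics$nzv, drop = FALSE]
```

A column that is almost always the same value carries little signal and can end up constant within a cross-validation fold, which is why it is dropped before modeling.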

Given our new dataframe of reduced features, we now identify and eliminate linear dependency combinations:

> linear_combos <- caret::findLinearCombos(x = train_r)

> linear_combos
$`linearCombos`
$`linearCombos`[[1]]
[1] 13 1 2 3 4 5 9 10 11 12

$`linearCombos`[[2]]
[1] 19 16

$`linearCombos`[[3]]
[1] 20 15

$`linearCombos`[[4]]
[1] 22 1 2 3 4 5 15 16 18 21

$`linearCombos`[[5]]
[1] 40 1 2 3 4 5 39

$`linearCombos`[[6]]
[1] 42 1 2 3 4 5 41

$`linearCombos`[[7]]
[1] 47 1 2 3 4 5 43 44 45 46

$remove
[1] 13 19 20 22 40 42 47

The output lists seven linear dependencies and recommends removing seven features. Each number in $remove is a column index in the dataframe, and it is the first index listed in the corresponding combination. For example, in combination 2, indices 19 and 16 correspond to the columns named V36 and V22. Here's a table of these two features for demonstration purposes:

> table(train_r$V36, train_r$V22)

       0    1
  0 3032    0
  1    0 1459

It's clear these two features are measuring the same thing. We'll remove those recommended, but there's one more thing to discuss. When doing cross-validation during the modeling process, you may run into warnings that linear dependencies exist even though you ran this methodology. I found that to be the case with this dataset in the modeling exercises that follow. After some exploration of features V1 through V5, I found that, by dropping V5, this was no longer a problem. Let's proceed with that in mind:

> train_r <- train_r[, -linear_combos$remove]

> train_r <- train_r[, -5]

> plm::detect_lin_dep(train_r)
[1] "No linear dependent column(s) detected."
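
The cross-validation warning described above can be reproduced on toy data. The sketch below uses base R's qr() to flag dependent columns. caret's findLinearCombos is also built on a QR decomposition, but this is not its code, and every name here (toy, dependent_cols, d1 to d3) is hypothetical. It also illustrates one plausible reason, which the source does not confirm, why dropping V5 helped: if V1 through V5 were indicator columns for a single categorical variable, they would sum to 1 in every row and therefore be collinear with a model's intercept. That dependency only appears once an intercept column is added, so a check on the features alone would miss it:

```r
# Hypothetical toy data: a three-level category one-hot encoded as d1, d2, d3,
# a duplicate of d1 (like V36 and V22 above), and an unrelated numeric column.
set.seed(1492)
level <- rep(1:3, times = 20)
toy <- cbind(
  d1 = as.numeric(level == 1),
  d2 = as.numeric(level == 2),
  d3 = as.numeric(level == 3),
  d1_copy = as.numeric(level == 1),
  noise = rnorm(60)
)

# Columns pivoted past the numerical rank depend linearly on earlier columns.
dependent_cols <- function(m) {
  decomp <- qr(m)
  if (decomp$rank == ncol(m)) return(character(0))
  colnames(m)[decomp$pivot[(decomp$rank + 1):ncol(m)]]
}

# The duplicate is flagged on the raw features.
dup_found <- dependent_cols(toy)

# After removing it, the features alone look clean.
toy_clean <- toy[, colnames(toy) != "d1_copy"]
clean_found <- dependent_cols(toy_clean)

# Adding an intercept exposes the one-hot set: d1 + d2 + d3 equals the intercept.
with_intercept <- dependent_cols(cbind(intercept = 1, toy_clean))

# Dropping one indicator (the analogue of dropping V5) resolves it.
toy_fixed <- toy_clean[, colnames(toy_clean) != "d3"]
fixed_found <- dependent_cols(cbind(intercept = 1, toy_fixed))

print(list(dup_found, clean_found, with_intercept, fixed_found))
```

If this reading applies to the real data, it would also explain why the features alone pass a dependency check while models that fit an intercept still complain.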

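Next, we check for any pairwise correlations above 0.7 and remove a feature if it's highly correlated with another. The findCorrelation function expects a correlation matrix, so we compute one from train_r first:

> my_data_cor <- cor(train_r)

> high_corr <- caret::findCorrelation(my_data_cor, cutoff = 0.7)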

> high_corr
[1] 29

> train_df <- train_r[, -high_corr]

The code found and removed a single feature, the one at column index 29. We now have a dataframe ready for modeling. If you want to look at a correlation heatmap, then run this handy function from the DataExplorer package:

> DataExplorer::plot_correlation(train_df)

The output of the preceding code is as follows:

Notice that features V67 and V71 are highly correlated. In a real-world setting, this would probably warrant further investigation, but we'll feed both into our learning algorithms, as no subject matter expert can tell us otherwise.
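
A heatmap shows where the strong correlations are, but not their exact values. As a complement, here is a minimal base R sketch that lists every feature pair whose absolute correlation exceeds a cutoff. The helper high_corr_pairs and the toy columns are hypothetical; with the chapter's data, you would pass train_df instead of toy:

```r
# Hypothetical toy frame; with the chapter's data, pass train_df instead.
set.seed(1492)
a <- rnorm(200)
toy <- data.frame(
  a = a,
  b = a + rnorm(200, sd = 0.1),  # built to track `a` closely
  c = rnorm(200)                 # independent of the others
)

high_corr_pairs <- function(df, cutoff = 0.7) {
  cor_mat <- cor(df)
  # Blank the diagonal and lower triangle so each pair appears once.
  cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA
  hits <- which(abs(cor_mat) > cutoff, arr.ind = TRUE)
  data.frame(
    feature_1 = rownames(cor_mat)[hits[, "row"]],
    feature_2 = colnames(cor_mat)[hits[, "col"]],
    correlation = cor_mat[hits],
    stringsAsFactors = FALSE
  )
}

toy_pairs <- high_corr_pairs(toy)
print(toy_pairs)
```

A pairwise listing like this makes it easier to decide, feature by feature, which member of a correlated pair to keep.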

We can now proceed with our model training, starting with KNN, then SVM, and comparing their performance.