Use case – building and applying a neural network

To close the chapter, we will discuss a more realistic use case for neural networks. We will use a public dataset by Anguita, D., Ghio, A., Oneto, L., Parra, X., and Reyes-Ortiz, J. L. (2013) in which smartphones were used to track physical activity. The data can be downloaded at https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones. The smartphones had an accelerometer and a gyroscope, from whose signals 561 features in both the time and frequency domains were derived.

The smartphones were worn during walking, walking upstairs, walking downstairs, standing, sitting, and lying down. Although this data came from phones, similar measures could be derived from other devices designed to track activity, such as various fitness-tracking watches or bands. So this data can be useful if we want to sell devices and have them automatically track how many of these different activities the wearer engages in.

This data has already been normalized to range from -1 to +1; if it had not been, we would usually want to apply some normalization ourselves. Download the data from the link and unzip it into the data folder that is on the same level as the chapter folder; we will use it in later chapters as well. We can then import the training and testing data, as well as the labels, and take a quick look at the distribution of the outcome variable with the following code:

use.train.x <- read.table("../data/UCI HAR Dataset/train/X_train.txt")
use.train.y <- read.table("../data/UCI HAR Dataset/train/y_train.txt")[[1]]

use.test.x <- read.table("../data/UCI HAR Dataset/test/X_test.txt")
use.test.y <- read.table("../data/UCI HAR Dataset/test/y_test.txt")[[1]]

use.labels <- read.table("../data/UCI HAR Dataset/activity_labels.txt")

barplot(table(use.train.y),main="Distribution of y values (UCI HAR Dataset)")

This produces the following bar plot, which shows that the categories are relatively evenly balanced:

Figure 2.9: Distribution of y values for UCI HAR dataset
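For data that has not already been scaled, the normalization mentioned above could be sketched as follows. This is a minimal illustration on a made-up toy matrix; rescale_pm1 is a hypothetical helper name, not part of the book's code, and it simply maps each column linearly onto the range -1 to +1.

```r
## Rescale each column of a numeric matrix to the range [-1, 1].
## rescale_pm1 is a hypothetical helper used only for illustration.
rescale_pm1 <- function(x) {
  x <- as.matrix(x)
  apply(x, 2, function(col) {
    rng <- range(col)
    ## a constant column carries no information; map it to 0
    if (rng[1] == rng[2]) return(rep(0, length(col)))
    2 * (col - rng[1]) / (rng[2] - rng[1]) - 1
  })
}

## toy data standing in for unscaled sensor features
toy <- cbind(a = c(0, 5, 10), b = c(100, 150, 300))
scaled <- rescale_pm1(toy)
scaled
```

In a real train/test setting, we would normally compute the minimum and maximum on the training data only and reuse them to scale the test data, so that no information from the test set leaks into the preprocessing.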

We are going to evaluate a variety of tuning parameters (hyper-parameters) to show how we might experiment with different approaches and see which model performs best.

Because the models can take some time to train and R normally only uses a single core, we will use some special packages that enable us to run multiple models in parallel. These packages are parallel, foreach, and doSNOW, which should already be loaded if you ran the script from the first line.

Now we can pick our tuning parameters and set up a local cluster as the backend for the foreach R package for parallel for loops. Note that if you do this on a machine with fewer than five cores, you should change makeCluster(5) to a lower number:

## choose tuning parameters
tuning <- list(
  size = c(40, 20, 20, 50, 50),
  maxit = c(60, 100, 100, 100, 100),
  shuffle = c(FALSE, FALSE, TRUE, FALSE, FALSE),
  params = list(FALSE, FALSE, FALSE, FALSE, c(0.1, 20, 3)))

## set up cluster using 5 cores
## load packages, export required data and variables
## and register as a backend for use with the foreach package
cl <- makeCluster(5)
clusterEvalQ(cl, {source("cluster_inc.R")})
clusterExport(cl,
  c("tuning", "use.train.x", "use.train.y",
    "use.test.x", "use.test.y")
)
registerDoSNOW(cl)
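Rather than editing the number passed to makeCluster() by hand, one option is to derive it from the machine. The following is a minimal sketch, assuming we want at most five workers and prefer to leave one core free; the variable names n.detected and n.workers are introduced here for illustration only.

```r
library(parallel)

## detectCores() can return NA on some platforms, so fall back to 1
n.detected <- detectCores()
if (is.na(n.detected)) n.detected <- 1L

## use at most 5 workers, leave one core free, never go below 1
n.workers <- max(1L, min(5L, n.detected - 1L))
n.workers

## the cluster would then be created and, once all parallel
## work is finished, released again:
## cl <- makeCluster(n.workers)
## ...
## stopCluster(cl)
```

With fewer workers than models, the loop over the five models still runs; the remaining tasks simply wait until a worker becomes free.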

Now we are ready to train all the models. The following code shows a parallel for loop, using code that is similar to what we have already seen, but this time setting some of the arguments based on the tuning parameters we previously stored in the list:

## train models in parallel
## each iteration returns a one-element list; .combine = 'c' joins them
use.models <- foreach(i = 1:5, .combine = 'c') %dopar% {
  ## params is FALSE when no custom learning parameters are wanted;
  ## a numeric vector with a non-zero first element counts as TRUE here
  if (tuning$params[[i]][1]) {
    set.seed(42)
    list(Model = mlp(
      as.matrix(use.train.x),
      decodeClassLabels(use.train.y),
      size = tuning$size[[i]],
      learnFunc = "Rprop",
      shufflePatterns = tuning$shuffle[[i]],
      learnFuncParams = tuning$params[[i]],
      maxit = tuning$maxit[[i]]
    ))
  } else {
    ## same call, but with the default learning parameters
    set.seed(42)
    list(Model = mlp(
      as.matrix(use.train.x),
      decodeClassLabels(use.train.y),
      size = tuning$size[[i]],
      learnFunc = "Rprop",
      shufflePatterns = tuning$shuffle[[i]],
      maxit = tuning$maxit[[i]]
    ))
  }
}
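The two branches in the training loop differ only in whether learnFuncParams is passed. One possible way to avoid the duplication is to build the argument list first and call the fitting function through do.call(). The sketch below shows only that pattern: fit_stub is a hypothetical stand-in that records its arguments instead of training a network, and build_args is likewise a name made up for this example.

```r
## same tuning list as in the text
tuning <- list(
  size = c(40, 20, 20, 50, 50),
  maxit = c(60, 100, 100, 100, 100),
  shuffle = c(FALSE, FALSE, TRUE, FALSE, FALSE),
  params = list(FALSE, FALSE, FALSE, FALSE, c(0.1, 20, 3)))

## hypothetical stand-in for a model-fitting function;
## it only returns the arguments it received
fit_stub <- function(size, maxit, shufflePatterns, learnFuncParams = NULL) {
  list(size = size, maxit = maxit,
       shuffle = shufflePatterns, params = learnFuncParams)
}

## assemble the arguments for model i, adding the optional
## learning parameters only when they are not FALSE
build_args <- function(i) {
  args <- list(
    size = tuning$size[[i]],
    maxit = tuning$maxit[[i]],
    shufflePatterns = tuning$shuffle[[i]])
  p <- tuning$params[[i]]
  if (!identical(p, FALSE)) args$learnFuncParams <- p
  args
}

fits <- lapply(1:5, function(i) do.call(fit_stub, build_args(i)))
str(fits[[5]])
```

The same idea should carry over to the real training call by adding the data arguments to the list, although that variant is not what the code in this chapter does.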

Because generating out-of-sample predictions can also take some time, we will do that in parallel as well. However, first we need to export the model results to each of the workers on our cluster, and then we can calculate the predictions:

## export models and calculate both in sample,
## 'fitted' and out of sample 'predicted' values
clusterExport(cl, "use.models")
use.yhat <- foreach(i = 1:5, .combine = 'c') %dopar% {
  list(list(
    Insample = encodeClassLabels(fitted.values(use.models[[i]])),
    Outsample = encodeClassLabels(predict(use.models[[i]],
      newdata = as.matrix(use.test.x)))
  ))
}
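The conversion between class labels and network outputs used above can be illustrated in base R. The sketch below is a simplified illustration of the idea, not the RSNNS implementation: labels become one indicator column per class, and predictions are mapped back with a winner-takes-all rule that returns 0 when there is no clear winner. The function name wta and the toy activations are assumptions made for this example.

```r
## labels -> indicator matrix (one column per class)
y <- c(1, 3, 2, 3)
n.classes <- 3
onehot <- diag(n.classes)[y, ]
onehot

## activations -> labels with a simple winner-takes-all rule;
## 0 marks rows where the maximum is tied or not positive
wta <- function(act) {
  apply(act, 1, function(row) {
    winner <- which.max(row)
    if (sum(row == row[winner]) > 1 || row[winner] <= 0) 0L else winner
  })
}

## toy output-layer activations for three observations
act <- rbind(c(0.9, 0.1, 0.0),
             c(0.2, 0.2, 0.1),   # tie between classes 1 and 2
             c(0.1, 0.3, 0.8))
wta(act)
```

In this toy example, the second observation has no single largest activation, so it receives the label 0 rather than a class.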

Finally, we can merge the actual and fitted or predicted values together into a dataset, calculate performance measures on each one, and store the overall results together for examination and comparison. The code for the out-of-sample performance measures is almost identical to the in-sample code; both versions follow and are also available in the code bundle provided with the book. Some additional data management is required here because a model may sometimes not predict every possible response level, which would produce non-symmetrical frequency cross tabs unless we convert the variables to factors and specify the levels. We also drop 0 values, which indicate the model was uncertain about how to classify an observation:

use.insample <- cbind(Y = use.train.y,
  do.call(cbind.data.frame, lapply(use.yhat, `[[`, "Insample")))
colnames(use.insample) <- c("Y", paste0("Yhat", 1:5))

performance.insample <- do.call(rbind, lapply(1:5, function(i) {
  f <- substitute(~ Y + x, list(x = as.name(paste0("Yhat", i))))
  use.dat <- use.insample[use.insample[, paste0("Yhat", i)] != 0, ]
  use.dat$Y <- factor(use.dat$Y, levels = 1:6)
  use.dat[, paste0("Yhat", i)] <- factor(use.dat[, paste0("Yhat", i)], levels = 1:6)
  res <- caret::confusionMatrix(xtabs(f, data = use.dat))

  cbind(Size = tuning$size[[i]],
    Maxit = tuning$maxit[[i]],
    Shuffle = tuning$shuffle[[i]],
    as.data.frame(t(res$overall[c("AccuracyNull", "Accuracy", "AccuracyLower", "AccuracyUpper")])))
}))

use.outsample <- cbind(Y = use.test.y,
  do.call(cbind.data.frame, lapply(use.yhat, `[[`, "Outsample")))
colnames(use.outsample) <- c("Y", paste0("Yhat", 1:5))

performance.outsample <- do.call(rbind, lapply(1:5, function(i) {
  f <- substitute(~ Y + x, list(x = as.name(paste0("Yhat", i))))
  use.dat <- use.outsample[use.outsample[, paste0("Yhat", i)] != 0, ]
  use.dat$Y <- factor(use.dat$Y, levels = 1:6)
  use.dat[, paste0("Yhat", i)] <- factor(use.dat[, paste0("Yhat", i)], levels = 1:6)
  res <- caret::confusionMatrix(xtabs(f, data = use.dat))

  cbind(Size = tuning$size[[i]],
    Maxit = tuning$maxit[[i]],
    Shuffle = tuning$shuffle[[i]],
    as.data.frame(t(res$overall[c("AccuracyNull", "Accuracy", "AccuracyLower", "AccuracyUpper")])))
}))
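The reason for converting both variables to factors with explicit levels can be seen in a small base R example. This is a sketch on made-up labels with three classes rather than six: when one class is never predicted, a plain cross tab loses a column, whereas factors with fixed levels keep the table square so that the diagonal still lines up with the correct classes.

```r
## toy labels: class 3 is never predicted
actual    <- c(1, 2, 3, 3, 2, 1)
predicted <- c(1, 2, 2, 2, 2, 1)

## without fixed levels the table is 3 x 2 (non-symmetrical)
tab.raw <- table(actual, predicted)
dim(tab.raw)

## with fixed levels the table is 3 x 3, including an empty column
lv <- 1:3
tab.sq <- table(Y = factor(actual, levels = lv),
                Yhat = factor(predicted, levels = lv))
tab.sq

## accuracy is the share of observations on the diagonal
accuracy <- sum(diag(tab.sq)) / sum(tab.sq)
accuracy
```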

If we print the in-sample and out-of-sample performance, we can see how each of our models did and the effect of varying some of the tuning parameters. The output is shown below. The fourth column (null accuracy) is dropped as it is not as important for this comparison:

options(width = 80, digits = 3)
performance.insample[,-4]
  Size Maxit Shuffle Accuracy AccuracyLower AccuracyUpper
1   40    60   FALSE    0.984         0.981         0.987
2   20   100   FALSE    0.982         0.978         0.985
3   20   100    TRUE    0.982         0.978         0.985
4   50   100   FALSE    0.981         0.978         0.984
5   50   100   FALSE    1.000         0.999         1.000

performance.outsample[,-4]
  Size Maxit Shuffle Accuracy AccuracyLower AccuracyUpper
1   40    60   FALSE    0.916         0.906         0.926
2   20   100   FALSE    0.913         0.902         0.923
3   20   100    TRUE    0.913         0.902         0.923
4   50   100   FALSE    0.910         0.900         0.920
5   50   100   FALSE    0.938         0.928         0.946

As a reminder, the in-sample results evaluate the predictions on the training data and the out-of-sample results evaluate the predictions on the holdout (or test) data. The best set of hyper-parameters is the last set, where we get an accuracy of 93.8% on unseen data. This shows that we are able to classify the types of activity people are engaged in quite accurately based on the data from their smartphones. We can also see that the last model, which fits the in-sample data almost perfectly, is also the best out of sample here, although a better in-sample fit does not always translate into better out-of-sample performance.

For each model, there is a large difference between the in-sample accuracy and the out-of-sample accuracy; the models clearly overfit. We will get into ways to combat this overfitting in Chapter 3, Deep Learning Fundamentals, as we train deep neural networks with multiple hidden layers.

Despite the lower out-of-sample performance, the models still do well, far better than chance alone (roughly 17% if we guessed at random among six balanced classes), and, for our example use case, we could pick the best model and be quite confident that it will provide a good classification of a user's activities.
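The gap between in-sample and out-of-sample accuracy discussed above can be tabulated directly. The sketch below copies the accuracy values from the printed output into a small data frame (rather than reusing performance.insample and performance.outsample, so that it runs on its own); the names perf, Gap, and best are introduced for this example.

```r
## accuracies copied from the printed output above
perf <- data.frame(
  Model = 1:5,
  Insample = c(0.984, 0.982, 0.982, 0.981, 1.000),
  Outsample = c(0.916, 0.913, 0.913, 0.910, 0.938)
)

## difference between training and holdout accuracy per model
perf$Gap <- perf$Insample - perf$Outsample

## models ordered by out-of-sample accuracy, best first
perf[order(-perf$Outsample), ]

## index of the model with the highest out-of-sample accuracy
best <- perf$Model[which.max(perf$Outsample)]
best
```

Every model loses more than six percentage points of accuracy when moving from the training data to the holdout data, which is the overfitting referred to above.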