Loading and preparing the data

To get the data into your working directory, you can find it on my GitHub at this link: hhttps://github.com/PacktPublishing/Advanced-Machine-Learning-with-R/blob/master/Data/ames.csv.

The file we're using is ames.csv. This data is from the sales of homes sold in Ames, Iowa, which is the location of Iowa State University, and I believe has a population of around 70,000. I downloaded the data from Kaggle.com, and the response we're trying to predict is the final sales price. It's a nice size to practice machine learning methods with 1,460 observations of 84 features, and many of the features are categorical.

Before we load the data, if not already done, load the necessary packages, call the magrittr library, and, if you so choose, update the options. I prefer not to have scientific number notation and want to round the values to four decimals:

library(magrittr)
options(scipen = 999)
options(digits = 4)
# install.packages("caret")
# install.packages("ggthemes")
# install.packages("janitor")
# install.packages("leaps")
# install.packages("plm")
# install.packages("readr")
# install.packages("sjmisc")
# install.packages("tidyverse")
# install.packages("vtreat")

Now, load the data and confirm the dimensions:

> ames <- readr::read_csv(~/ames.csv")

> dim(ames)

[1] 1460 84

I don't believe there are any duplicate observations, but let's confirm:

> dupes <- duplicated(ames)

> table(dupes)
dupes
FALSE
1460

Excellent! There are no duplicates. Here, we create a tibble of the descriptive statistics. Open the data in RStudio and explore it by feature to get a feel for it:

> ames %>%
sjmisc::descr() -> ames_descr

> View(ames_descr)

There are some thought-provoking features but focus first on Id. Notice that this has a unique value for all of the observations. Hence, we can remove it as it has no value in predictions:

> range(ames$Id)
[1] 1 1460

> ames <- ames[, -1]

Three other features are interesting as they are the year that an event happened. Instead of the year as the value of the feature, how about we create a feature of years since the event? This is easy to do by taking YrSold and subtracting in sequence YearBuilt, YearRemodAdd, and GarageYrBuilt just like this:

 > ames %>%
dplyr::mutate(yearsOld = YrSold - YearBuilt) -> ames

> ames %>%
dplyr::mutate(yearsRemodel = YrSold - YearRemodAdd) -> ames

> ames %>%
dplyr::mutate(yearsGarage = YrSold - GarageYrBlt) -> ames

Another thing of interest when you look at the descriptive stats is the fact that GarageYrBlt has roughly 5.5% missing values. So, yearsGarage will have a corresponding amount of missing values. As is my standard procedure, I want us to code a dummy feature that indicates missing values and changes those missing values to zero. 

I'm not sure that any imputation here would add value:

> ames$yearsGarage_isNA <- 
ifelse(is.na(ames$yearsGarage), 1, 0)

> ames$yearsGarage[is.na(ames$yearsGarage)] <- 0

Let's remove those unnecessary features given that we created a new feature of years since the event:

 > ames <- ames[, c(-19, -20, -59)]

Another feature of interest is MoSold. This is a numeric that corresponds to the month it was sold, so 1 = January, 2 = February, and so on. This is probably better conditioned as a character feature and will end up as dummy features during one-hot encoding:

> ames$MoSold <- as.character(ames$MoSold)

The one plot we should look at is of the response, which is SalesPrice. I like to try out different plot themes, so I'll use different themes for different plots for illustration purposes, which should help you discover your favorite ones:

> ggplot2::ggplot(ames, ggplot2::aes(x = SalePrice)) + 
ggplot2::geom_histogram() +
ggthemes::theme_few()

The output of the preceding code is as follows:

The histogram shows the data is skewed to the right. In non-linear methods, this may not be a problem, but in linear models, you can usually count on biased estimates and/or severe problems with outliers in your residuals. It seems like a good idea to transform this using the natural log:

> ames %>%
dplyr::mutate(logSales = log(SalePrice)) -> ames

> ggplot2::ggplot(ames, ggplot2::aes(x = logSales)) +
ggplot2::geom_histogram() +
ggthemes::theme_economist_white()

The output of the preceding code is as follows:

We now have a much more normal distribution but can see some potentially problematic outliers of homes selling at meager and very high prices.

My usual next step is to finalize any missing values in features of interest. Again, we code a dummy feature and turn the missing values into zero:

> ames$LotFrontage_isNA <- 
ifelse(is.na(ames$LotFrontage), 1, 0)

> ames$LotFrontage[is.na(ames$LotFrontage)] <- 0

> ames$MasVnrArea_isNA <-
ifelse(is.na(ames$MasVnrArea), 1, 0)

> ames$MasVnrArea[is.na(ames$MasVnrArea)] <- 0

I don't believe we have any zero variance features (we removed Id) but let's double-check:

> feature_variance <- caret::nearZeroVar(ames, saveMetrics = TRUE)

> table(feature_variance$zeroVar)

FALSE
84

All good! We now come to the point where we can safely split the data into training and testing sets. I guess you could call the training set a validation set, as the real test data is a separate file that you would submit to Kaggle for evaluation. That is out of scope here; hence, I call our holdout sample test

In this example, let's use an 80/20 split:

> set.seed(1944)

> ames %>%
dplyr::sample_frac(.8) -> train

> ames %>%
dplyr::anti_join(train) -> test

If you look in the Global Environment tab of RStudio, you'll see that train has 1,168 observations and test 292 observations.

We now come to the point where we're almost ready to treat the training data. However, let's create an object called varlist, which we'll feed into the treat function, which is the predictor features, and generate response variables:

> varlist = colnames(ames[, !colnames(ames) %in% c('SalePrice', 'logSales')])

> train_y <- train$SalePrice
> train_logy <- train$logSales
> test_y <- test$SalePrice
> test_logy <- test$logSales

Now you can design a treatment scheme. Do this by only treating the training data, so you don't bias your model building. As you'll see, you can apply that treatment scheme to the test data or any currently unseen data for that matter. We'll just specify our training data, varlist, and set minFraction for coding character feature levels to 10%:

> df_treatment <- vtreat::designTreatmentsZ(
dframe = train,
varlist = varlist,
minFraction = 0.10
)
For a further discussion on designing data treatment strategies, refer to  Chapter 1, Preparing and Understanding Data.

Now, apply the treatment to both train and test datasets:

> trained <- vtreat::prepare(df_treatment, train)

> tested <- vtreat::prepare(df_treatment, test)

Notice that we now have 155 features in each of these treated datasets. Feel free to explore them, keeping in mind how the features are renamed as discussed in Chapter 1Preparing and Understanding Data.

As I stated in the previous chapter, we can drop the _catP features and rename the others as in the following code:

> trained <- 
trained %>%
dplyr::select(-dplyr::contains('_catP'))

> tested <-
tested %>%
dplyr::select(-dplyr::contains('_catP'))

> colnames(trained) <-
sub('_clean', "", colnames(trained))

> colnames(tested) <-
sub('_clean', "", colnames(tested))

> colnames(trained) <-
sub('_isBAD', "_isNA", colnames(trained))

> colnames(tested) <-
sub('_isBAD', "_isNA", colnames(tested))

Just removing the category percentage features reduced the number of them to 114. The final step before moving on to model creation is to remove highly correlated pairs of features and verify there are no linear dependencies. In linear models, this is critical to sort out. During one-hot encoding, if you create as many indicator/dummy features as levels in the parent categorical feature, you would fall into the dummy variable trap, which results in perfect multicollinearity. The classic example is a feature with levels of only male or female. One-hot would give us two features, whereas it should be encoded to one feature with, say, female = 1 and male = 0. Then, in the linear regression, the expectation for male would just be the intercept B0, while for female it would be B0 + B1x.

As for correlation, we could explore the various relationships in depth, as discussed in Chapter 1Preparing and Understanding Data. Given the size of this data, let's identify and remove those pairs with correlation greater than 0.79. I encourage you to experiment with this specification:

> df_corr <- cor(trained)

> high_corr <- caret::findCorrelation(df_corr, cutoff = 0.79)

> length(high_corr)
[1] 19

There are 19 features we can eliminate. As I stated, you can examine this problem in more depth, but let's proceed by merely removing them:

trained <- trained[, -high_corr]

For linear dependencies, the caret package comes in handy again. To be sure, I like to double check with the detect_lin_dep() function:

> caret::findLinearCombos(trained)
$`linearCombos`
list()

$`remove`
NULL

> # linear dependency
> plm::detect_lin_dep(trained)
[1] "No linear dependent column(s) detected."

The results from the caret package tell us there are no features to remove since no dependency exists, and the plm package confirms this.

We'll now move on to training our model. This should be interesting!