Modeling and evaluation

One of the key elements that we will cover here is the very important task of feature selection. In this chapter, we will discuss the stepwise and best subsets regression methods, using the leaps package. Later chapters will cover more advanced techniques.

Forward stepwise selection starts with a model that has zero features; it then adds features one at a time until all the features are added. At each step, the feature added is the one that produces the model with the lowest RSS. So, in theory, the first feature selected should be the one that explains the response variable better than any of the others, and so on.

It is important to note that adding a feature will always decrease RSS and increase R-squared, but it will not necessarily improve the model fit and interpretability.

Backward stepwise regression begins with all the features in the model and removes the least useful, one at a time. A hybrid approach is available where the features are added through forward stepwise regression, but the algorithm then examines if any features that no longer improve the model fit can be removed. Once the model is built, the analyst can examine the output and use various statistics to select the features they believe provide the best fit.

It is important to add here that stepwise techniques can suffer from serious issues. You can perform a forward stepwise selection on a dataset, then a backward stepwise selection, and end up with two completely conflicting models. The bottom line is that stepwise selection can produce biased regression coefficients; in other words, the coefficients are too large and their confidence intervals are too narrow (Tibshirani, 1996).
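If you would nonetheless like to experiment with stepwise selection, base R's step() function will do it for you. The following is a minimal sketch, not part of the analysis in this chapter; note that step() selects on AIC rather than raw RSS, and the sketch assumes the socal.water data frame prepared earlier in the chapter:

    > null.fit <- lm(BSAAM ~ 1, data = socal.water) #intercept-only model
    > full.fit <- lm(BSAAM ~ ., data = socal.water) #model with all the features
    > step(null.fit, scope = formula(full.fit), direction = "forward") #forward selection
    > step(full.fit, direction = "backward") #backward elimination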

Best subsets regression can be a satisfactory alternative to the stepwise methods for feature selection. In best subsets regression, the algorithm fits a model for all the possible feature combinations; so if you have 3 features, 7 models will be created. As with stepwise regression, the analyst will need to apply judgment or statistical analysis to select the optimal model. Model selection will be the key topic in the discussion that follows. As you might have guessed, if your dataset has many features, this can be quite a task, and the method does not perform well when you have more features than observations (p is greater than n).
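To get a sense of how quickly the exhaustive search grows, recall that p features yield 2^p - 1 candidate models (every non-empty subset of the features). A quick check in R, purely for illustration:

    > p <- c(3, 10, 20)
    > data.frame(features = p, models = 2^p - 1) #7, 1023, and 1048575 candidate models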

These limitations of best subsets certainly do not apply to our task at hand. Given the issues with stepwise selection described previously, we will forgo it, but please feel free to give it a try. We will begin by loading the leaps package. So that we can see how feature selection works, we will first build and examine a model with all the features, and then drill down with best subsets to select the best fit.

To build a linear model with all the features, we can again use the lm() function. It will follow the form: fit = lm(y ~ x1 + x2 + x3...xn). A neat shortcut, if you want to include all the features, is to use a period after the tilde symbol instead of having to type them all in. For starters, let's load the leaps package and build a model with all the features for examination as follows:

    > library(leaps)

> fit <- lm(BSAAM ~ ., data = socal.water)

> summary(fit)

Call:
lm(formula = BSAAM ~ ., data = socal.water)

Residuals:
   Min     1Q Median     3Q    Max 
-12690  -4936  -1424   4173  18542 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15944.67    4099.80   3.889 0.000416 ***
APMAM         -12.77     708.89  -0.018 0.985725    
APSAB        -664.41    1522.89  -0.436 0.665237    
APSLAKE      2270.68    1341.29   1.693 0.099112 .  
OPBPC          69.70     461.69   0.151 0.880839    
OPRC         1916.45     641.36   2.988 0.005031 ** 
OPSLAKE      2211.58     752.69   2.938 0.005729 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7557 on 36 degrees of freedom
Multiple R-squared: 0.9248, Adjusted R-squared: 0.9123
F-statistic: 73.82 on 6 and 36 DF, p-value: < 2.2e-16

Just like univariate regression, we examine the p-value of the F-statistic to verify that at least one of the coefficients is not zero; indeed, the p-value is highly significant. We also have significant p-values for the OPRC and OPSLAKE parameters. Interestingly, OPBPC is not significant despite being highly correlated with the response variable. In short, when we control for the other OP features, OPBPC no longer explains any meaningful variation in the response, which is to say that the feature OPBPC adds nothing from a statistical standpoint with OPRC and OPSLAKE in the model.
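If you want to see this for yourself, compare OPBPC's relationship with BSAAM on its own with its p-value in the full model. A quick check, not part of the original analysis, might look like this:

    > cor(socal.water$OPBPC, socal.water$BSAAM) #strong correlation with the response on its own
    > summary(lm(BSAAM ~ OPBPC, data = socal.water)) #significant in a univariate fit, unlike above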

With the first model built, let's move on to best subsets. We create the sub.fit object using the regsubsets() function of the leaps package as follows:

    > sub.fit <- regsubsets(BSAAM ~ ., data = socal.water)

Then we create the best.summary object to examine the models further. As with all R objects, you can use the names() function to list what outputs are available:

    > best.summary <- summary(sub.fit)

> names(best.summary)
[1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"

Other valuable functions in model selection include which.min() and which.max(). These functions identify the model with the minimum or maximum value of a given criterion, respectively, as shown in the following code snippet:

    > which.min(best.summary$rss)
[1] 6

The code tells us that the model with six features has the smallest RSS, which it should, as that is the maximum number of inputs, and more inputs mean a lower RSS. An important point here is that adding features will always decrease RSS! Furthermore, it will always increase R-squared. We could add a completely irrelevant feature, such as the number of wins for the Los Angeles Lakers, and RSS would decrease and R-squared would increase. The amount would likely be minuscule, but present nonetheless. As such, we need an effective method to properly select the relevant features.
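A quick experiment drives this point home. Here, I substitute a vector of random noise for the Lakers' win total; this is a purely hypothetical illustration, not part of the analysis:

    > set.seed(123) #for reproducibility
    > noise <- rnorm(nrow(socal.water)) #a completely irrelevant feature
    > summary(lm(BSAAM ~ ., data = socal.water))$r.squared #R-squared with the real features
    > summary(lm(BSAAM ~ . + noise, data = socal.water))$r.squared #slightly higher, but meaningless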

For feature selection, there are four statistical methods that we will talk about in this chapter: Akaike's Information Criterion (AIC), Mallow's Cp (Cp), the Bayesian Information Criterion (BIC), and the adjusted R-squared. With the first three, the goal is to minimize the value of the statistic; with adjusted R-squared, the goal is to maximize the statistic's value. The purpose of these statistics is to create as parsimonious a model as possible; in other words, to penalize model complexity.

The formulation of these four statistics is as follows: 
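One common presentation of these criteria (essentially that of James et al., 2013, up to constants that do not change which model is selected) is given below, where n is the number of observations, d is the number of features in the candidate model, RSS and TSS are the residual and total sums of squares, and $\hat{\sigma}^2$ is an estimate of the error variance:

$$C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right)$$

$$\mathrm{AIC} = \frac{1}{n\hat{\sigma}^2}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right)$$

$$\mathrm{BIC} = \frac{1}{n\hat{\sigma}^2}\left(\mathrm{RSS} + \log(n)\,d\,\hat{\sigma}^2\right)$$

$$\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)}$$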

In a linear model, AIC and Cp are proportional to each other, so we will only concern ourselves with Cp, which is the criterion available in the output of the leaps package. BIC tends to select models with fewer variables than Cp, so we will compare both. To do so, we can create and analyze two plots side by side. Let's do this for Cp, followed by BIC, with the help of the following code snippet:

    > par(mfrow = c(1,2))

> plot(best.summary$cp, xlab = "number of features", ylab = "cp")


> plot(sub.fit, scale = "Cp")

The output of the preceding code snippet is as follows:

In the plot on the left-hand side, the model with three features has the lowest Cp. The plot on the right-hand side displays the features that provide the lowest Cp. The way to read this plot is to select the lowest Cp value at the top of the y axis, which is 1.2, and then move to the right and look at the colored blocks corresponding to the x axis. Doing this, we see that APSLAKE, OPRC, and OPSLAKE are the features included in this specific model. By using the which.min() and which.max() functions, we can identify how Cp compares to BIC and the adjusted R-squared:

    > which.min(best.summary$bic)
[1] 3

> which.max(best.summary$adjr2)
[1] 3

In this example, BIC and adjusted R-squared match the Cp for the optimal model. Now, just as with univariate regression, we need to examine the model and test the assumptions. We'll do this by creating a linear model object and examining the plots much as we did earlier, as follows:

    > best.fit <- lm(BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = socal.water)


> summary(best.fit)
Call:
lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE)

Residuals:
   Min     1Q Median     3Q    Max 
-12964  -5140  -1252   4446  18649 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  15424.6     3638.4   4.239 0.000133 ***
APSLAKE       1712.5      500.5   3.421 0.001475 ** 
OPRC          1797.5      567.8   3.166 0.002998 ** 
OPSLAKE       2389.8      447.1   5.346 4.19e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7284 on 39 degrees of freedom
Multiple R-squared: 0.9244, Adjusted R-squared: 0.9185
F-statistic: 158.9 on 3 and 39 DF, p-value: < 2.2e-16

With the three-feature model, the F-statistic and all of the t-tests have significant p-values. Having passed the first test, we can produce our diagnostic plots:

    > par(mfrow = c(2,2))

> plot(best.fit)

The output of the preceding code snippet is as follows:

Looking at the plots, it seems safe to assume that the residuals have a constant variance and are normally distributed. There is nothing in the leverage plot that would indicate a requirement for further investigation.

To investigate the issue of collinearity, we can call on the Variance Inflation Factor (VIF) statistic. The VIF is the ratio of the variance of a feature's coefficient when fitting the full model divided by the variance of that coefficient when the feature is fitted on its own. The formula is 1 / (1 - R2i), where R2i is the R-squared obtained by regressing our feature of interest, i, on all of the other features. The minimum value that the VIF can take is 1, which means no collinearity at all. There are no hard and fast rules, but in general, a VIF value that exceeds 5 (or some say 10) indicates a problematic amount of collinearity (James, p.101, 2013). A precise cut-off is difficult to select, because there is no hard statistical threshold at which multicollinearity makes your model unacceptable.
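To make the formula concrete, here is a quick hand calculation (my own check, not the approach we will use) of the VIF for OPRC in the three-feature model:

    > r2.oprc <- summary(lm(OPRC ~ APSLAKE + OPSLAKE, data = socal.water))$r.squared
    > 1 / (1 - r2.oprc) #should match the value that vif() reports for OPRC below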

The vif() function in the car package is all that is needed to produce the values, as can be seen in the following code snippet:

    > vif(best.fit)

APSLAKE OPRC OPSLAKE
1.011499 6.452569 6.444748

It shouldn't be surprising that we have a potential collinearity problem with OPRC and OPSLAKE (values greater than 5) based on the correlation analysis. A plot of the two variables drives the point home, as seen in the following screenshot:

    > plot(socal.water$OPRC, socal.water$OPSLAKE, xlab = "OPRC", ylab = "OPSLAKE")

The output of the preceding command is as follows:

The simple solution to address collinearity is to drop one of the offending variables, provided doing so removes the problem without compromising predictive ability. If we look at the adjusted R-squared values from best subsets, we can see that the two-variable model of APSLAKE and OPSLAKE produced a value of 0.90, while adding OPRC only marginally increased it to 0.92:

    > best.summary$adjr2 #adjusted r-squared values
[1] 0.8777515 0.9001619 0.9185369 0.9168706 0.9146772 0.9123079

Let's have a look at the two-variable model and test its assumptions:

    > fit.2 <- lm(BSAAM ~ APSLAKE + OPSLAKE, data = socal.water)


> summary(fit.2)

Call:
lm(formula = BSAAM ~ APSLAKE + OPSLAKE)

Residuals:
     Min       1Q   Median       3Q      Max 
-13335.8  -5893.2   -171.8   4219.5  19500.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  19144.9     3812.0   5.022  1.1e-05 ***
APSLAKE       1768.8      553.7   3.194  0.00273 ** 
OPSLAKE       3689.5      196.0  18.829  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8063 on 40 degrees of freedom
Multiple R-squared: 0.9049, Adjusted R-squared: 0.9002
F-statistic: 190.3 on 2 and 40 DF, p-value: < 2.2e-16


> par(mfrow=c(2,2))

> plot(fit.2)

The output of the preceding code snippet is as follows:

The model is significant, and the diagnostics do not seem to be a cause for concern. This should take care of our collinearity problem as well and we can check that using the vif() function again:

    > vif(fit.2)

APSLAKE OPSLAKE
1.010221 1.010221

As I stated previously, I don't believe the plot of fitted values versus residuals is of concern, but if you have any doubts, you can formally test the assumption of constant error variance in R. This test is known as the Breusch-Pagan (BP) test. For this, we need to load the lmtest package and run one line of code. The BP test has the null hypothesis that the error variance is constant (homoscedasticity) versus the alternative that it is not (heteroscedasticity):

    > library(lmtest)

> bptest(fit.2)

studentized Breusch-Pagan test

data: fit.2
BP = 0.0046, df = 2, p-value = 0.9977

With a p-value of 0.9977, we do not have evidence to reject the null hypothesis of constant error variance. The BP = 0.0046 value in the test summary is the chi-squared test statistic.

All things considered, it appears that the best predictive model uses the two features APSLAKE and OPSLAKE. The model can explain 90 percent of the variation in the stream runoff volume. A forecast of the runoff is equal to 19,145 (the intercept) plus 1,769 times the measurement at APSLAKE plus 3,690 times the measurement at OPSLAKE.
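If you would rather have R apply this equation for you, predict() will do it. Here is a quick sketch using made-up snowpack measurements; the values of 10 and 15 are purely hypothetical:

    > new.obs <- data.frame(APSLAKE = 10, OPSLAKE = 15) #hypothetical measurements
    > predict(fit.2, newdata = new.obs, interval = "prediction")

A scatterplot of the Predicted vs. Actual values can be done in base R using the fitted values from the model and the response variable values as follows: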

    > plot(fit.2$fitted.values, socal.water$BSAAM, xlab = "predicted", ylab = "actual", main = "Predicted vs. Actual")

The output of the preceding code snippet is as follows:

Although informative, the base graphics of R are not necessarily ready for a presentation to be made to business partners. However, we can easily spruce up this plot in R. Several packages are available to improve graphics; for this example, I will use ggplot2. Before producing the plot, we must put the predicted values into our data frame, socal.water. I also want to rename BSAAM as Actual and put the predicted values in a new vector, Forecast, within the data frame, as shown in the following code snippet:

    > socal.water["Actual"] = water$BSAAM #create the vector Actual

> socal.water$Forecast = predict(fit.2) #populate Forecast with the predicted values

Next we will load the ggplot2 package and produce a nicer graphic with just one line of code:

    > library(ggplot2)

> ggplot(socal.water, aes(x = Forecast, y = Actual)) + geom_point() + geom_smooth(method = lm) + labs(title = "Forecast versus Actuals")

The output is as follows:

Let's examine one final model selection technique before moving on. In the upcoming chapters, we will be discussing cross-validation at some length. Cross-validation is a widely used and effective method of model selection and testing. Why is this necessary at all? It comes down to the bias-variance trade-off. Professor Tarpey of Wright State University has a nice quote on the subject:

"Often we use regression models to predict future observations. We can use our data to fit the model. However, it is cheating to then assess how well the model predicts responses using the same data that was used to estimate the model - this will tend to give overly optimistic results in terms of how well a model is able to predict future observations. If we leave out an observation, fit the model and then predict the left out response, then this will give a less biased idea of how well the model predicts."

The cross-validation technique discussed by Professor Tarpey in the preceding quote is known as the Leave-One-Out Cross-Validation (LOOCV). In linear models, you can easily perform an LOOCV by examining the Prediction Error Sum of Squares (PRESS) statistic and selecting the model that has the lowest value. The R library MPV will calculate the statistic for you, as shown in the following code:

    > library(MPV)  

> PRESS(best.fit)
[1] 2426757258

> PRESS(fit.2)
[1] 2992801411

By this statistic alone, we could select our best.fit model. However, as described previously, I still believe that the more parsimonious model is better in this case. You can build a simple function to calculate the statistic on your own, taking advantage of some elegant matrix algebra as shown in the following code:

    > PRESS.best = sum((resid(best.fit)/(1 - hatvalues(best.fit)))^2)


> PRESS.fit.2 = sum((resid(fit.2)/(1 - hatvalues(fit.2)))^2)


> PRESS.best
[1] 2426757258

> PRESS.fit.2
[1] 2992801411

"What are hatvalues?" you might ask. Well, if we take our linear model Y = B0 + B1x + e, we can write it in matrix notation as Y = XB + E. In this notation, Y remains unchanged, X is the matrix of input values, B is the vector of coefficients, and E represents the errors. The regression process solves for B and, without going into the painful details of matrix multiplication, yields what is known as the Hat Matrix. This matrix maps, or as some say projects, the observed values of the response onto the fitted values produced by the model; as a result, its diagonal elements, the hatvalues, capture how influential each observation is in your model. So, summing the squared residuals, each divided by 1 minus its hatvalue, gives the same result as LOOCV.