Business understanding

Our first case focuses on the goal of predicting the water yield (in inches) of the Snake River Watershed in Wyoming, USA, as a function of the water content of the year's snowfall. This forecast will be useful in managing the water flow and reservoir levels as the Snake River provides much-needed irrigation water for the farms and ranches of several western states. The snake dataset is available in the alr3 package (note that alr stands for applied linear regression):

    > install.packages("alr3")
> library(alr3)
> data(snake)
> dim(snake)
[1] 17 2
> head(snake)
X Y
1 23.1 10.5
2 32.8 16.7
3 31.8 18.2
4 32.0 17.0
5 30.4 16.3
6 24.0 10.5

Now that we have 17 observations, data exploration can begin. But first, let's change X and Y to meaningful variable names, as follows:

    > names(snake) <- c("content", "yield")
> attach(snake) # attach data with new names
> head(snake)

content yield
1 23.1 10.5
2 32.8 16.7
3 31.8 18.2
4 32.0 17.0
5 30.4 16.3
6 24.0 10.5

> plot(content, yield, xlab = "water content of
snow", ylab = "water yield")

The output of the preceding code is as follows:

This is an interesting plot as the data is linear and has a slight curvilinear shape driven by two potential outliers at both ends of the extreme. As a result, transforming the data or deleting an outlying observation may be warranted.

To perform a linear regression in R, one uses the lm() function to create a model in the standard form of fit = lm(Y ~ X). You can then test your assumptions using various functions on your fitted model by using the following code:

    > yield.fit <- lm(yield ~ content)

> summary(yield.fit)

Call:
lm(formula = yield ~ content)

Residuals:
Min 1Q Median 3Q Max
-2.1793 -1.5149 -0.3624 1.6276 3.1973

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.72538 1.54882 0.468 0.646
content 0.49808 0.04952 10.058 4.63e-08
***

---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05
'.' 0.1 ' ' 1


Residual standard error: 1.743 on 15 degrees of
freedom

Multiple R-squared: 0.8709, Adjusted R-squared:
0.8623

F-statistic: 101.2 on 1 and 15 DF, p-value:
4.632e-08

With the summary() function, we can examine a number of items including the model specification, descriptive statistics about the residuals, the coefficients, codes to model significance, and a summary on model error and fit. Right now, let's focus on the parameter coefficient estimates, see if our predictor variable has a significant p-value, and if the overall model F-test has a significant p-value. Looking at the parameter estimates, the model tells us that the yield is equal to 0.72538 plus 0.49808 times the content. It can be stated that, for every 1 unit change in the content, the yield will increase by 0.49808 units. The F-statistic is used to test the null hypothesis that the model coefficients are all 0.

Since the p-value is highly significant, we can reject the null and move on to the t-test for content, which tests the null hypothesis that it is 0. Again, we can reject the null. Additionally, we can see Multiple R-squared and Adjusted R-squared values. Adjusted R-squared will be covered under the multivariate regression topic, so let's zero in on Multiple R-squared; here we see that it is 0.8709. In theory, it can range from 0 to 1 and is a measure of the strength of the association between X and Y. The interpretation in this case is that 87 percent of the variation in the water yield can be explained by the water content of snow. On a side note, R-squared is nothing more than the correlation coefficient of [X, Y] squared.

We can recall our scatterplot and now add the best fit line produced by our model using the following code:

    > plot(content, yield)

> abline(yield.fit, lwd=3, col="red")

The output of the preceding code is as follows:

A linear regression model is only as good as the validity of its assumptions, which can be summarized as follows:

  • Linearity: This is a linear relationship between the predictor and the response variables. If this relationship is not clearly present, transformations (log, polynomial, exponent, and so on) of X or Y may solve the problem.
  • Non-correlation of errors: A common problem in time series and panel data where en = betan-1; if the errors are correlated, you run the risk of creating a poorly specified model.
  • Homoscedasticity: Normally the distributed and constant variance of errors, which means that the variance of errors is constant across different values of inputs. Violations of this assumption can create biased coefficient estimates, leading to statistical tests for significance that can be either too high or too low. This, in turn, leads to a wrong conclusion. This violation is referred to as heteroscedasticity.
  • No collinearity: No linear relationship between two predictor variables, which is to say that there should be no correlation between the features. This, again, can lead to biased estimates.
  • Presence of outliers: Outliers can severely skew the estimation, and ideally they must be removed prior to fitting a model using linear regression; As we saw in the Anscombe example, this can lead to a biased estimate.

As we are building a univariate model independent of time, we will concern ourselves only with linearity and heteroscedasticity. The other assumptions will become important in the next section. The best way to initially check the assumptions is by producing plots. The plot() function, when combined with a linear model fit, will automatically produce four plots allowing you to examine the assumptions. R produces the plots one at a time and you advance through them by hitting the Enter key. It is best to examine all four simultaneously and we do it in the following manner:

    > par(mfrow = c(2,2))

> plot(yield.fit)

The output of the preceding code is as follows:

The two plots on the left allow us to examine the homoscedasticity of errors and non-linearity. What we are looking for is some type of pattern or, more importantly, that no pattern exists. Given the sample size of only 17 observations, nothing obvious can be seen. Common heteroscedastic errors will appear to be u-shaped, inverted u-shaped, or clustered close together on the left of the plot. They will become wider as the fitted values increase (a funnel shape). It is safe to conclude that no violation of homoscedasticity is apparent in our model.

The Normal Q-Q plot in the upper-right corner helps us to determine if the residuals are normally distributed. The Quantile-Quantile (Q-Q) represents the quantile values of one variable plotted against the quantile values of another. It appears that the outliers (observations 7, 9, and 10), may be causing a violation of the assumption. The Residuals vs Leverage plot can tell us what observations, if any, are unduly influencing the model; in other words, if there are any outliers we should be concerned about. The statistic is Cook's distance or Cook's D, and it is generally accepted that a value greater than 1 should be worthy of further inspection.

What exactly is further inspection? This is where art meets science. The easy way out would be to simply delete the observation, in this case number 9, and redo the model. However, a better option may be to transform the predictor and/or the response variables. If we just delete observation 9, then maybe observations 10 and 13 would fall outside the band for greater than 1. I believe that this is where domain expertise can be critical. More times than I can count, I have found that exploring and understanding outliers can yield valuable insights. When we first examined the previous scatterplot, I pointed out the potential outliers and these happen to be observations number 9 and number 13. As an analyst, it would be critical to discuss with the appropriate subject matter experts to understand why this is the case. Is it a measurement error? Is there a logical explanation for these observations? I certainly don't know, but this is an opportunity to increase the value that you bring to an organization.

Having said that, we can drill down on the current model by examining, in more detail, the Normal Q-Q plot. R does not provide confidence intervals to the default Q-Q plot, and given our concerns in looking at the base plot, we should check the confidence intervals. The qqPlot() function of the car package automatically provides these confidence intervals. Since the car package is loaded along with the alr3 package, I can produce the plot with one line of code:

    > qqPlot(yield.fit)

The output of the preceding code is as follows:

According to the plot, the residuals are normally distributed. I think this can give us the confidence to select the model with all the observations. A clear rationale and judgment would be needed to attempt other models. If we could clearly reject the assumption of normally distributed errors, then we would probably have to examine the variable transformations and/or observation deletion.