Univariate linear regression

We begin by looking at a simple way to predict a quantitative response, Y, with one predictor variable, x, assuming that Y has a linear relationship with x. The model for this can be written as follows:

We can state it as the expected value of Y is a function of the parameters  (the intercept) plus  (the slope) times x, plus an error term e. The least squares approach chooses the model parameters that minimize the Residual Sum of Squares (RSS) of the predicted y values versus the actual Y values. For a simple example, let's say we have the actual values of Y1 and Y2 equal to 10 and 20 respectively, along with the predictions of y1 and y2 as 12 and 18. To calculate RSS, we add the squared differences:

This, with simple substitution, yields the following:

 

Before we begin with an application, I want to point out that if you read the headlines of various research breakthroughs, you should do so with a jaded eye and a skeptical mind as the conclusion put forth by the media may not be valid. As we shall see, Rand any other software, for that matter—will give us a solution regardless of the input. However, just because the math makes sense and a high correlation or R-squared statistic is reported doesn't mean that the conclusion is valid.

To drive this point home, let's have a look at the famous Anscombe dataset, which is available in R. The statistician Francis Anscombe produced this set to highlight the importance of data visualization and outliers when analyzing data. It consists of four pairs of X and Y variables that have the same statistical properties but when plotted show something very different. I've used the data to train colleagues and to educate business partners on the hazards of fixating on statistics without exploring the data and checking assumptions. I think this is an excellent place to start should you have a similar need. It's a brief digression before moving on to serious modeling:

> #call up and explore the data

> data(anscombe)

> attach(anscombe)

> anscombe
x1 x2 x3 x4 y1 y2 y3 y4
1 10 10 10 8 8.04 9.14 7.46 6.58
2 8 8 8 8 6.95 8.14 6.77 5.76
3 13 13 13 8 7.58 8.74 12.74 7.71
4 9 9 9 8 8.81 8.77 7.11 8.84
5 11 11 11 8 8.33 9.26 7.81 8.47
6 14 14 14 8 9.96 8.10 8.84 7.04
7 6 6 6 8 7.24 6.13 6.08 5.25
8 4 4 4 19 4.26 3.10 5.39 12.50
9 12 12 12 8 10.84 9.13 8.15 5.56
10 7 7 7 8 4.82 7.26 6.42 7.91
11 5 5 5 8 5.68 4.74 5.73 6.89

As we shall see, each of the pairs has the same correlation coefficient: 0.816. The first two are as follows:

> cor(x1, y1) #correlation of x1 and y1
[1] 0.8164205

> cor(x2, y2) #correlation of x2 and y2

[1] 0.8164205

The real insight here, as Anscombe intended, is when we plot all four pairs together, as follows:

> par(mfrow = c(2,2)) #create a 2x2 grid for plotting

> plot(x1, y1, main = "Plot 1")

> plot(x2, y2, main = "Plot 2")

> plot(x3, y3, main = "Plot 3")

> plot(x4, y4, main = "Plot 4")
Downloading the example code
You can download the example code files for all Packt books you've purchased from your account at http://www.packtpub.com. If you bought this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you.

The output of the preceding code is as follows:

We can see the following:

  • Plot 1 appears to have a true linear relationship
  • Plot 2 is curvilinear, Plot 3 has a dangerous outlier
  • Plot 4 is driven by one outlier

There you have it: a cautionary tale about the dangers of solely relying on correlation.