Qualitative features

A qualitative feature, also referred to as a factor, can take on two or more levels, such as Male/Female or Bad/Neutral/Good. If we have a feature with two levels, say gender, then we can create what is known as an indicator or dummy feature, arbitrarily assigning one level as 0 and the other as 1. If we create a model with just the indicator, our linear model would still follow the same formulation as before, that is, Y = B0 + B1x + e. If we code the feature as male being equal to 0 and female equal to 1, then the expectation for male would just be the intercept B0 (since x = 0), while for female it would be B0 + B1 (since x = 1); B1 is therefore the difference in the expected response between the two levels. In the situation where the feature has more than two levels, say n, you create n-1 indicators; so, for three levels you would have two indicators, and the level without an indicator serves as the baseline. If you created as many indicators as levels in a model that also includes an intercept, you would fall into the dummy variable trap, which results in perfect multicollinearity.
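The indicator coding and the dummy variable trap described above can be checked directly in base R. The following is a minimal sketch on a made-up factor; the object names quality, X, and X_trap and the data values are assumptions chosen for illustration and are not part of the Carseats example:

```r
# Hypothetical toy factor with three levels; the values are illustrative only.
quality <- factor(c("Bad", "Neutral", "Good", "Good", "Bad"),
                  levels = c("Bad", "Neutral", "Good"))

# Default treatment coding: an intercept plus n - 1 = 2 indicators.
# "Bad" has no indicator of its own and acts as the baseline.
X <- model.matrix(~ quality)
print(X)

# The trap: an intercept plus one indicator per level. The three
# indicator columns sum to the intercept column, so the design
# matrix is rank deficient (perfect multicollinearity).
X_trap <- cbind(Intercept = 1, model.matrix(~ quality - 1))

qr(X)$rank       # 3, the same as ncol(X)
qr(X_trap)$rank  # 3, although ncol(X_trap) is 4
```

With the rank-deficient matrix, lm() would not be able to estimate all four coefficients and would report one of them as NA, which is why only n-1 indicators are used alongside an intercept.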

We can examine a simple example to learn how to interpret the output. Let's load the ISLR package and build a model with the Carseats dataset using the following code snippet:

    > library(ISLR)
    > data(Carseats)
    > str(Carseats)
    'data.frame': 400 obs. of 11 variables:
     $ Sales      : num 9.5 11.22 10.06 7.4 4.15 ...
     $ CompPrice  : num 138 111 113 117 141 124 115 136 132 132 ...
     $ Income     : num 73 48 35 100 64 113 105 81 110 113 ...
     $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
     $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
     $ Price      : num 120 83 80 97 128 72 108 120 124 124 ...
     $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
     $ Age        : num 42 65 59 55 38 78 71 67 76 76 ...
     $ Education  : num 17 10 12 14 13 16 15 10 10 17 ...
     $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
     $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...

For this example, we will predict the sales of car seats using just Advertising, a quantitative feature, and ShelveLoc, a qualitative feature with three levels: Bad, Good, and Medium. With factors, R will automatically code the indicators for the analysis. We build and analyze the model as follows:

    > sales.fit <- lm(Sales ~ Advertising + ShelveLoc, data = Carseats)
    > summary(sales.fit)

    Call:
    lm(formula = Sales ~ Advertising + ShelveLoc, data = Carseats)

    Residuals:
        Min      1Q  Median      3Q     Max
    -6.6480 -1.6198 -0.0476  1.5308  6.4098

    Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
    (Intercept)      4.89662    0.25207  19.426  < 2e-16 ***
    Advertising      0.10071    0.01692   5.951 5.88e-09 ***
    ShelveLocGood    4.57686    0.33479  13.671  < 2e-16 ***
    ShelveLocMedium  1.75142    0.27475   6.375 5.11e-10 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 2.244 on 396 degrees of freedom
    Multiple R-squared:  0.3733, Adjusted R-squared:  0.3685
    F-statistic: 78.62 on 3 and 396 DF,  p-value: < 2.2e-16

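Bad is the baseline level, so it has no coefficient of its own and is absorbed into the intercept of 4.89662. The ShelveLocGood coefficient of 4.57686 is the expected difference in sales between a good and a bad shelving location at the same level of advertising. With no advertising, the estimate of sales for a good location is therefore 4.89662 + 4.57686 = 9.47, almost double that of a bad location. To see how R codes the indicator features, you can use the contrasts() function: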

    > contrasts(Carseats$ShelveLoc)
           Good Medium
    Bad       0      0
    Good      1      0
    Medium    0      1
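To connect this coding back to the fitted model, the expected sales for each shelving location can be worked out by hand. The following is a rough sketch that copies the coefficients from the summary output rather than refitting the model, so it runs without the ISLR package; the object names b0, b_adv, b_loc, shelf, coding, and advertising are assumptions introduced for illustration:

```r
# Coefficients copied from the summary(sales.fit) output.
b0    <- 4.89662                               # intercept (baseline: Bad)
b_adv <- 0.10071                               # slope for Advertising
b_loc <- c(Good = 4.57686, Medium = 1.75142)   # indicator coefficients

# A stand-in factor with the same three levels; by default R applies the
# same treatment coding to it as it does to Carseats$ShelveLoc.
shelf  <- factor(c("Bad", "Good", "Medium"))
coding <- contrasts(shelf)

# Each row of the coding matrix switches on the coefficient for its level.
advertising <- 0
expected <- b0 + b_adv * advertising + drop(coding %*% b_loc)
round(expected, 2)
# Bad 4.90, Good 9.47, Medium 6.65
```

Changing the advertising value shifts all three estimates by the same amount, since the model has no interaction between Advertising and ShelveLoc; the gap of 4.57686 between Good and Bad stays constant.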