Understanding supervised learning with multiple linear regression

In the previous chapter, we followed an example of linear regression using two variables. It is interesting to see how we can apply regression to more than two variables (called multiple linear regression) and extract useful information from the results.

Suppose that you are asked to test whether there exists a hidden policy of gender discrimination in a company. You could be working for a law firm that is leading a trial against this company, and they need data-based evidence to back up their claim.

You would start by taking a sample of the company's payroll, including several variables that describe each employee and the last salary increase amount. The following screenshot shows a set of values after they've been entered in an Excel worksheet:

There are four numerical features in the dataset:

  • ID: The employee identification, which is not relevant to our analysis
  • Score: The result of the last employee's performance evaluation 
  • Years in company: Years that the employee has worked in the company
  • Salary increase: Amount in dollars of the last salary increase

The remaining two are categorical:

  • Gender: Male (M) or Female (F)
  • Division: In which part of the company the employee works

Categorical values need to be encoded before being used in a model. The final data table is as follows:

The one-hot encoding is easily obtained by applying standard Excel functions. Assuming B2 is the first cell containing the gender classification, we can enter =IF(B2="F";1;0) in cell B21 and copy this value to all cells down to B37.

Depending on which character is defined in the Windows list separator option, you should either use a comma ( ,) or a semi-colon ( ;) in formulas.

To encode the employee's division, we use one-hot encoding (refer to Chapter 1Implementing Machine Learning Algorithms, for a detailed explanation) and create three new variables: IsProduction?, IsResearch?, and IsSales?. We can use Excel functions to calculate the encoding if E2 is the first row containing the Division data, then we can use the =IF(E2="Production";1;0)=IF(E2="Research";1;0), and =IF(E2="Sales";1;0) functions in cells E21, F21, and G21, respectively, and then copy them column-wise down to cells E37, F37, and G37.

Before trying to use regression on the full dataset, we can try some feature engineering. Let's see how well we can predict the salary increase based on which Division each employee works. This will give us an idea of how much the Salary Increase target variable correlates with Division (there will be more details about correlations between variables in Chapter 5, Correlations and the Importance of Variables).

Let's follow some simple steps to use the built-in regression tool:

  1. Navigate to Data.
  2. Click on Data Analysis, as shown in the following screenshot:
  1. Select Regression, as shown in the following screenshot:
  1. As the Input Y Range, select the Salary data and as the Input X Range, select the three Division columns:

The results show R2 = 0.1, meaning that only 10% of the salary increase is related or can be explained by the fact that the employee belongs to a given division. We can therefore discard these columns as input and concentrate on the rest. 

We repeat the regression, now choosing the X values as the columns Gender, Scoreand Years in company.

The results are quite different now, with R2 close to 0.85, meaning that 85% of the salary increase values are explained by the chosen variables.

How important is Gender? Taking a look at the P-value coefficients that Excel gives us, in the following table, we can see that, according to the P-value associated with the input variables, the most important one is gender, followed by the score and the number of years in the company. It is then clear that gender plays an important role when deciding a salary increase, and we have evidence to prove that the company policy is not gender neutral:

 

The output results of the regression tell us how well we can explain the data sample, but cannot give us an accurate measure of how the model will predict a salary increase. To explore this, we should do the following:

  • Obtain a different sample of the payroll (in our case, we could generate new data by hand)
  • Use the coefficients in the previous table to build an expression and calculate the predicted salary increase given the input variables
  • Compare the predicted and real values using root mean square error, as explained in Chapter 1Implementing Machine Learning Algorithms

Let's see if you can finish this exercise; I am hoping that the basic information that's been provided to you carry this out has been understood

We have shown how to perform a multiple linear regression in data to extract interesting insights from them. Let's continue with another important machine learning model: decision trees.