Simple Regression Using TensorFlow

This section will explain, step by step, how to successfully tackle a regression problem. You will learn how to take a preliminary look at the dataset to understand its most important properties, as well as how to prepare it to be used during training, validation, and inference. Then, a deep neural network will be built from scratch using TensorFlow via the Keras API. This model will then be trained and its performance evaluated.

In a regression problem, the aim is to predict a continuous output value, such as a price or a probability. In this exercise, the classic Auto MPG dataset will be used and a deep neural network will be trained on it to accurately predict car fuel efficiency, using no more than the following seven features: Cylinders, Displacement, Horsepower, Weight, Acceleration, Model Year, and Origin.

The dataset can be thought of as a table with eight columns (seven features, plus one target value) and as many rows as there are instances in the dataset. As per the best practices we looked at in the previous sections, it will be divided as follows: 20% of the total number of instances will form the test set, while the remaining ones will be split again into training and validation sets with an 80:20 ratio.
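
A minimal sketch of the resulting split sizes (assuming roughly 392 usable instances; the Auto MPG dataset has 398 rows, a handful of which contain missing values and will be dropped) is as follows:

    n_total = 392                          # usable instances after cleaning
    n_test = n_total - int(0.8 * n_total)  # ~20% held out for testing
    n_remaining = int(0.8 * n_total)       # rows left for training + validation
    n_train = int(0.8 * n_remaining)       # ~80% of the remainder for training
    n_val = n_remaining - n_train          # ~20% of the remainder for validation
    print(n_train, n_val, n_test)          # roughly 250, 63, 79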

As a first step, the training set will be inspected for missing values, and cleaned if needed. Then, a chart showing variable correlation will be plotted. The only categorical variable present will be converted into numerical form via one-hot encoding. Finally, all the features will be normalized.

The deep learning model will then be created. A three-layered fully connected architecture will be used: the first and the second layer will have 64 nodes, while the last one, being the output layer of a regression problem, will have only one node.

Standard choices for the loss function (mean squared error) and optimizer (RMSprop) will be applied. Training will then be performed with and without early stopping to highlight the different effects they have on training and validation loss.
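
As a quick illustration of these quantities on hypothetical toy values (not part of the exercise), mean squared error penalizes large deviations quadratically, while mean absolute error, which will also be tracked as a metric, reports the average deviation in the target's own units (MPG here):

    import numpy as np
    y_true = np.array([22.0, 30.0, 18.0])   # hypothetical true MPG values
    y_pred = np.array([20.5, 31.0, 19.5])   # hypothetical model predictions
    mse = np.mean((y_true - y_pred) ** 2)   # ~1.83
    mae = np.mean(np.abs(y_true - y_pred))  # ~1.33 MPG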

Finally, the model will be applied to the test set to evaluate its performance and make predictions.

Exercise 3.05: Creating a Deep Neural Network to Predict the Fuel Efficiency of Cars

In this exercise, we will build, train, and measure the performance of a deep neural network model that predicts car fuel efficiency using only seven car features: Cylinders, Displacement, Horsepower, Weight, Acceleration, Model Year, and Origin.

The step-by-step procedure for this is as follows:

  1. Import all the required modules and print the versions of the most important ones:

    from __future__ import absolute_import, division, \
                           print_function, unicode_literals
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import seaborn as sns
    import tensorflow as tf
    print("TensorFlow version: {}".format(tf.__version__))

    The output will be as follows:

    TensorFlow version: 2.1.0

  2. Import the Auto MPG dataset, read it with pandas, and show the last five rows:

    dataset_path = tf.keras.utils.get_file("auto-mpg.data", \
                   "https://raw.githubusercontent.com/"\
                   "PacktWorkshops/"\
                   "The-Reinforcement-Learning-Workshop/master/"\
                   "Chapter03/Dataset/auto-mpg.data")
    column_names = ['MPG','Cylinders','Displacement','Horsepower',\
                    'Weight', 'Acceleration', 'Model Year', 'Origin']
    raw_dataset = pd.read_csv(dataset_path, names=column_names,\
                              na_values = "?", comment='\t',\
                              sep=" ", skipinitialspace=True)
    dataset = raw_dataset.copy()
    dataset.tail()

    Note

    Watch out for the slashes in the preceding string. Remember that the backslashes ( \ ) are used to split the code across multiple lines, while the forward slashes ( / ) are part of the URL.

    The output will be as follows:

    Figure 3.12: Last five rows of the dataset imported in pandas

  3. Let's clean the data of unknown values. Check how many missing (not available) values are present and in which columns:

    dataset.isna().sum()

    This produces the following output:

    MPG             0
    Cylinders       0
    Displacement    0
    Horsepower      6
    Weight          0
    Acceleration    0
    Model Year      0
    Origin          0
    dtype: int64

  4. Given the small number of rows with unknown values, simply drop them:

    dataset = dataset.dropna()

  5. Use one-hot encoding for the Origin variable, which is categorical:

    dataset['Origin'] = dataset['Origin']\
                        .map({1: 'USA', 2: 'Europe', 3: 'Japan'})
    dataset = pd.get_dummies(dataset, prefix='', prefix_sep='')
    dataset.tail()

    The output will be as follows:

    Figure 3.13: Last five rows of the dataset imported into pandas using one-hot encoding
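
    To make the encoding explicit, here is a tiny, self-contained illustration (on toy data, not the exercise dataset) of what pd.get_dummies does with the same arguments:

    import pandas as pd
    toy = pd.DataFrame({'Origin': ['USA', 'Europe', 'Japan', 'USA']})
    pd.get_dummies(toy, prefix='', prefix_sep='')
    # Each category becomes its own 0/1 column (ordered alphabetically):
    # Europe, Japan, and USA columns, with a single 1 per row marking the origin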

  6. Split the data into training and test sets with an 80:20 ratio:

    train_dataset = dataset.sample(frac=0.8, random_state=0)
    test_dataset = dataset.drop(train_dataset.index)

  7. Now, let's take a look at some training data statistics, that is, the joint distributions of some pairs of features from the training set, using the seaborn module. The pairplot command takes the features of the dataset as input and evaluates them pair by pair. Along the diagonal (where the pair consists of two instances of the same feature), it shows the distribution of the variable, while in the off-diagonal cells, it shows the scatterplot of the two features. This is useful if we wish to highlight correlations:

    sns.pairplot(train_dataset[["MPG", "Cylinders", "Displacement", \
                                "Weight"]], diag_kind="kde")

    This generates the following image:

    Figure 3.14: Joint distributions of some pairs of features from the training set

  8. Let's now take a look at the overall statistics:

    train_stats = train_dataset.describe()
    train_stats.pop("MPG")
    train_stats = train_stats.transpose()
    train_stats

    The output will be as follows:

    Figure 3.15: Overall training set statistics

  9. Split the features from the labels and normalize the data:

    train_labels = train_dataset.pop('MPG')
    test_labels = test_dataset.pop('MPG')
    def norm(x):
        return (x - train_stats['mean']) / train_stats['std']
    normed_train_data = norm(train_dataset)
    normed_test_data = norm(test_dataset)
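
    As an optional sanity check (not part of the original steps), each normalized training feature should now have a mean close to 0 and a standard deviation close to 1:

    normed_train_data.describe().transpose()[['mean', 'std']]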

  10. Now, let's create the model and look at its summary:

    def build_model():
        model = tf.keras.Sequential([
                tf.keras.layers.Dense(64, activation='relu',\
                                      input_shape=[len(train_dataset.keys())]),\
                tf.keras.layers.Dense(64, activation='relu'),\
                tf.keras.layers.Dense(1)])
        optimizer = tf.keras.optimizers.RMSprop(0.001)
        model.compile(loss='mse', optimizer=optimizer,\
                      metrics=['mae', 'mse'])
        return model

    model = build_model()
    model.summary()

    This generates the following output:

    Figure 3.16: Model summary
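
    As a rough cross-check of the summary, the parameter counts can be derived by hand, assuming nine input features (the original seven, with Origin expanded into three one-hot columns and MPG removed as the label):

    inputs = 9                           # features after one-hot encoding
    layer_1 = inputs * 64 + 64           # weights + biases of the first layer: 640
    layer_2 = 64 * 64 + 64               # second layer: 4,160
    output_layer = 64 * 1 + 1            # output layer: 65
    print(layer_1 + layer_2 + output_layer)   # 4,865 trainable parameters in total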

  11. Use the model's fit method to train the network for 1,000 epochs, using 20% of the training data as a validation set:

    epochs = 1000
    history = model.fit(normed_train_data, train_labels,\
                        epochs=epochs, validation_split = 0.2, \
                        verbose=2)

    This will produce a very long output. We will only report the last few lines here:

    Epoch 999/1000
    251/251 - 0s - loss: 2.8630 - mae: 1.0763 - mse: 2.8630 - val_loss: 10.2443 - val_mae: 2.3926 - val_mse: 10.2443
    Epoch 1000/1000
    251/251 - 0s - loss: 2.7697 - mae: 0.9985 - mse: 2.7697 - val_loss: 9.9689 - val_mae: 2.3709 - val_mse: 9.9689

  12. Visualize the training and validation metrics by plotting the MAE and MSE.

    The following snippet plots the MAE:

    hist = pd.DataFrame(history.history)
    hist['epoch'] = history.epoch
    plt.plot(hist['epoch'], hist['mae'])
    plt.plot(hist['epoch'], hist['val_mae'])
    plt.ylim([0, 10])
    plt.ylabel('MAE [MPG]')
    plt.legend(["Training", "Validation"])

    The output will be as follows:

    Figure 3.17: Mean absolute error plotted over the training epochs

    The preceding figure shows how increasing the training epochs causes the validation error to grow, meaning the system is experiencing an overfitting problem.

  13. Now, let's visualize the MSE using a plot:

    plt.plot(hist['epoch'], hist['mse'])
    plt.plot(hist['epoch'], hist['val_mse'])
    plt.ylim([0, 20])
    plt.ylabel('MSE [MPG^2]')
    plt.legend(["Training", "Validation"])

    The output will be as follows:

    Figure 3.18: Mean squared error plotted over the training epochs

    In this case too, the figure shows how increasing the number of training epochs causes the validation error to grow, meaning the system is experiencing an overfitting problem.

  14. Use Keras callbacks to add early stopping (with the patience parameter equal to 10 epochs) to avoid overfitting. First of all, build the model:

    model = build_model()

  15. Then, define an early stopping callback. This entity will be passed to the model.fit function and will be called at the end of every epoch to check whether the validation error has stopped decreasing for more than 10 consecutive epochs:

    early_stop = tf.keras.callbacks\
                 .EarlyStopping(monitor='val_loss', patience=10)

  16. Finally, call the fit method with the early stop callback:

    early_history = model.fit(normed_train_data, train_labels,\
                              epochs=epochs, validation_split=0.2,\
                              verbose=2, callbacks=[early_stop])

    The last few lines of the output are as follows:

    Epoch 42/1000
    251/251 - 0s - loss: 7.1298 - mae: 1.9014 - mse: 7.1298 - val_loss: 8.1151 - val_mae: 2.1885 - val_mse: 8.1151
    Epoch 43/1000
    251/251 - 0s - loss: 7.0575 - mae: 1.8513 - mse: 7.0575 - val_loss: 8.4124 - val_mae: 2.2669 - val_mse: 8.4124

  17. Visualize the training and validation metrics for the early stopping run. First, collect all the training history data and put it into a pandas DataFrame, for both the metric and epoch values:

    early_hist = pd.DataFrame(early_history.history)
    early_hist['epoch'] = early_history.epoch

  18. Then, plot the training and validation MAE against the epochs, limiting the maximum y-axis value to 10:

    plt.plot(early_hist['epoch'], early_hist['mae'])
    plt.plot(early_hist['epoch'], early_hist['val_mae'])
    plt.ylim([0, 10])
    plt.ylabel('MAE [MPG]')
    plt.legend(["Training", "Validation"])

    The preceding code will produce the following output:

    Figure 3.19: Mean absolute error plotted over the training epochs (early stopping)

    As demonstrated by the preceding figure, training is stopped as soon as the validation error stops decreasing, thereby avoiding overfitting.
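
    One detail worth knowing: by default, EarlyStopping leaves the model with the weights from the epoch at which training stopped, which is patience epochs after the best validation score. If you prefer to keep the best weights seen during training, Keras provides a restore_best_weights argument; a possible variant of the callback defined earlier would be:

    early_stop = tf.keras.callbacks\
                 .EarlyStopping(monitor='val_loss', patience=10,\
                                restore_best_weights=True)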

  19. Evaluate the model accuracy on the test set:

    loss, mae, mse = model.evaluate(normed_test_data, \
                                    test_labels, verbose=2)
    print("Testing set Mean Abs Error: {:5.2f} MPG".format(mae))

    The output will be as follows:

    78/78 - 0s - loss: 6.3067 - mae: 1.8750 - mse: 6.3067
    Testing set Mean Abs Error: 1.87 MPG

    Note

    The values reported may differ slightly between runs due to random weight initialization and data shuffling.
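
    If more repeatable runs are desired, one common approach (not used in this exercise) is to fix the random seeds before building and training the model, although some GPU operations may remain non-deterministic:

    np.random.seed(42)       # fixes NumPy/pandas-based randomness
    tf.random.set_seed(42)   # fixes TensorFlow weight initialization and shuffling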

  20. Finally, perform model inference by predicting all the MPG values for all test instances. Then, plot these values with respect to their true values so that you have a visual estimation of the model error:

    test_predictions = model.predict(normed_test_data).flatten()
    a = plt.axes(aspect='equal')
    plt.scatter(test_labels, test_predictions)
    plt.xlabel('True Values [MPG]')
    plt.ylabel('Predictions [MPG]')
    lims = [0, 50]
    plt.xlim(lims)
    plt.ylim(lims)
    _ = plt.plot(lims, lims)

    The output will be as follows:

Figure 3.20: Predictions versus ground truth scatterplot

The scatterplot plots predicted values against true values: the closer the points lie to the diagonal line, the more accurate the predictions. The points are clearly clustered around the diagonal, meaning the predictions are fairly accurate.
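
A useful complement to the scatterplot is the distribution of the prediction errors; a short sketch in the same style as the previous plots would be as follows:

    errors = test_predictions - test_labels
    plt.hist(errors, bins=25)
    plt.xlabel('Prediction Error [MPG]')
    plt.ylabel('Count')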

Note

To access the source code for this specific section, please refer to https://packt.live/3feCLNN.

You can also run this example online at https://packt.live/37n5WeM.

This section has shown how to successfully tackle a regression problem. The selected dataset was imported, cleaned, and subdivided into training, validation, and test sets. Then, a brief exploratory data analysis was carried out before a three-layered fully connected deep neural network was created. The network was successfully trained and its performance was evaluated on the test set.

Now, let's study classification problems using TensorFlow.