Implementing Loss Functions

Loss functions are very important to machine learning algorithms. They measure the distance between the model outputs and the target (truth) values. In this recipe, we show various loss function implementations in TensorFlow.

Getting ready

In order to optimize our machine learning algorithms, we will need to evaluate the outcomes. Evaluating outcomes in TensorFlow depends on specifying a loss function. A loss function tells TensorFlow how good or bad the predictions are compared to the desired result. In most cases, we will have a set of data and a target on which to train our algorithm. The loss function compares the target to the prediction and gives a numerical distance between the two.

For this recipe, we will cover the main loss functions that we can implement in TensorFlow.

To see how the different loss functions operate, we will plot them in this recipe. We will first start a computational graph session and load matplotlib, a Python plotting library, as follows:

import matplotlib.pyplot as plt
import tensorflow as tf
sess = tf.Session()  # graph session used by the sess.run() calls below

How to do it…

First we will talk about loss functions for regression, that is, predicting a continuous dependent variable. To start, we will create a sequence of our predictions and a target as a tensor. We will output the results across 500 x-values between -1 and 1. See the next section for a plot of the outputs. Use the following code:

x_vals = tf.linspace(-1., 1., 500)
target = tf.constant(0.)
  1. The L2 norm loss is also known as the Euclidean loss function. It is just the square of the distance to the target. Here we will compute the loss function as if the target is zero. The L2 norm is a great loss function because it is very curved near the target, and algorithms can use this fact to converge to the target more slowly the closer they get to it, as follows:
    l2_y_vals = tf.square(target - x_vals)
    l2_y_out = sess.run(l2_y_vals)

    Note

    TensorFlow has a built-in form of the L2 norm, called nn.l2_loss(). This function is actually half of the L2 norm above; in other words, it is the same as the previous function but divided by 2. A quick numeric check of this appears after these steps.

  2. The L1 norm loss is also known as the absolute loss function. Instead of squaring the difference, we take the absolute value. The L1 norm is better for outliers than the L2 norm because it is not as steep for larger values. One issue to be aware of is that the L1 norm is not smooth at the target and this can result in algorithms not converging well. It appears as follows:
    l1_y_vals = tf.abs(target - x_vals)
    l1_y_out = sess.run(l1_y_vals)
  3. Pseudo-Huber loss is a continuous and smooth approximation to the Huber loss function. This loss function attempts to take the best of the L1 and L2 norms by being convex near the target and less steep for extreme values. The form depends on an extra parameter, delta, which dictates how steep it will be. We will plot two forms, delta1 = 0.25 and delta2 = 5, to show the difference (a small numeric check of this behavior also appears after these steps), as follows:
    delta1 = tf.constant(0.25)
    phuber1_y_vals = tf.mul(tf.square(delta1), tf.sqrt(1. + 
                            tf.square((target - x_vals)/delta1)) - 1.)
    phuber1_y_out = sess.run(phuber1_y_vals)
    delta2 = tf.constant(5.)
    phuber2_y_vals = tf.mul(tf.square(delta2), tf.sqrt(1. + 
                            tf.square((target - x_vals)/delta2)) - 1.)
    phuber2_y_out = sess.run(phuber2_y_vals)
  4. Classification loss functions are used to evaluate loss when predicting categorical outcomes.
  5. We will need to redefine our predictions (x_vals) and target. We will save the outputs and plot them in the next section. Use the following:
    x_vals = tf.linspace(-3., 5., 500)
    target = tf.constant(1.)
    targets = tf.fill([500,], 1.)
  6. Hinge loss is mostly used for support vector machines, but can be used in neural networks as well. It is meant to compute a loss between two target classes, 1 and -1. In the following code, we are using the target value 1, so the closer our predictions are to 1, the lower the loss value (a numeric spot check of this loss and the next one appears after these steps):
    hinge_y_vals = tf.maximum(0., 1. - tf.mul(target, x_vals))
    hinge_y_out = sess.run(hinge_y_vals)
  7. Cross-entropy loss for a binary case is also sometimes referred to as the logistic loss function. It comes about when we are predicting the two classes 0 or 1. We wish to measure a distance from the actual class (0 or 1) to the predicted value, which is usually a real number between 0 and 1. To measure this distance, we can use the cross entropy formula from information theory, as follows:
    xentropy_y_vals = - tf.mul(target, tf.log(x_vals)) - tf.mul((1. - target), tf.log(1. - x_vals))
    xentropy_y_out = sess.run(xentropy_y_vals)
  8. Sigmoid cross entropy loss is very similar to the previous loss function except we transform the x-values by the sigmoid function before we put them in the cross entropy loss, as follows:
    xentropy_sigmoid_y_vals = tf.nn.sigmoid_cross_entropy_with_logits(x_vals, targets)
    xentropy_sigmoid_y_out = sess.run(xentropy_sigmoid_y_vals)
  9. Weighted cross entropy loss is a weighted version of the sigmoid cross entropy loss. We provide a weight on the positive target. As an example, we will weight the positive target by 0.5, as follows:
    weight = tf.constant(0.5)
    xentropy_weighted_y_vals = tf.nn.weighted_cross_entropy_with_logits(x_vals, targets, weight)
    xentropy_weighted_y_out = sess.run(xentropy_weighted_y_vals)
  10. Softmax cross-entropy loss operates on non-normalized outputs. It is used when each data point belongs to exactly one target category rather than to several at once. Because of this, the function transforms the outputs into a probability distribution via the softmax function and then computes the loss against the true probability distribution, as follows:
    unscaled_logits = tf.constant([[1., -3., 10.]])
    target_dist = tf.constant([[0.1, 0.02, 0.88]])
    softmax_xentropy = tf.nn.softmax_cross_entropy_with_logits(unscaled_logits, target_dist)
    print(sess.run(softmax_xentropy))
    [ 1.16012561]
  11. Sparse softmax cross-entropy loss is the same as the previous one, except instead of the target being a probability distribution, it is the index of the true category. Instead of a sparse, all-zero target vector with a single value of one, we just pass in the index of the category that is the true value (a check of this and the previous step appears after these steps), as follows:
    unscaled_logits = tf.constant([[1., -3., 10.]])
    sparse_target_dist = tf.constant([2])
    sparse_xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(unscaled_logits, sparse_target_dist)
    print(sess.run(sparse_xentropy))
    [ 0.00012564]
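
As a quick check of the note in step 1, the following sketch (using a made-up residual vector) compares the built-in tf.nn.l2_loss() with half of the summed squared error; if the note is right, the two printed values agree:

residual = tf.constant([1., 2., 3.])
print(sess.run(tf.nn.l2_loss(residual)))                  # (1 + 4 + 9) / 2 = 7.0
print(sess.run(tf.reduce_sum(tf.square(residual)) / 2.))  # 7.0, matching the built-in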
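
The next sketch, with made-up residual values, illustrates the Pseudo-Huber behavior described in step 3: near the target it behaves roughly like half the squared error, while far from the target it grows roughly linearly with slope delta:

delta = tf.constant(0.25)
small_residual = tf.constant(0.01)
large_residual = tf.constant(100.)
pseudo_huber = lambda r: tf.mul(tf.square(delta), tf.sqrt(1. + tf.square(r / delta)) - 1.)
print(sess.run(pseudo_huber(small_residual)))  # ~0.00005, about 0.5 * 0.01**2
print(sess.run(pseudo_huber(large_residual)))  # ~24.9, about delta * 100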
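
Here is a small numeric spot check of the hinge and cross-entropy losses from steps 6 and 7, using a single made-up prediction of 0.8 (recall that target is 1 at this point):

pred = tf.constant(0.8)
hinge_check = tf.maximum(0., 1. - tf.mul(target, pred))
xentropy_check = -tf.mul(target, tf.log(pred)) - tf.mul((1. - target), tf.log(1. - pred))
print(sess.run(hinge_check))     # max(0, 1 - 1 * 0.8) = 0.2
print(sess.run(xentropy_check))  # -log(0.8), roughly 0.223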
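
Here are two checks of steps 10 and 11: the softmax cross entropy can be computed by hand as the negative sum of the target distribution times the log of the softmax outputs, and the sparse version should match the dense one when the target is written as the one-hot vector [0, 0, 1]:

manual_softmax = tf.nn.softmax(unscaled_logits)
manual_xentropy = -tf.reduce_sum(tf.mul(target_dist, tf.log(manual_softmax)), 1)
print(sess.run(manual_xentropy))  # should be close to [1.16012561] from step 10
one_hot_dist = tf.constant([[0., 0., 1.]])
print(sess.run(tf.nn.softmax_cross_entropy_with_logits(unscaled_logits, one_hot_dist)))
# should be close to [0.00012564] from step 11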

How it works…

Here is how to use matplotlib to plot the regression loss functions:

x_array = sess.run(x_vals)
plt.plot(x_array, l2_y_out, 'b-', label='L2 Loss')
plt.plot(x_array, l1_y_out, 'r--', label='L1 Loss')
plt.plot(x_array, phuber1_y_out, 'k-.', label='P-Huber Loss (0.25)')
plt.plot(x_array, phuber2_y_out, 'g:', label='P-Huber Loss (5.0)')
plt.ylim(-0.2, 0.4)
plt.legend(loc='lower right', prop={'size': 11})
plt.show()

Figure 4: Plotting various regression loss functions.

And here is how to use matplotlib to plot the various classification loss functions:

x_array = sess.run(x_vals)
plt.plot(x_array, hinge_y_out, 'b-', label='Hinge Loss')
plt.plot(x_array, xentropy_y_out, 'r--', label='Cross Entropy Loss')
plt.plot(x_array, xentropy_sigmoid_y_out, 'k-.', label='Cross Entropy Sigmoid Loss')
plt.plot(x_array, xentropy_weighted_y_out, 'g:', label='Weighted Cross Entropy Loss (x0.5)')
plt.ylim(-1.5, 3)
plt.legend(loc='lower right', prop={'size': 11})
plt.show()

Figure 5: Plots of classification loss functions.

There's more…

Here is a summary of the different loss functions that we have described:

  L2 norm loss: regression; very curved (smooth) near the target, but sensitive to outliers.
  L1 norm loss: regression; more robust to outliers, but not smooth at the target, which can hurt convergence.
  Pseudo-Huber loss: regression; smooth near the target and less steep for extreme values; requires an extra delta parameter.
  Hinge loss: classification with targets -1 and 1; mostly used for support vector machines, but usable in neural networks as well.
  Cross-entropy loss (and its sigmoid, weighted, softmax, and sparse softmax variants): classification; measures the distance between the predicted and true class probabilities.

The remaining classification loss functions all have to do with the type of cross-entropy loss. The cross-entropy sigmoid loss function is for use on unscaled logits, and it is preferred over computing the sigmoid and then the cross entropy because TensorFlow has better built-in ways to handle numerical edge cases. The same goes for softmax cross entropy and sparse softmax cross entropy.
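
To illustrate the kind of edge case involved, the following sketch (with a made-up, extreme logit value) compares the naive formula with the rearrangement that TensorFlow's documentation gives for sigmoid_cross_entropy_with_logits, namely max(x, 0) - x * z + log(1 + exp(-|x|)):

big_logit = tf.constant([100.])
label = tf.constant([0.])
naive = -tf.mul(label, tf.log(tf.sigmoid(big_logit))) - \
        tf.mul((1. - label), tf.log(1. - tf.sigmoid(big_logit)))
stable = tf.maximum(big_logit, 0.) - tf.mul(big_logit, label) + tf.log(1. + tf.exp(-tf.abs(big_logit)))
print(sess.run(naive))   # [inf]: sigmoid(100.) rounds to 1.0, so log(1. - 1.0) blows up
print(sess.run(stable))  # [100.]: the correct, finite loss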

Note

Most of the classification loss functions described here are for two class predictions. This can be extended to multiple classes via summing the cross entropy terms over each prediction/target.
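
As a sketch of that extension (with made-up targets and logits for three classes), the per-class cross entropy terms are simply summed over the classes:

multi_targets = tf.constant([[1., 0., 1.]])      # made-up targets for three classes
multi_logits = tf.constant([[2.0, -1.0, 0.5]])   # made-up unscaled predictions
per_class = tf.nn.sigmoid_cross_entropy_with_logits(multi_logits, multi_targets)
total_loss = tf.reduce_sum(per_class, 1)         # sum the cross entropy terms over the classes
print(sess.run(total_loss))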

There are also many other metrics to look at when evaluating a model. Here is a list of some more to consider: