- R Deep Learning Essentials
- Mark Hodnett, Joshua F. Wiley
L1 penalty
The basic concept of the L1 penalty, also known as the least absolute shrinkage and selection operator (Lasso; Hastie, Tibshirani, and Friedman, 2009), is that a penalty is used to shrink weights toward zero. The penalty term uses the sum of the absolute weights, so some weights may be shrunk all the way to zero, which means the Lasso can also be used as a form of variable selection. The strength of the penalty is controlled by a hyper-parameter, lambda (λ), which multiplies the sum of the absolute weights; it can be a fixed value or, as with other hyper-parameters, optimized using cross-validation or a similar approach.
It is easier to describe the Lasso using an ordinary least squares (OLS) regression model. In regression, a set of coefficients or model weights is estimated using the least-squared-error criterion: the weight/coefficient vector, Θ, is estimated so that it minimizes Σ(yi − ŷi)², where ŷi = b + Θxi, yi is the target value we want to predict, and ŷi is the predicted value. Lasso regression adds a penalty term and instead minimizes Σ(yi − ŷi)² + λ‖Θ‖₁, where ‖Θ‖₁ is the sum of the absolute values of the elements of Θ. Typically, the intercept or offset term, b, is excluded from this constraint.
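To make this concrete, the shrinkage described above can be sketched in a few lines of base R using coordinate descent with a soft-threshold operator. The function names, simulated data, and λ values below are illustrative assumptions, not code from this book:

```r
# Soft-threshold operator: shrink z by lambda, cutting small values to zero
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Minimal Lasso fit via coordinate descent (sketch, not production code)
lasso_cd <- function(X, y, lambda, n_iter = 100) {
  X <- scale(X)        # standardize predictors to unit variance first
  y <- y - mean(y)     # center the target; the intercept is not penalized
  n <- nrow(X)
  p <- ncol(X)
  theta <- rep(0, p)
  for (it in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      # partial residual: remove the fit of all other predictors
      r_j <- y - X[, -j, drop = FALSE] %*% theta[-j]
      z <- sum(X[, j] * r_j) / n
      theta[j] <- soft_threshold(z, lambda) / (sum(X[, j]^2) / n)
    }
  }
  theta
}

set.seed(42)
X <- matrix(rnorm(100 * 5), 100, 5)               # 5 inputs, only 2 matter
y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(100, sd = 0.5)
round(lasso_cd(X, y, lambda = 0.1), 2)            # noise weights shrink toward zero
```

With a large enough λ, every weight is shrunk exactly to zero; with a moderate λ, the two informative inputs keep large weights while the noise inputs are shrunk toward (or exactly to) zero.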
There are a number of practical implications of Lasso regression. First, the effect of the penalty depends on the size of the weights, and the size of the weights depends on the scale of the data. Therefore, data is typically standardized to unit variance first (or at least the variance of each variable is made equal). The L1 penalty has a tendency to shrink small weights exactly to zero (for an explanation of why this happens, see Hastie, Tibshirani, and Friedman, 2009). If you retain only the variables for which the L1 penalty leaves non-zero weights, it can essentially function as feature selection. The tendency of the L1 penalty to shrink small coefficients to zero also conveniently simplifies the interpretation of the model's results.
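The reason small weights hit exactly zero is easiest to see in the special case of orthonormal inputs, where the Lasso solution is simply a soft-thresholded version of the OLS estimate (the weight values below are made up for illustration):

```r
# Soft-thresholding: shrink each weight by lambda and cut small ones to zero
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

ols_weights <- c(2.3, -0.8, 0.05, -0.02)   # hypothetical OLS estimates
soft_threshold(ols_weights, 0.1)
# the large weights shrink by 0.1; the two small ones become exactly 0
```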
Applying the L1 penalty to neural networks works in exactly the same way as it does for regression. If X represents the input, Y the outcome or dependent variable, B the parameters, and F the objective function to be minimized to obtain B (that is, we want to minimize F(B; X, Y)), the L1 penalty modifies the objective function to F(B; X, Y) + λ‖Θ‖₁, where Θ represents the weights (the offsets are typically ignored). The L1 penalty tends to result in a sparse solution (that is, more zero weights) because small and large weights incur the same marginal penalty: at each gradient update, every non-zero weight is moved toward zero by the same amount.
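A minimal sketch of how this plays out in a single gradient update, with hypothetical weights and learning rate and not tied to any particular framework:

```r
# One gradient step with an L1 penalty: the penalty contributes
# lambda * sign(theta), a constant-size pull toward zero for every
# non-zero weight, regardless of the weight's magnitude
l1_update <- function(theta, grad_loss, lr, lambda) {
  theta - lr * (grad_loss + lambda * sign(theta))
}

theta <- c(0.9, -0.5, 0.01)
l1_update(theta, grad_loss = c(0, 0, 0), lr = 0.1, lambda = 0.5)
# each weight moves 0.05 toward zero; the smallest weight overshoots past
# zero, which is why practical implementations often clip or soft-threshold
```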
So far we have only considered the case where λ is a constant controlling the degree of penalty or regularization. In deep neural networks, however, it is possible to set different values for different layers, applying varying degrees of regularization. One reason for considering such differential regularization is that it is sometimes desirable to allow a greater number of parameters (say, by including more neurons in a particular layer) and then counteract this somewhat through stronger regularization. However, allowing the L1 penalty to vary for every layer of a deep neural network and using cross-validation to optimize all possible combinations can be quite computationally demanding, so usually a single value is used across the entire model.
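With per-layer penalties, the total penalty is simply the sum over layers of each layer's λ times the sum of that layer's absolute weights. A small sketch with made-up weights and λ values:

```r
# Per-layer L1 penalty: each layer gets its own lambda (values illustrative)
layer_weights <- list(layer1 = c(0.5, -1.2, 0.3), layer2 = c(2.0, -0.1))
layer_lambdas <- c(layer1 = 0.01, layer2 = 0.1)  # regularize layer2 more strongly

total_penalty <- sum(mapply(function(w, l) l * sum(abs(w)),
                            layer_weights, layer_lambdas))
total_penalty  # 0.01 * (0.5 + 1.2 + 0.3) + 0.1 * (2.0 + 0.1)
```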