Support vector machines

The first time I heard of support vector machines, I have to admit that I was scratching my head, thinking that this was some form of academic obfuscation or inside joke. However, my fair review of SVM has replaced this natural skepticism with a healthy respect for the technique.

SVMs have been shown to perform well in a variety of settings and are often considered one of the best out-of-the-box classifiers (James, G., 2013). To get a practical grasp of the subject, let's look at another simple visual example. In the following screenshot, you'll see that the classification task is linearly separable. However, the dotted line and solid line are just two among an infinite number of possible linear solutions.

You would have separating hyperplanes in a problem that has more than two dimensions:

So many solutions can be problematic for generalization because, whatever solution you choose, any new observation to the right of the line will be classified as benign, and to the left of the line, it'll be classified as malignant. Therefore, either line has no bias on the train data but may have a widely divergent error on any data to test. This is where the support vectors come into play. The probability that a point falls on the wrong side of the linear separator is higher for the dotted line than the solid line, which means that the solid line has a higher margin of safety for classification. Therefore, as Battiti and Brunato say, SVMs are linear separators with the largest possible margin and the support vectors the ones touching the safety margin region on both sides.

The following screenshot illustrates this idea. The thin solid line is the optimal linear separator to create the aforementioned largest possible margin, hence increasing the probability that a new observation will fall on the correct side of the separator. The thicker black lines correspond to the safety margin, and the shaded data points constitute the support vectors. If the support vectors were to move, then the margin and, subsequently, the decision boundary would change. The distance between the separators is known as the margin:

This is all fine and dandy, but real-world problems aren't so clear-cut.

In data that isn't linearly separable, many observations will fall on the wrong side of the margin (these are so-called slack variables), which is a misclassification. The key to building an SVM algorithm is to solve for the optimal number of support vectors via cross-validation. Any observation that lies directly on the wrong side of the margin for its class is known as a support vector.

If the tuning parameter for the number of errors is too large, which means that you have many support vectors, you'll suffer from a high bias and low variance. On the other hand, if the tuning parameter is too small, the opposite might occur. According to James et al., who refer to the tuning parameter as C, as C decreases, the tolerance for observations being on the wrong side of the margin decreases and the margin narrows. This C, or rather, the cost function, allows for observations to be on the wrong side of the margin. If C were set to zero, then we would prohibit a solution where any observation violates the margin. This is a hyperparameter you can tune to optimize bias/variance.

Another essential aspect of SVM is the ability to model nonlinearity with quadratic or higher order polynomials of the input features. In SVMs, this is known as the kernel trick. These can be estimated and selected with cross-validation. In the example, we'll look at the alternatives.

As with any model, you can expand the number of features using polynomials to various degrees, interaction terms, or other derivations. In large datasets, the possibilities can quickly get out of control. The kernel trick with SVMs allows us to efficiently expand the feature space, with the goal that you achieve an approximate linear separation.

To check out how this is done, let's first look at the SVM optimization problem and its constraints. We're trying to achieve the following:

  • Creating weights that maximize the margin
  • Subject to the constraints, no (or as few as possible) data points should lie within that margin

Now, unlike linear regression, where each observation is multiplied by a weight, in SVM, the weights are applied to the inner products of just the support vector observations.

What does this mean? Well, an inner product for two vectors is just the sum of the paired observations' product. For example, if vector one is 3, 4, and 2 and vector two is 1, 2, and 3, then you end up with (3x1) + (4x2) + (2x3) or 17. With SVMs, if we take a possibility that an inner product of each observation has an inner product of every other observation, this amounts to the formula that there would be n(n-1)/2 combinations, where n is the number of observations. With just 10 observations, we end up with 45 inner products. However, SVM only concerns itself with the support vectors' observations and their corresponding weights. For a linear SVM classifier, the formula is as follows:

Here, (x, xi) are the inner products of the support vectors, as α is non-zero only when an observation is a support vector.

This leads to far fewer terms in the classification algorithm and allows the use of the kernel function, commonly referred to as the kernel trick.

The trick in this is that the kernel function mathematically summarizes the transformation of the features in higher dimensions instead of creating them explicitly. In a simplistic sense, a kernel function computes a dot product between two vectors. This has the benefit of creating the higher dimensional, nonlinear space, and decision boundary while keeping the optimization problem computationally efficient. The kernel functions compute the inner product in a higher dimensional space without transforming them into the higher dimensional space.

The notation for popular kernels is expressed as the inner (dot) product of the features, with xi and xj representing vectors, gamma, and c parameters, as follows:

As for the selection of the nonlinear techniques, they require some trial and error, but we'll walk through the various selection techniques.