- Advanced Machine Learning with R
- Cory Lesmeister Dr. Sunil Kumar Chinnamgari
- 517字
- 2021-06-24 14:24:40
K-nearest neighbors
In our previous efforts, we built models that had coefficients or, to put it in another way, parameter estimates for each of our included features. With KNN, we have no parameters as the learning method is so-called instance-based learning. In short, labeled examples (inputs and corresponding output labels) are stored, and no action is taken until a new input pattern demands an output value (Battiti and Brunato, 2014, p. 11). This method is commonly called lazy learning, as no specific model parameters are produced. The train instances themselves represent the knowledge. For the prediction of any new instance (a new data point), the training data is searched for an instance that most resembles the new instance in question. KNN does this for a classification problem by looking at the closest points—the nearest neighbors—to determine the proper class. The k comes into play by deciding how many neighbors should be examined by the algorithm, so if k=5, it will consider the five nearest points. A weakness of this method is that all five points are given equal weight in the algorithm even if they're less relevant in learning. We'll look at methods using R and try to alleviate this issue.
The best way to understand how this works is with a simple visual example of a binary classification learning problem. In the following screenshot, we have a plot showing whether a tumor is benign or malignant based on two predictive features. The X in the plot indicates a new observation that we would like to predict. If our algorithm considers K=3, the circle encompasses the three observations that are nearest to the one that we want to score. As the most commonly occurring classifications are malignant, the X data point is classified as malignant, as shown in the following screenshot:
Even from this simple example, it's clear that the selection of k for the nearest neighbors is critical. If k is too small, you may have a high variance on the test set observations even though you have a low bias. On the other hand, as k grows, you may decrease your variance, but the bias may be unacceptable. Cross-validation is necessary to determine the proper k.
It's also important to point out the calculation of the distance or the nearness of the data points in our feature space. The default distance is Euclidean distance. This is merely the straight-line distance from point A to point B—as the crow flies—or you can utilize the formula that states it's equivalent to the square root of the sum of the squared differences between the corresponding points. The formula for Euclidean distance, given point A and B with coordinates p1, p2, ... pn and q1, q2, ... qn respectively, would be as follows:
This distance is highly dependent on the scale that the features were measured on, so it's critical to standardize them. Other distance calculations as well as weights can be used, depending on the distance. We'll explore this in the upcoming example.