I fell in love with the k-Nearest Neighbors algorithm at first sight, but it isn’t blind love: I have plenty of reasons to be mad about it.
1. It’s pretty intuitive and simple
Given that all you need to do is compare samples, the k-Nearest Neighbors (k-NN) algorithm is a perfect first step for introducing Machine Learning. It’s very simple to understand, easy to explain and perfect to demonstrate to everyone. Despite its simplicity, k-NN is also immensely powerful, and it is used extensively in industry.
2. It makes no assumptions
This is a characteristic of non-parametric algorithms like k-NN: they assume nothing about the input data distribution. Parametric models, like linear regression, assume that data sets obey certain theoretical constraints, but real-world datasets often violate those assumptions. Therefore, k-Nearest Neighbors is often the first option when there is little or no prior knowledge about the data distribution.
3. There is no training step
k-NN doesn’t require an explicit training step, because there is no model to build. It is an instance-based learning method, meaning the algorithm doesn’t explicitly learn a model: the training phase consists only of storing the feature vectors and class labels of the training samples. This also means that the training stage is very fast, since data storage is all that is actually happening.
4. It immediately adapts to changes
Since it is an instance-based learner, k-NN is a memory-based approach: the classifier immediately adapts as we collect new training data. This allows the algorithm to respond quickly to changes in the input during real-time use.
5. It naturally lends itself to multi-class problems
Many classification algorithms are binary by nature, and adapting them to multi-class problems requires extra strategies: transformation to binary, extension from binary, etc. k-NN, however, naturally handles more than two classes.
6. It can be used for classification & regression tasks
The natural use of this method is placing input points into discrete categories, but we can easily extend the algorithm to regression problems: instead of combining the discrete labels of the k neighbors, we combine their continuous targets. These predictions can be obtained in different ways, for example by averaging the values of the k most similar instances.
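To make the averaging concrete, here is a minimal from-scratch sketch; the function name and the toy data are mine, not a library API:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def knn_regress(X_train, y_train, x, k=3):
    """Predict a continuous value by averaging the targets of the
    k training points nearest to x (Euclidean distance)."""
    neighbors = sorted(zip(X_train, y_train), key=lambda p: dist(p[0], x))[:k]
    return sum(target for _, target in neighbors) / k

# Toy 1-D example where y = 2 * x on the training points.
X = [(1.0,), (2.0,), (3.0,), (4.0,)]
y = [2.0, 4.0, 6.0, 8.0]
print(knn_regress(X, y, (2.5,), k=2))  # averages the targets of x=2 and x=3 → 5.0
```

Swapping the average for a distance-weighted average is a one-line change and often works better when the neighbors are at very different distances.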
7. Only one hyperparameter to set
It takes some time to learn which hyperparameter setting is appropriate for which model behavior, so having a single one, the number of neighbors k, keeps tuning simple. Furthermore, the number of hyperparameters is directly related to the potential for overfitting: the fewer, the better.
8. It’s flexible in the choice of distance
Different distance functions serve different purposes for different types of datasets, so you can take your time to choose your best option. Remember that your choice must depend first on the nature of the features, and k-NN admits them all: you can use distances for numeric features (Euclidean, Manhattan, etc.) as well as for categorical ones (Hamming, etc.).
9. It can provide density estimation
While classification remains the main application of k-NN, we can also turn it into a density estimator: to estimate the data density at a point x, we can use the volume of the smallest region that contains its k nearest neighbors. Besides, we can interpret the fraction of neighbors labeled as class A as the conditional probability of that class. In addition, k-NN provides an explicit reject option if there is no majority agreement. In other words, you can easily change the decision criterion and get probability estimates instead of fixed labels.
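The probability-estimate idea can be sketched by returning the fraction of neighbors carrying each label (function name and toy data are illustrative):

```python
from collections import Counter
from math import dist

def knn_proba(train, x, k):
    """Class-probability estimates: the fraction of the k nearest
    neighbors of x carrying each label."""
    nearest = sorted(train, key=lambda p: dist(p[0], x))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: n / k for label, n in counts.items()}

train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"), ((2, 2), "B"), ((3, 2), "B")]
print(knn_proba(train, (1.0, 1.0), k=5))  # → {'A': 0.6, 'B': 0.4}
```

A reject option then falls out for free: if no class reaches your chosen probability threshold, refuse to classify.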
10. It’s easy to implement and tune
Implementing k-NN doesn’t require much code and can be done quickly. Given that, it doesn’t take a lot of time to create your own version, one that better suits the key points of your specific problem, and k-NN works well as a baseline method. Furthermore, you can find many extensions of k-NN; I overview some of them in the next part of the post.
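For instance, a bare-bones classifier fits in a dozen lines of plain Python (names and toy data are mine):

```python
from collections import Counter
from math import dist

def knn_classify(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    neighbors = sorted(zip(X_train, y_train), key=lambda p: dist(p[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters.
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["A", "A", "A", "B", "B", "B"]
print(knn_classify(X, y, (0.5, 0.5)))  # → "A"
print(knn_classify(X, y, (5.5, 5.5)))  # → "B"
```

This brute-force version is fine as a baseline; everything else in this post is a refinement of these few lines.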
Issues & keys to deal with them
Of course, not everything is a bed of roses with k-NN, and you need to take into account several tricky points that can affect the algorithm and degrade its performance. I list some of them below, but please don’t feel discouraged about k-NN: everything in life has a solution!
1. Test stage is slow
In contrast with the fast training stage, testing is very expensive. All the cost of the algorithm is in computing the prediction: for every test sample, the model has to run through the entire training set to compute distances and then find the nearest neighbors.
Proposal: To deal with the heavy computation of k-NN, some alternatives to brute-force search have been developed. The k-d tree and the ball tree are the best-known ones.
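To see where the speed-up comes from, here is a toy k-d tree for single-nearest-neighbor search: it prunes any branch that cannot contain a point closer than the current best. This is a from-scratch sketch; in practice you would rely on a library implementation (scikit-learn’s neighbors classes, for example, accept `algorithm='kd_tree'` or `'ball_tree'`).

```python
from math import dist

def build_kdtree(points, depth=0):
    """Recursively build a k-d tree: split on alternating axes at the median."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, target, depth=0, best=None):
    """Find the nearest neighbor of target, skipping subtrees that cannot
    hold a point closer than the current best."""
    if node is None:
        return best
    if best is None or dist(node["point"], target) < dist(best, target):
        best = node["point"]
    axis = depth % len(target)
    diff = target[axis] - node["point"][axis]
    close, away = ("left", "right") if diff < 0 else ("right", "left")
    best = nearest(node[close], target, depth + 1, best)
    if abs(diff) < dist(best, target):  # the far side could still hold a closer point
        best = nearest(node[away], target, depth + 1, best)
    return best

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (9, 2)))  # → (8, 1)
```

On low-dimensional data this turns the linear scan into a roughly logarithmic search per query; in high dimensions the pruning stops paying off, which ties into issue 8 below.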
2. It usually needs homogeneous features
If you build k-NN with a common distance, like the Euclidean or Manhattan distance, it is essential that features share the same scale, since absolute differences are weighted equally across features; i.e., a given distance in feature 1 must mean the same as in feature 2.
Proposal: If your raw data is heterogeneous, you need to rescale it to make the distance metric meaningful. Min-max scaling your data to the range [0, 1], or z-scoring it to zero mean and unit variance, is usually a good idea. A different solution is to design your own distance function, adapting it to the nature of your features. As long as the input features and the distance criterion match, everything is ok.
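A minimal min-max rescaling sketch (the helper is illustrative, not a library function):

```python
def minmax_scale(column):
    """Rescale a list of values to the [0, 1] range."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

# Two features on wildly different scales end up comparable after scaling.
income = [20_000, 50_000, 80_000]
age = [25, 40, 55]
print(minmax_scale(income))  # → [0.0, 0.5, 1.0]
print(minmax_scale(age))     # → [0.0, 0.5, 1.0]
```

Without this step, the income feature would dominate every Euclidean distance and the age feature would be effectively ignored.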
3. It requires picking the hyperparameter k
The selection of the number of neighbors is a key point, and you need to know how it affects your classifier: a very small k means low bias but high variance, while with many voters the variance decreases but the bias increases.
Proposal: We can find the equilibrium with the help of cross-validation. K-fold cross-validation is one of the favorites (this k is the number of folds, completely independent of your k neighbors!) and guards against overfitting by measuring each candidate value’s accuracy on held-out folds. Furthermore, there are versions of the k-NN algorithm that downplay the selection of k: weighted k-NN uses the inverse of the distance to weight the neighbors’ classes, which makes the similarity metric more decisive than the exact number of neighbors.
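A compact sketch of selecting k by k-fold cross-validation, in plain Python on a toy, cleanly separable dataset (all names are mine):

```python
from collections import Counter
from math import dist

def knn_predict(train, x, k):
    """Majority vote among the k nearest (point, label) pairs in train."""
    nearest = sorted(train, key=lambda p: dist(p[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cv_accuracy(data, k, folds=5):
    """Mean accuracy of k-NN over `folds` contiguous validation folds."""
    size = len(data) // folds
    hits = 0
    for f in range(folds):
        val = data[f * size:(f + 1) * size]
        train = data[:f * size] + data[(f + 1) * size:]
        hits += sum(knn_predict(train, x, k) == label for x, label in val)
    return hits / (folds * size)

# Two well-separated 1-D clusters, 15 samples per class.
data = ([((i * 0.1, 0.0), "A") for i in range(15)]
        + [((i * 0.1 + 5.0, 0.0), "B") for i in range(15)])
best_k = max([1, 3, 5, 7], key=lambda k: cv_accuracy(data, k))
print(best_k)  # → 1 (every candidate k ties at accuracy 1.0 on this easy toy set)
```

On real, noisy data the candidate values of k produce different scores, and you would also shuffle the data before slicing it into folds.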
4. Imbalanced data causes problems
k-NN doesn’t perform well on imbalanced data. If we consider two classes, A and B, and the majority of the training data is labeled A, then the model will give a lot of preference to A, and the less common class B may end up wrongly classified.
Proposal: This can be mitigated by bagging class A and B samples separately: we could classify a new sample using several k-nearest-neighbor sets, each obtained by randomly sampling the available data, forcing the sampling to pick balanced class frequencies. Another solution is a frequency correction that weights the neighbors’ votes, giving more weight to those labeled with the less frequent class. SMOTE (Synthetic Minority Over-sampling Technique) and MSMOTE are also methods specifically designed for learning from imbalanced data sets.
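The frequency-correction idea can be sketched like this: each neighbor’s vote is weighted by the inverse frequency of its class in the training set (names and toy data are mine):

```python
from collections import Counter
from math import dist

def balanced_knn(train, x, k):
    """Vote among the k nearest neighbors, weighting each vote by the
    inverse frequency of its class, so the minority class is not drowned out."""
    freq = Counter(label for _, label in train)
    nearest = sorted(train, key=lambda p: dist(p[0], x))[:k]
    scores = Counter()
    for _, label in nearest:
        scores[label] += 1.0 / freq[label]
    return scores.most_common(1)[0][0]

# 8 "A" samples vs. 2 "B" samples; a plain k=5 vote here would pick "A"
# (3 A-neighbors vs. 2 B-neighbors), but the weighted vote picks "B".
train = ([((i, 0), "A") for i in range(8)]
         + [((0.5, 0.4), "B"), ((1.5, 0.4), "B")])
print(balanced_knn(train, (1.0, 0.3), k=5))  # → "B"
```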
5. It’s sensitive to outliers
The algorithm is sensitive to outliers, since a single mislabeled example can dramatically change the class boundaries. Anomalies affect the method significantly, because k-NN gets all its information directly from the input rather than from a model that tries to generalize the data.
Proposal: Avoid a very small number of neighbors (k = 1, for example), especially when you face noisy data (so, always).
6. It can’t deal with missing values
We need a complete feature vector for each instance in order to compute distances, so any missing value has to be filled in.
Proposal: Filling each missing value with the feature’s average over the entire dataset, or restricting the distance calculation to the subspace of observed features, are reasonable choices.
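A sketch of mean imputation on a small table with missing entries (`None` marks a missing value; the helper name is mine):

```python
def impute_mean(rows):
    """Replace None entries in each column with that column's mean
    over the observed values."""
    cols = list(zip(*rows))
    means = [sum(v for v in col if v is not None) / sum(v is not None for v in col)
             for col in cols]
    return [tuple(m if v is None else v for v, m in zip(row, means))
            for row in rows]

data = [(1.0, 10.0), (3.0, None), (None, 30.0)]
print(impute_mean(data))  # → [(1.0, 10.0), (3.0, 20.0), (2.0, 30.0)]
```

Per-class means, medians (more robust to outliers), or an auxiliary k-NN imputation are natural upgrades of the same idea.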
7. Distance to the closest neighbors might change a lot for different instances
When running the k-NN algorithm to make decisions, sometimes the voters are very close to the new instance to classify, but in other cases, when close examples are lacking in the training set, very distant neighbors decide the class of the instance.
Proposal: If your problem needs to preserve how near the voters are, you should switch to the radius nearest neighbors version: instead of looking for the k nearest neighbors, the algorithm finds those examples no farther away than a distance limit. I really love this version. Note that in this case the number of votes will vary from one instance to another, and it’s even possible for some examples to be left unclassified if there’s no data sample similar enough.
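A sketch of the radius-based variant, returning `None` as an explicit reject when no training sample is close enough (scikit-learn ships this idea as `RadiusNeighborsClassifier`; the helper below is mine):

```python
from collections import Counter
from math import dist

def radius_classify(train, x, radius):
    """Vote only among training points within `radius` of x; return None
    when no sample is similar enough (an explicit reject option)."""
    votes = Counter(label for p, label in train if dist(p, x) <= radius)
    return votes.most_common(1)[0][0] if votes else None

train = [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B")]
print(radius_classify(train, (0.2, 0.2), radius=1.5))  # → "A"
print(radius_classify(train, (9.0, 9.0), radius=1.5))  # → None (no neighbor in range)
```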
8. k-NN isn’t suited to high-dimensional data
When the feature vector dimension is huge, k-NN may suffer. This is because distance measures lose discriminative power in high dimensions: there is little difference between the nearest and the farthest neighbors.
Proposal: Look into dimensionality reduction techniques (feature selection and feature extraction) to redefine the feature vector before using it as the input of the algorithm.
9. It isn’t robust against irrelevant features
Given that distance metrics normally give the same importance to all features, it is key to avoid including features that are irrelevant to the classification problem; otherwise, unrelated features will surely degrade accuracy.
Proposal: Feature selection methods (wrappers, filters and embedded methods) consist of selecting a subset of relevant features to use in your model. You can also tune your k-NN by adding custom weights to your features, or by implementing a boosting scheme that handles the irrelevance itself: randomly sample the feature vector and evaluate the error of each subset at every iteration to update the feature weights according to their importance.
10. Redundancy affects performance
If two or more features are dependent, the algorithm will give them disproportionate weight without any reason to do so.
Proposal: Once you have selected your relevant features, you can use feature extraction techniques to transform your original feature vector into a smaller one that contains no redundant information. PCA is one of the most widespread methods.
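As a sketch, PCA via SVD collapses two perfectly correlated features into a single component before the vector ever reaches k-NN (assuming NumPy; the helper name is mine):

```python
import numpy as np

def pca_transform(X, n_components):
    """Project centered data onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Feature 2 is an exact copy of feature 1 (times two): pure redundancy.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
X = np.column_stack([a, 2 * a])
Z = pca_transform(X, n_components=1)  # one component keeps all the variance
print(Z.shape)  # → (50, 1)
```

After the transform, the duplicated information no longer counts twice in the distance computation.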
Do you know any other trick you’d like to share? Do you have a problem you didn’t find an answer to? Feel free to let us know your feelings on k-Nearest Neighbors.