Overview
Definition
k-Nearest Neighbors (k-NN) is a non-parametric, instance-based supervised learning algorithm. It classifies or predicts the value of a query point from the k closest labeled instances in the training data.
History
Introduced by Fix and Hodges (1951) and formalized by Cover and Hart (1967), it is one of the earliest and simplest supervised learning algorithms.
Key Characteristics
Lazy learning: no explicit training phase. Instance-based: stores the entire training set. Relies on distance measures to define proximity. Applicable to both classification and regression.
"The k-nearest neighbor algorithm is a simple, effective method that leverages the proximity of data points in feature space to make predictions." -- Cover & Hart, 1967
Algorithm
Step-by-Step Procedure
1. Choose k, the number of neighbors.
2. Compute the distance from the query point to every training point.
3. Identify the k nearest neighbors by distance.
4. For classification: take a majority vote among the neighbors' labels. For regression: average (or otherwise aggregate) the neighbors' target values.
Pseudocode
Input: training set D, query point q, number of neighbors k
For each point p in D:
    compute distance d = dist(p, q)
Sort points by d ascending
Select the k closest points: N_k
If classification:
    return the class with the highest frequency in N_k
Else if regression:
    return the average target value of N_k
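For concreteness, the pseudocode translates to the following minimal NumPy sketch; the function name knn_predict and the toy data are illustrative, not from any particular library:

```python
import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distances from the query to every training point (Euclidean).
    dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    # Majority vote among the neighbors' labels.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage: two 2-D clusters.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.1])))  # -> 0
```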
Decision Boundaries
Non-linear, shaped by the local distribution of neighbors. Boundaries adapt flexibly to the geometry of the data.
Distance Metrics
Euclidean Distance
Most common metric. Formula: square root of sum of squared differences across features.
Manhattan Distance
Sum of absolute differences across features. Less sensitive to outliers than Euclidean distance, since differences are not squared.
Other Metrics
Minkowski distance (generalized), Chebyshev distance, Mahalanobis distance (accounts for feature correlation), cosine similarity (for text/angles).
Euclidean: d(p, q) = sqrt(Σ (p_i - q_i)^2)
Manhattan: d(p, q) = Σ |p_i - q_i|
Minkowski: d(p, q) = (Σ |p_i - q_i|^m)^(1/m)
Mahalanobis: d(p, q) = sqrt((p - q)^T S^(-1) (p - q))
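A direct NumPy transcription of these formulas, assuming NumPy array inputs (the helper names are illustrative):

```python
import numpy as np

def euclidean(p, q):
    return np.sqrt(np.sum((p - q) ** 2))

def manhattan(p, q):
    return np.sum(np.abs(p - q))

def minkowski(p, q, m=3):
    # m = 2 recovers Euclidean, m = 1 Manhattan.
    return np.sum(np.abs(p - q) ** m) ** (1.0 / m)

def mahalanobis(p, q, S):
    # S is the feature covariance matrix estimated from the training data.
    diff = p - q
    return np.sqrt(diff @ np.linalg.inv(S) @ diff)
```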
Feature Scaling
Importance
Because k-NN is distance-based, it is sensitive to feature magnitudes: unscaled features with large ranges dominate the distance computation and distort neighbor selection.
Common Methods
Min-max normalization: rescales to [0,1]. Standardization: zero mean, unit variance. Robust scaling: median and IQR.
Effect on Performance
Proper scaling improves accuracy and removes bias toward features with larger ranges.
| Scaling Method | Formula | Effect |
|---|---|---|
| Min-Max | (x - min) / (max - min) | Scales features to [0,1] |
| Standardization | (x - μ) / σ | Centers data, unit variance |
| Robust Scaling | (x - median) / IQR | Resilient to outliers |
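The table's formulas in plain NumPy, on a made-up feature containing one outlier (scikit-learn's MinMaxScaler, StandardScaler, and RobustScaler implement the same transforms):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one feature with an outlier

min_max  = (x - x.min()) / (x.max() - x.min())  # squeezed toward 0 by the outlier
standard = (x - x.mean()) / x.std()             # mean/std also pulled by the outlier
q1, q3   = np.percentile(x, [25, 75])
robust   = (x - np.median(x)) / (q3 - q1)       # IQR-based, barely affected
```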
Classification
Majority Voting
Assign the class label by majority vote among the k neighbors. Ties can be broken by distance weighting, by reducing k, or by choosing an odd k for binary problems.
Distance-Weighted Voting
Neighbors weighted inversely by distance. Closer neighbors influence decision more strongly.
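A minimal sketch of inverse-distance voting, assuming the k neighbors' labels and distances have already been found; the epsilon guards against division by zero when a query coincides with a training point:

```python
import numpy as np

def weighted_vote(neighbor_labels, neighbor_dists, eps=1e-12):
    """Return the class with the largest total inverse-distance weight."""
    weights = 1.0 / (np.asarray(neighbor_dists) + eps)
    scores = {}
    for label, w in zip(neighbor_labels, weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)
```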
Evaluation Metrics
Accuracy, precision, recall, F1-score, confusion matrix used to assess classification quality.
| Metric | Definition |
|---|---|
| Accuracy | (TP + TN) / Total samples |
| Precision | TP / (TP + FP) |
| Recall | TP / (TP + FN) |
| F1-Score | 2 * (Precision * Recall) / (Precision + Recall) |
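All of these metrics are available in scikit-learn; a brief usage sketch with made-up label vectors:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print(accuracy_score(y_true, y_pred))   # (TP + TN) / total
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
print(confusion_matrix(y_true, y_pred))
```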
Regression
Prediction Method
Output is average or weighted average of neighbors' continuous values.
Weighting Schemes
Uniform weights or inverse distance weights to emphasize nearer points.
Performance Metrics
Mean Squared Error (MSE), Mean Absolute Error (MAE), R-squared used for evaluation.
Regression prediction:
y(q) = (1/k) * Σ y_i for the k nearest neighbors
or, weighted:
y(q) = Σ w_i * y_i / Σ w_i, with w_i = 1 / d(q, x_i)
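Both formulas in a single NumPy helper (the name knn_regress is illustrative):

```python
import numpy as np

def knn_regress(y_neighbors, dists=None, eps=1e-12):
    """Uniform mean, or inverse-distance weighted mean when dists is given."""
    y = np.asarray(y_neighbors, dtype=float)
    if dists is None:
        return y.mean()                  # y(q) = (1/k) * sum(y_i)
    w = 1.0 / (np.asarray(dists) + eps)  # w_i = 1 / d(q, x_i)
    return np.sum(w * y) / np.sum(w)
```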
Parameter Selection
Choosing k
A small k is sensitive to noise and prone to overfitting; a large k smooths decision boundaries and risks underfitting. Cross-validation is commonly used to choose k.
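One common recipe is a cross-validated grid search over k, e.g. with scikit-learn (the odd-valued grid here is an example choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": list(range(1, 31, 2))},
                      cv=5)
search.fit(X, y)
print(search.best_params_)  # k with the best mean cross-validation accuracy
```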
Distance Metric Choice
Depends on data type, dimensionality, correlation. Mahalanobis for correlated features, cosine for text.
Weighting Options
Uniform vs distance-weighted influence. Weighting often improves performance, especially with noisy data.
Advantages and Limitations
Advantages
Simple, intuitive, no training required, flexible decision boundaries, effective with sufficient data, supports multi-class problems.
Limitations
High computational cost at prediction, sensitive to irrelevant/noisy features, requires careful feature scaling, suffers from curse of dimensionality.
Mitigation Strategies
Use dimensionality reduction, feature selection, approximate nearest neighbor algorithms, optimized data structures like KD-trees or ball trees.
Computational Complexity
Training Phase
Effectively O(1) computation, since no explicit model is built: training reduces to storing the dataset, at O(n * d) memory.
Prediction Phase
O(n * d) per query (n = training size, d = dimensionality) for naive search. Accelerated methods reduce this.
Data Structures for Efficiency
KD-trees: efficient for low-dimensional data. Ball trees and cover trees extend to higher dimensions. Approximate nearest neighbor search reduces complexity further.
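In scikit-learn the search structure can be selected explicitly; a brief sketch:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))  # low-dimensional data, where KD-trees shine

nn = NearestNeighbors(n_neighbors=5, algorithm="kd_tree").fit(X)
dists, idx = nn.kneighbors(rng.normal(size=(1, 3)))  # 5 nearest to one query
```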
Applications
Pattern Recognition
Handwriting recognition, image classification, speech recognition.
Medical Diagnosis
Predict disease presence, prognosis based on patient features.
Recommendation Systems
User-based collaborative filtering, item similarity detection.
Anomaly Detection
Identify outliers by distance from normal data clusters.
Implementation Details
Data Storage
Stores entire training set in memory. Memory usage scales linearly with dataset size and dimensionality.
Algorithm Optimization
Use efficient nearest neighbor search algorithms, early stopping criteria, dimensionality reduction to speed up predictions.
Software Libraries
Available in scikit-learn (Python), Weka (Java), MATLAB, and R (the class package).
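A minimal end-to-end example with scikit-learn, with scaling inside a pipeline since k-NN is distance-based:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))  # test-set accuracy
```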
Extensions and Variants
Radius Neighbors
Considers all neighbors within a fixed radius instead of fixed k.
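scikit-learn exposes this variant as RadiusNeighborsClassifier (with a regression counterpart); a tiny illustration:

```python
from sklearn.neighbors import RadiusNeighborsClassifier

# Every training point within the fixed radius votes, however many there are.
clf = RadiusNeighborsClassifier(radius=1.0).fit([[0], [1], [5]], [0, 0, 1])
print(clf.predict([[0.5]]))  # -> [0]
```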
Weighted k-NN
Weights neighbors by inverse distance or kernel functions.
Condensed Nearest Neighbor
Reduces dataset size by removing redundant points.
Fuzzy k-NN
Assigns membership probabilities rather than hard labels.
Adaptive k-NN
Varies k dynamically based on local data density.
References
- E. Fix and J. L. Hodges, "Discriminatory Analysis: Nonparametric Discrimination: Consistency Properties," USAF School of Aviation Medicine, Randolph Field, TX, Technical Report 4, 1951.
- T. M. Cover and P. E. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21-27, 1967.
- R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2000.
- L. I. Kuncheva, "Nearest neighbor classifier: Simplicity and effectiveness," Pattern Recognition Letters, vol. 31, no. 6, pp. 654-658, 2010.
- C. E. Brodley and M. A. Friedl, "Identifying mislabeled training data," Journal of Artificial Intelligence Research, vol. 11, pp. 131-167, 1999.
- J. Wang, J. Yang, J. L. Huang, and Y. Zhang, "A review on nearest neighbor algorithms," Neural Computing and Applications, vol. 32, no. 4, pp. 1043-1066, 2020.