Overview

Definition

k-Nearest Neighbors (k-NN) is a non-parametric, instance-based supervised learning algorithm. It classifies or predicts a value for a query point by examining the k closest labeled instances in the training data.

History

Introduced by Fix and Hodges (1951), formalized by Cover and Hart (1967). One of the simplest, earliest supervised learning algorithms.

Key Characteristics

Lazy learning: no explicit training phase. Instance-based: stores entire dataset. Uses distance measures for proximity. Applicable to classification and regression.

"The k-nearest neighbor algorithm is a simple, effective method that leverages the proximity of data points in feature space to make predictions." -- Cover & Hart, 1967

Algorithm

Step-by-Step Procedure

1. Choose k, the number of neighbors.
2. Compute the distance from the query point to every training point.
3. Identify the k nearest neighbors by distance.
4. For classification: majority vote among the neighbors. For regression: average or otherwise aggregate their target values.

Pseudocode

Input: training set D, query point q, number of neighbors k
For each point p in D:
    Compute distance d = dist(p, q)
Sort points by d ascending
Select the k closest points: N_k
If classification:
    Return the class with the highest frequency in N_k
Else if regression:
    Return the average target value of N_k
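The pseudocode above can be sketched as a minimal pure-Python implementation; the function name `knn_classify` and the toy dataset are illustrative assumptions, not part of any particular library.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training points.

    `train` is a list of (feature_vector, label) pairs.
    """
    # Compute Euclidean distance from the query to every training point.
    dists = [(math.dist(p, query), label) for p, label in train]
    dists.sort(key=lambda t: t[0])            # sort ascending by distance
    nearest = [label for _, label in dists[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Toy 2-D dataset with two classes.
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"),
         ((5.0, 5.0), "B"), ((6.0, 5.5), "B")]
print(knn_classify(train, (1.2, 1.5), k=3))   # prints A
```

Note that no model is fit in advance: all work happens at query time, which is what "lazy learning" means in practice.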

Decision Boundaries

Non-linear, shaped by local neighbor distribution. Boundaries adapt flexibly to data geometry.

Distance Metrics

Euclidean Distance

Most common metric. Formula: square root of sum of squared differences across features.

Manhattan Distance

Sum of absolute differences across features. Less sensitive to large single-feature deviations than Euclidean distance.

Other Metrics

Minkowski distance (generalized), Chebyshev distance, Mahalanobis distance (accounts for feature correlation), cosine similarity (for text/angles).

Euclidean: d(p, q) = sqrt(Σ (p_i - q_i)^2)
Manhattan: d(p, q) = Σ |p_i - q_i|
Minkowski: d(p, q) = (Σ |p_i - q_i|^m)^(1/m)
Mahalanobis: d(p, q) = sqrt((p - q)^T S^(-1) (p - q))
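The first three metrics (plus Chebyshev) are straightforward to express in plain Python; Mahalanobis is omitted here since it needs a covariance matrix inverse. The function names are illustrative.

```python
import math

def euclidean(p, q):
    # sqrt of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, m=2):
    # generalizes both: m=1 gives Manhattan, m=2 gives Euclidean
    return sum(abs(a - b) ** m for a, b in zip(p, q)) ** (1 / m)

def chebyshev(p, q):
    # limit of Minkowski as m -> infinity: the largest single difference
    return max(abs(a - b) for a, b in zip(p, q))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))   # 5.0
print(manhattan(p, q))   # 7.0
print(chebyshev(p, q))   # 4.0
```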

Feature Scaling

Importance

Distance-based model sensitive to feature magnitude. Unscaled features distort neighbor selection.

Common Methods

Min-max normalization: rescales to [0,1]. Standardization: zero mean, unit variance. Robust scaling: median and IQR.

Effect on Performance

Proper scaling improves accuracy, reduces bias towards features with larger ranges.

Scaling Method   | Formula                 | Effect
Min-Max          | (x - min) / (max - min) | Scales features to [0, 1]
Standardization  | (x - μ) / σ             | Centers data, unit variance
Robust Scaling   | (x - median) / IQR      | Resilient to outliers
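The first two scaling formulas can be sketched in a few lines; the sample data and function names are illustrative assumptions.

```python
def min_max_scale(values):
    # Rescale a single feature to the range [0, 1].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    # Shift to zero mean and scale to unit (population) variance.
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
    return [(v - mu) / sigma for v in values]

heights_cm = [150.0, 160.0, 170.0, 180.0]
print(min_max_scale(heights_cm))   # roughly [0.0, 0.33, 0.67, 1.0]
print(standardize(heights_cm))     # symmetric about 0
```

In a real pipeline the scaling parameters (min/max or μ/σ) must be computed on the training set only and then applied unchanged to query points, otherwise the distances are inconsistent.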

Classification

Majority Voting

Assign the class label by majority vote among the k neighbors. Ties can be broken by distance-weighted voting or by reducing k (for binary problems, choosing an odd k avoids ties entirely).

Distance-Weighted Voting

Neighbors weighted inversely by distance. Closer neighbors influence decision more strongly.

Evaluation Metrics

Accuracy, precision, recall, F1-score, confusion matrix used to assess classification quality.

Metric     | Definition
Accuracy   | (TP + TN) / total samples
Precision  | TP / (TP + FP)
Recall     | TP / (TP + FN)
F1-Score   | 2 * (Precision * Recall) / (Precision + Recall)
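These definitions translate directly from confusion-matrix counts; the counts below are made-up example numbers.

```python
def classification_metrics(tp, tn, fp, fn):
    # Compute accuracy, precision, recall and F1 from confusion-matrix counts.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(acc, prec, rec, f1)   # accuracy 0.85, recall 0.8
```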

Regression

Prediction Method

Output is average or weighted average of neighbors' continuous values.

Weighting Schemes

Uniform weights or inverse distance weights to emphasize nearer points.

Performance Metrics

Mean Squared Error (MSE), Mean Absolute Error (MAE), R-squared used for evaluation.

Regression prediction:
y(q) = (1/k) * Σ y_i, for i in the k nearest neighbors
or weighted:
y(q) = Σ w_i * y_i / Σ w_i, with w_i = 1 / d(q, x_i)
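Both the uniform and inverse-distance forms above fit in one short function; the name `knn_regress` and the toy 1-D dataset are illustrative assumptions.

```python
import math

def knn_regress(train, query, k=3, weighted=False):
    """`train` is a list of (feature_vector, y) pairs."""
    dists = sorted((math.dist(x, query), y) for x, y in train)
    nearest = dists[:k]
    if not weighted:
        # Uniform weights: plain mean of the k neighbors' targets.
        return sum(y for _, y in nearest) / k
    # Inverse-distance weights; epsilon guards against an exact match (d = 0).
    w = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(wi * y for wi, (_, y) in zip(w, nearest)) / sum(w)

train = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((8.0,), 80.0)]
print(knn_regress(train, (2.5,), k=3))                 # mean of 10, 20, 30 -> 20.0
print(knn_regress(train, (2.5,), k=3, weighted=True))  # pulled toward the closer points
```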

Parameter Selection

Choosing k

Small k: sensitive to noise, overfitting. Large k: smooths boundaries, underfitting. Cross-validation used to optimize k.
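One way to pick k is leave-one-out cross-validation: for each candidate k, classify every point using all the other points and keep the k with the best accuracy. A pure-Python sketch, with an illustrative toy dataset:

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    dists = sorted((math.dist(x, query), y) for x, y in train)
    return Counter(y for _, y in dists[:k]).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: classify each point from all the others."""
    hits = sum(knn_predict(data[:i] + data[i + 1:], x, k) == y
               for i, (x, y) in enumerate(data))
    return hits / len(data)

data = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.3), "A"),
        ((4.0, 4.0), "B"), ((4.2, 3.8), "B"), ((3.9, 4.3), "B")]
best_k = max([1, 3, 5], key=lambda k: loo_accuracy(data, k))
print(best_k, loo_accuracy(data, best_k))
```

On this tiny clean dataset small k already achieves perfect accuracy, while k = 5 fails (each held-out point is outnumbered by the opposite class); on noisy data the trade-off typically favors a moderate k.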

Distance Metric Choice

Depends on data type, dimensionality, correlation. Mahalanobis for correlated features, cosine for text.

Weighting Options

Uniform vs distance-weighted influence. Weighting often improves performance, especially with noisy data.

Advantages and Limitations

Advantages

Simple, intuitive, no training required, flexible decision boundaries, effective with sufficient data, supports multi-class problems.

Limitations

High computational cost at prediction, sensitive to irrelevant/noisy features, requires careful feature scaling, suffers from curse of dimensionality.

Mitigation Strategies

Use dimensionality reduction, feature selection, approximate nearest neighbor algorithms, optimized data structures like KD-trees or ball trees.

Computational Complexity

Training Phase

Essentially O(1) training time beyond storing the dataset (O(n * d) memory); no explicit model building.

Prediction Phase

O(n * d) per query (n = training size, d = dimensionality) for naive search. Accelerated methods reduce this.

Data Structures for Efficiency

KD-trees: efficient for low dimensional data. Ball trees and cover trees for higher dimensions. Approximate nearest neighbor reduces complexity further.

Applications

Pattern Recognition

Handwriting recognition, image classification, speech recognition.

Medical Diagnosis

Predict disease presence, prognosis based on patient features.

Recommendation Systems

User-based collaborative filtering, item similarity detection.

Anomaly Detection

Identify outliers by distance from normal data clusters.

Implementation Details

Data Storage

Stores entire training set in memory. Memory usage scales linearly with dataset size and dimensionality.

Algorithm Optimization

Use efficient nearest neighbor search algorithms, early stopping criteria, dimensionality reduction to speed up predictions.

Software Libraries

Available in scikit-learn (Python), Weka (Java), MATLAB, R (class package), TensorFlow addons.

Extensions and Variants

Radius Neighbors

Considers all neighbors within a fixed radius instead of fixed k.
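A minimal sketch of radius-based voting (the function name `radius_classify` is illustrative); unlike fixed-k, the query may have no neighbors at all, so that case must be handled explicitly:

```python
import math
from collections import Counter

def radius_classify(train, query, radius):
    """Vote among all training points within `radius` of the query."""
    labels = [y for x, y in train if math.dist(x, query) <= radius]
    if not labels:
        return None   # no neighbors in range; the caller must handle this case
    return Counter(labels).most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.5, 1.2), "A"), ((5.0, 5.0), "B")]
print(radius_classify(train, (1.2, 1.1), radius=1.0))   # prints A
```

Radius neighbors adapts naturally to varying data density: dense regions contribute many voters, sparse regions few or none.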

Weighted k-NN

Weights neighbors by inverse distance or kernel functions.

Condensed Nearest Neighbor

Reduces dataset size by removing redundant points.

Fuzzy k-NN

Assigns membership probabilities rather than hard labels.

Adaptive k-NN

Vary k dynamically based on local data density.

References

  • T. M. Cover and P. E. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21-27, 1967.
  • R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2000.
  • L. I. Kuncheva, "Nearest neighbor classifier: Simplicity and effectiveness," Pattern Recognition Letters, vol. 31, no. 6, pp. 654-658, 2010.
  • C. E. Brodley and M. A. Friedl, "Identifying mislabeled training data," Journal of Artificial Intelligence Research, vol. 11, pp. 131-167, 1999.
  • J. Wang, J. Yang, J. L. Huang, and Y. Zhang, "A review on nearest neighbor algorithms," Neural Computing and Applications, vol. 32, no. 4, pp. 1043-1066, 2020.