Introduction
Unsupervised learning discovers hidden structure in unlabeled data. Without target labels, the goal is to find natural groupings, reduce dimensionality, or identify anomalies. Real-world data often lacks labels; unsupervised methods extract value from raw data.
Applications abound: customer segmentation (cluster customers by behavior), gene sequencing (cluster similar genes), anomaly detection (fraud, intrusions), feature extraction (reduce high-dimensional data).
Challenge: without labels, evaluation is subjective. Different algorithms may produce different clusterings, all potentially valid. Requires domain knowledge to assess meaningfulness.
"Most of the world's data is unlabeled. Unsupervised learning taps into this vast resource, discovering patterns humans haven't even defined." -- Yann LeCun
Clustering Algorithms
Goal
Partition data into K clusters so that samples in the same cluster are similar and samples in different clusters are dissimilar. "Similar" and "dissimilar" are defined by a distance metric.
Distance Metrics
Euclidean: sqrt(sum((x_i - y_i)^2))
Manhattan: sum(|x_i - y_i|)
Cosine: 1 - (x . y) / (||x|| ||y||)
Correlation: 1 - Pearson correlation between x and y (equivalent to cosine distance on mean-centered vectors)
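A minimal NumPy sketch of these four metrics (scipy.spatial.distance provides equivalent euclidean, cityblock, cosine, and correlation functions):

```python
import numpy as np

def euclidean(x, y):
    # sqrt(sum((x_i - y_i)^2))
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # sum(|x_i - y_i|)
    return np.sum(np.abs(x - y))

def cosine_distance(x, y):
    # 1 - (x . y) / (||x|| ||y||)
    return 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def correlation_distance(x, y):
    # 1 - Pearson correlation: cosine distance on mean-centered vectors
    return cosine_distance(x - x.mean(), y - y.mean())

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(euclidean(x, y), manhattan(x, y), cosine_distance(x, y), correlation_distance(x, y))
```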
Challenges
- Determining K: Number of clusters unknown. Elbow method, silhouette analysis help.
- Local minima: Different initializations may converge to different clusterings.
- Scalability: Some algorithms O(n^2) or O(n^3); slow for large n.
- Non-spherical clusters: Some algorithms assume spherical, fail on elongated, arbitrary shapes.
K-Means and Partitioning Methods
K-Means Algorithm
1. Initialize K cluster centers randomly
2. Assign each point to nearest center
3. Recompute centers as mean of assigned points
4. Repeat until convergence
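A minimal NumPy sketch of these four steps (scikit-learn's KMeans is the usual production implementation; the random data below is purely illustrative):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize K cluster centers randomly (here: K distinct data points)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each point to the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute centers as the mean of assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # 4. Repeat until convergence (centers stop moving)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(X, k=2)
```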
Properties
Advantages: Fast (linear in n), simple, interpretable, works well when clusters are globular.
Disadvantages: Must specify K, sensitive to initialization, poor with non-convex clusters, assumes equal-sized clusters.
Initialization Strategies
K-means++: Smart initialization: choose the first center randomly, then choose each subsequent center with probability proportional to its squared distance from the nearest existing center. Reduces the chance of poor local minima.
K-Medoids
Similar to K-means but centers are actual data points (medoids) not means. More robust to outliers but slower.
Fuzzy C-Means
Soft clustering: each point has membership degree (0-1) for each cluster. Enables partial membership, uncertainty. Generalizes K-means.
Hierarchical Clustering
Agglomerative (Bottom-Up)
1. Each point starts as own cluster
2. Repeatedly merge closest two clusters
3. Continue until single cluster remains
4. Cut dendrogram at desired height for final clusters
Linkage Criteria
| Linkage | Distance Measure | Effect |
|---|---|---|
| Single | Min distance between clusters | Tends to chain, elongated clusters |
| Complete | Max distance between clusters | Compact clusters, balanced |
| Average | Average distance between clusters | Balanced, less affected by outliers |
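A short sketch of agglomerative clustering with SciPy, assuming average linkage and a cut into two clusters for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 4])
Z = linkage(X, method="average")                  # try "single" or "complete" too
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
# scipy.cluster.hierarchy.dendrogram(Z) plots the full merge tree
```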
Divisive (Top-Down)
Start with a single cluster and recursively split. Rarely used; computationally expensive.
Advantages/Disadvantages
Advantages: No need to specify K up front (choose it later by cutting the dendrogram); the hierarchical structure itself provides insight.
Disadvantages: O(n^2) memory and O(n^2 log n) time, irreversible (bad early merge can't be undone).
Density-Based Clustering (DBSCAN)
Key Idea
Clusters are dense regions separated by sparse regions. Points in dense regions belong to a cluster; isolated points are outliers.
Algorithm
1. For each unvisited point:
a. Find all neighbors within distance epsilon
b. If >= min_points neighbors: start cluster
c. Recursively expand cluster to all density-reachable points
2. Points not in any cluster are outliers
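A hedged sketch using scikit-learn's DBSCAN; eps and min_samples correspond to epsilon and min_points above, and both need tuning per dataset:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 6])
db = DBSCAN(eps=0.8, min_samples=5).fit(X)
labels = db.labels_                                # cluster index per point; -1 = outlier
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```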
Advantages
- Discovers arbitrary-shaped clusters
- Identifies outliers
- Doesn't require specifying K
Disadvantages
- Hyperparameter sensitive (epsilon, min_points)
- O(n^2) worst case
- Struggles with varying density clusters
Dimensionality Reduction
Motivation
High-dimensional data (thousands of features) problematic: curse of dimensionality, computational cost, visualization difficulty, noise overwhelms signal. Reduce to essential dimensions.
Goals
Information Preservation: Retain important variance.
Noise Reduction: Remove noisy dimensions.
Visualization: Project to 2D/3D for plotting.
Downstream Task: Improve classification, clustering.
Principal Component Analysis (PCA)
Algorithm
1. Center data (subtract mean)
2. Compute covariance matrix
3. Find eigenvectors (principal components)
4. Project data onto top K eigenvectors
Result: K-dimensional representation preserving most variance
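A minimal NumPy sketch of these steps (sklearn.decomposition.PCA performs the same computation via SVD):

```python
import numpy as np

def pca(X, k):
    X_centered = X - X.mean(axis=0)                  # 1. center the data
    cov = np.cov(X_centered, rowvar=False)           # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # 3. eigenvectors (ascending order)
    order = np.argsort(eigvals)[::-1]                #    sort by descending variance
    components = eigvecs[:, order[:k]]               #    top-K principal components
    explained = eigvals[order[:k]] / eigvals.sum()   #    variance ratio per PC
    return X_centered @ components, explained        # 4. project onto top K

X = np.random.randn(200, 10)
X_reduced, explained_ratio = pca(X, k=2)
```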
Properties
Orthogonal components: PCs uncorrelated.
Variance explained: each PC explains a decreasing share of the total variance. Choose K to retain 90-95% of the variance.
Interpretability: Each PC is a linear combination of the original features; less interpretable than the originals.
Limitations
Linear method; assumes linear relationships. Non-linear data (e.g., curves, manifolds) need non-linear methods.
Manifold Learning
Assumption
High-dimensional data lies on lower-dimensional manifold (curved surface). Unfold manifold to recover low-dimensional structure.
t-SNE
t-Distributed Stochastic Neighbor Embedding. Preserves local structure: nearby points stay nearby in the embedding, while global distances between far-apart groups are not faithfully preserved. Excellent for visualization.
Drawback: Expensive O(n^2), doesn't scale to large datasets, results vary with random initialization.
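A brief sketch with scikit-learn's TSNE for 2-D visualization; perplexity is the main knob, and results vary with random_state as noted above:

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.randn(300, 50)                       # stand-in for real high-dimensional data
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
# X_2d can now be scatter-plotted, e.g. with matplotlib
```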
UMAP
Uniform Manifold Approximation and Projection. Similar to t-SNE but substantially faster (near-linear scaling in practice) and better at preserving global structure alongside local structure.
Isomap
Uses geodesic distances (along manifold) instead of Euclidean. Recovers low-dimensional structure if manifold smooth.
Autoencoders and Representation Learning
Architecture
Input -> Encoder -> Bottleneck (latent) -> Decoder -> Output
Goal: reconstruct input. Bottleneck forces dimensionality reduction.
Training
Minimize reconstruction error: ||input - output||^2. Learned bottleneck representation is compressed version of input.
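A minimal PyTorch sketch, assuming 784-dimensional inputs (e.g. flattened 28x28 images) and a 32-dimensional bottleneck; the layer sizes and training loop are illustrative:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Input -> Encoder -> Bottleneck (latent) -> Decoder -> Output
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                       # reconstruction error ||input - output||^2

x = torch.randn(64, 784)                     # dummy batch; replace with real data
for _ in range(10):
    loss = loss_fn(model(x), x)              # minimize reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```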
Variants
Variational Autoencoder (VAE): Bottleneck samples from learned distribution. Enables generation of new samples.
Denoising Autoencoder: Input corrupted with noise; learn to reconstruct clean. Learns robust representations.
Contractive Autoencoder: Penalizes large changes in hidden representation for small input changes. Extracts stable features.
Applications
Feature extraction, dimensionality reduction, anomaly detection (reconstruction error), generative modeling.
Anomaly Detection
Approaches
Distance-based: Points far from, or much less dense than, their neighbors are anomalies. Local Outlier Factor (LOF).
Reconstruction-based: High reconstruction error in autoencoder indicates anomaly.
Statistical: Fit distribution; low-probability points are anomalies.
Isolation Forest: Anomalies are isolated by fewer random splits than normal points in an ensemble of randomly built trees.
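A hedged sketch of two of these approaches with scikit-learn (Isolation Forest and LOF), on toy data with two injected anomalies:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

X = np.vstack([np.random.randn(200, 2), [[8.0, 8.0], [-9.0, 7.0]]])

iso_labels = IsolationForest(contamination=0.01, random_state=0).fit_predict(X)
lof_labels = LocalOutlierFactor(n_neighbors=20, contamination=0.01).fit_predict(X)
# In both outputs, -1 marks predicted anomalies and 1 marks normal points
```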
Challenges
Unlabeled anomalies hard to evaluate. Imbalanced (few anomalies). Context-dependent (anomaly in one context normal in another).
Evaluation Metrics for Unsupervised Learning
Internal Metrics (no labels)
Silhouette Score: Measures how similar a point is to its own cluster vs. the nearest other cluster. Range [-1, 1]; higher is better.
Davies-Bouldin Index: Ratio of within to between cluster distances. Lower is better.
Calinski-Harabasz Index: Ratio of between-cluster to within-cluster variance. Higher is better.
External Metrics (ground truth available)
Adjusted Rand Index (ARI): Similarity between predicted and true clustering. Range [-1, 1]; 1 is perfect.
Normalized Mutual Information (NMI): Information shared between predicted and true. Range [0, 1].
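A short sketch computing these metrics with scikit-learn; the external metrics (ARI, NMI) need ground-truth labels, the internal ones do not:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score, adjusted_rand_score,
                             normalized_mutual_info_score)

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
y_true = np.array([0] * 50 + [1] * 50)
y_pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, y_pred))                   # internal, higher is better
print(davies_bouldin_score(X, y_pred))               # internal, lower is better
print(calinski_harabasz_score(X, y_pred))            # internal, higher is better
print(adjusted_rand_score(y_true, y_pred))           # external, 1 is perfect
print(normalized_mutual_info_score(y_true, y_pred))  # external, range [0, 1]
```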
Domain Knowledge
Ultimately, a domain expert assesses meaningfulness: "Do the clusters make sense?"
Applications
Customer Segmentation
Cluster customers by behavior, demographics. Target marketing, personalize recommendations.
Gene Sequencing
Cluster genes by expression patterns. Discover functionally related genes.
Anomaly Detection
Fraud detection (unusual transactions), intrusion detection (unusual network patterns), medical diagnosis (abnormal test results).
Document Clustering
Organize documents by topic. Group similar articles, reduce search space.
Image Compression
K-means clustering on pixel values. Reduce colors in image (e.g., 256 to 8 colors).
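A hedged sketch of this color quantization idea: cluster the pixel colors with K-means and replace each pixel with its cluster center, reducing the palette to K colors (the random array stands in for a real image):

```python
import numpy as np
from sklearn.cluster import KMeans

image = np.random.rand(64, 64, 3)                  # stand-in for a real RGB image
pixels = image.reshape(-1, 3)                      # one row per pixel
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(pixels)
quantized = km.cluster_centers_[km.labels_].reshape(image.shape)  # 8-color image
```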
References
- MacQueen, J. "Some Methods for Classification and Analysis of Multivariate Observations." Proc. 5th Berkeley Symp. Math. Statist. Prob., 1967.
- Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. "A Density-Based Algorithm for Discovering Clusters." KDD, 1996.
- Jolliffe, I. T. "Principal Component Analysis." Springer, 2nd edition, 2002.
- van der Maaten, L., and Hinton, G. E. "Visualizing Data using t-SNE." JMLR, 2008.
- McInnes, L., Healy, J., and Melville, J. "UMAP: Uniform Manifold Approximation and Projection." JOSS, 2018.