Introduction

Unsupervised learning discovers hidden structure in unlabeled data. Without target labels, the goal is to find natural groupings, reduce dimensionality, or identify anomalies. Real-world data often lacks labels; unsupervised methods extract value from raw data.

Applications abound: customer segmentation (cluster customers by behavior), gene expression analysis (cluster genes with similar expression profiles), anomaly detection (fraud, intrusions), feature extraction (reduce high-dimensional data).

Challenge: without labels, evaluation is subjective. Different algorithms may produce different clusterings, all potentially valid. Assessing meaningfulness requires domain knowledge.

"Most of the world's data is unlabeled. Unsupervised learning taps into this vast resource, discovering patterns humans haven't even defined." -- Yann LeCun

Clustering Algorithms

Goal

Partition data into K clusters such that samples in the same cluster are similar and samples in different clusters are dissimilar. "Similar" and "dissimilar" are defined by a distance metric.

Distance Metrics

Euclidean: sqrt(sum((x_i - y_i)^2))
Manhattan: sum(|x_i - y_i|)
Cosine: 1 - (x . y) / (||x|| ||y||)
Correlation: 1 - Pearson correlation coefficient between x and y
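
A minimal NumPy sketch of these metrics (the vectors are illustrative; SciPy's spatial.distance module offers equivalent functions):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 4.0, 6.0])

    euclidean = np.sqrt(np.sum((x - y) ** 2))   # L2 distance
    manhattan = np.sum(np.abs(x - y))           # L1 distance
    cosine = 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    # Correlation distance = cosine distance of the mean-centered vectors
    xc, yc = x - x.mean(), y - y.mean()
    correlation = 1 - np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

Here y is a scaled copy of x, so both cosine and correlation distance come out to 0 while Euclidean and Manhattan do not: angle-based metrics ignore magnitude.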

Challenges

  • Determining K: Number of clusters unknown. Elbow method, silhouette analysis help.
  • Local minima: Different initializations may converge to different clusterings.
  • Scalability: Some algorithms are O(n^2) or O(n^3); slow for large n.
  • Non-spherical clusters: Some algorithms assume spherical clusters and fail on elongated or arbitrarily shaped ones.

K-Means and Partitioning Methods

K-Means Algorithm

1. Initialize K cluster centers randomly
2. Assign each point to nearest center
3. Recompute centers as mean of assigned points
4. Repeat until convergence
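
A minimal sketch with scikit-learn (the synthetic blobs are illustrative; the init and n_init settings anticipate the initialization strategies discussed below):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Synthetic data: 3 globular clusters (illustrative only)
    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # init="k-means++" is the smart initialization described under
    # Initialization Strategies; n_init restarts the algorithm several
    # times and keeps the best run, mitigating initialization sensitivity.
    km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
    labels = km.fit_predict(X)
    print(km.cluster_centers_)   # final centers
    print(km.inertia_)           # within-cluster sum of squares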

Properties

Advantages: Fast (linear in n), simple, interpretable, works well when clusters are globular.

Disadvantages: Must specify K, sensitive to initialization, poor with non-convex clusters, assumes equal-sized clusters.

Initialization Strategies

K-means++: Smart initialization: choose the first center randomly, then choose each subsequent center with probability proportional to its squared distance from the nearest existing center. Reduces the chance of converging to poor local minima.

K-Medoids

Similar to K-means but centers are actual data points (medoids) not means. More robust to outliers but slower.

Fuzzy C-Means

Soft clustering: each point has membership degree (0-1) for each cluster. Enables partial membership, uncertainty. Generalizes K-means.

Hierarchical Clustering

Agglomerative (Bottom-Up)

1. Each point starts as own cluster
2. Repeatedly merge closest two clusters
3. Continue until single cluster remains
4. Cut dendrogram at desired height for final clusters
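
A minimal sketch with SciPy (synthetic data is illustrative; average linkage is one of the criteria in the table below):

    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

    # Build the full merge tree (dendrogram) with average linkage;
    # scipy.cluster.hierarchy.dendrogram(Z) can plot it.
    Z = linkage(X, method="average")

    # "Cut" the dendrogram to obtain a flat clustering with 3 clusters
    labels = fcluster(Z, t=3, criterion="maxclust")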

Linkage Criteria

Linkage    Distance measure                     Effect
Single     Min distance between clusters        Tends to chain; elongated clusters
Complete   Max distance between clusters        Compact clusters, balanced
Average    Average distance between clusters    Balanced; less affected by outliers

Divisive (Top-Down)

Start with a single cluster and recursively split. Rarely used; computationally expensive.

Advantages/Disadvantages

Advantages: No need to specify K up front (choose it later from the dendrogram); the hierarchical structure itself provides insight.

Disadvantages: O(n^2) memory and O(n^2 log n) time; greedy and irreversible (a bad early merge can't be undone).

Density-Based Clustering (DBSCAN)

Key Idea

Clusters are dense regions separated by sparse regions. Points in dense regions belong to clusters; isolated points are outliers.

Algorithm

1. For each unvisited point:
   a. Find all neighbors within distance epsilon
   b. If >= min_points neighbors: start a cluster
   c. Recursively expand the cluster to all density-reachable points
2. Points not in any cluster are outliers
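
A minimal sketch with scikit-learn on the classic two-moons dataset, a non-spherical shape K-means handles poorly (eps and min_samples are illustrative; they typically need tuning):

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    # Two interleaved half-moons: arbitrary-shaped clusters
    X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

    # eps = epsilon neighborhood radius; min_samples = min_points
    db = DBSCAN(eps=0.3, min_samples=5).fit(X)
    labels = db.labels_          # outliers are labeled -1
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)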

Advantages

  • Discovers arbitrary-shaped clusters
  • Identifies outliers
  • Doesn't require specifying K

Disadvantages

  • Sensitive to hyperparameters (epsilon, min_points)
  • O(n^2) worst case
  • Struggles with varying density clusters

Dimensionality Reduction

Motivation

High-dimensional data (thousands of features) is problematic: curse of dimensionality, computational cost, visualization difficulty, noise overwhelming signal. Reduce to the essential dimensions.

Goals

Information Preservation: Retain important variance.

Noise Reduction: Remove noisy dimensions.

Visualization: Project to 2D/3D for plotting.

Downstream Task: Improve classification, clustering.

Principal Component Analysis (PCA)

Algorithm

1. Center data (subtract mean)
2. Compute covariance matrix
3. Find eigenvectors (principal components)
4. Project data onto top K eigenvectors
Result: K-dimensional representation preserving most variance
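
A minimal NumPy sketch following these four steps exactly (the random data is a stand-in; in practice one would also inspect the explained-variance ratios when choosing K):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))      # toy data: 200 samples, 10 features

    # 1. Center the data
    Xc = X - X.mean(axis=0)
    # 2. Covariance matrix (features x features)
    C = np.cov(Xc, rowvar=False)
    # 3. Eigenvectors, sorted by descending eigenvalue
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 4. Project onto the top K components
    K = 2
    X_reduced = Xc @ eigvecs[:, :K]

    # Proportion of total variance each PC explains
    explained = eigvals / eigvals.sum()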

Properties

Orthogonal components: PCs uncorrelated.

Variance explained: each successive PC explains a decreasing share of the total variance. Choose K to retain 90-95% of cumulative variance.

Interpretability: Each PC is a linear combination of the original features; less interpretable than the originals.

Limitations

Linear method; assumes linear relationships. Non-linear structure (e.g., curves, manifolds) requires non-linear methods.

Manifold Learning

Assumption

High-dimensional data lies on lower-dimensional manifold (curved surface). Unfold manifold to recover low-dimensional structure.

t-SNE

t-Distributed Stochastic Neighbor Embedding. Preserves local structure: nearby points stay nearby. Global distances between well-separated groups are not faithfully preserved. Excellent for visualization.

Drawbacks: exact t-SNE is O(n^2) and doesn't scale to large datasets (the Barnes-Hut approximation reduces this to O(n log n)); results vary with random initialization and perplexity.
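
A minimal sketch with scikit-learn on the digits dataset (the perplexity value is the library default, shown explicitly because it is the main knob):

    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    X, y = load_digits(return_X_y=True)   # 64-dimensional digit images

    # perplexity roughly controls how large a neighborhood counts as "local"
    emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
    # emb has shape (n_samples, 2): suitable for a scatter plot colored by y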

UMAP

Uniform Manifold Approximation and Projection. Similar in spirit to t-SNE but substantially faster in practice (empirically near-linear scaling), and it tends to preserve global structure better.

Isomap

Uses geodesic distances (measured along the manifold) instead of straight-line Euclidean distances. Recovers the low-dimensional structure if the manifold is smooth and well sampled.

Autoencoders and Representation Learning

Architecture

Input -> Encoder -> Bottleneck (latent) -> Decoder -> Output

Goal: reconstruct input. Bottleneck forces dimensionality reduction.

Training

Minimize reconstruction error: ||input - output||^2. Learned bottleneck representation is compressed version of input.
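
A minimal PyTorch sketch of this training loop. The 784-dim input (e.g., flattened 28x28 images), 32-dim bottleneck, and random batch are illustrative assumptions, not a prescribed architecture:

    import torch
    import torch.nn as nn

    class Autoencoder(nn.Module):
        def __init__(self, in_dim=784, latent_dim=32):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                         nn.Linear(128, latent_dim))
            self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                         nn.Linear(128, in_dim))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = Autoencoder()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()             # ||input - output||^2

    x = torch.rand(64, 784)            # placeholder batch; use real data
    for _ in range(100):
        opt.zero_grad()
        loss = loss_fn(model(x), x)    # target is the input itself
        loss.backward()
        opt.step()

After training, model.encoder(x) yields the compressed bottleneck representation for downstream use.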

Variants

Variational Autoencoder (VAE): Bottleneck samples from learned distribution. Enables generation of new samples.

Denoising Autoencoder: Input corrupted with noise; learn to reconstruct clean. Learns robust representations.

Contractive Autoencoder: Penalizes large changes in hidden representation for small input changes. Extracts stable features.

Applications

Feature extraction, dimensionality reduction, anomaly detection (reconstruction error), generative modeling.

Anomaly Detection

Approaches

Distance- and density-based: points far from their neighbors, or in much sparser regions than their neighbors, are anomalies. Example: Local Outlier Factor (LOF).

Reconstruction-based: High reconstruction error in autoencoder indicates anomaly.

Statistical: Fit distribution; low-probability points are anomalies.

Isolation Forest: anomalies can be isolated with fewer random splits than normal points in an ensemble of randomly built partition trees.
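
A minimal scikit-learn sketch comparing two of these approaches (the injected uniform outliers and the contamination guess are illustrative):

    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 2))                        # "normal" points
    X_out = rng.uniform(low=-6, high=6, size=(15, 2))    # injected outliers
    X_all = np.vstack([X, X_out])

    # contamination = expected fraction of anomalies (a guess in practice)
    iso = IsolationForest(contamination=0.05, random_state=0).fit(X_all)
    lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)

    iso_labels = iso.predict(X_all)      # -1 = anomaly, 1 = normal
    lof_labels = lof.fit_predict(X_all)  # same convention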

Challenges

Unlabeled anomalies are hard to evaluate. Data is heavily imbalanced (few anomalies). Anomalies are context-dependent (an anomaly in one context is normal in another).

Evaluation Metrics for Unsupervised Learning

Internal Metrics (no labels)

Silhouette Score: Measures how similar a point is to its own cluster vs. other clusters. Range [-1, 1]; higher is better.

Davies-Bouldin Index: Average ratio of within-cluster scatter to between-cluster separation. Lower is better.

Calinski-Harabasz Index: Ratio of between-cluster to within-cluster variance. Higher is better.
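
All three are available in scikit-learn; a minimal sketch (synthetic blobs are illustrative):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                                 calinski_harabasz_score)

    X, _ = make_blobs(n_samples=300, centers=4, random_state=1)
    labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)

    print(silhouette_score(X, labels))          # higher is better, in [-1, 1]
    print(davies_bouldin_score(X, labels))      # lower is better
    print(calinski_harabasz_score(X, labels))   # higher is better

Sweeping n_clusters and comparing these scores is a common way to choose K.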

External Metrics (ground truth available)

Adjusted Rand Index (ARI): Similarity between predicted and true clustering. Range [-1, 1]; 1 is perfect.

Normalized Mutual Information (NMI): Information shared between predicted and true. Range [0, 1].
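
A minimal sketch (the toy labels are illustrative); note that both metrics ignore the arbitrary numbering of clusters:

    from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

    true_labels = [0, 0, 1, 1, 2, 2]
    pred_labels = [1, 1, 0, 0, 2, 2]   # same grouping, permuted cluster IDs

    print(adjusted_rand_score(true_labels, pred_labels))            # 1.0
    print(normalized_mutual_info_score(true_labels, pred_labels))   # 1.0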

Domain Knowledge

Ultimately, assess meaningfulness by domain expert: "Do the clusters make sense?"

Applications

Customer Segmentation

Cluster customers by behavior and demographics. Enables targeted marketing and personalized recommendations.

Gene Expression Analysis

Cluster genes by expression patterns. Discover functionally related genes.

Anomaly Detection

Fraud detection (unusual transactions), intrusion detection (unusual network patterns), medical diagnosis (abnormal test results).

Document Clustering

Organize documents by topic. Group similar articles, reduce search space.

Image Compression

K-means clustering on pixel values. Reduce colors in image (e.g., 256 to 8 colors).
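
A minimal sketch of this color quantization (a random image stands in for a real one, which would be loaded with an imaging library):

    import numpy as np
    from sklearn.cluster import KMeans

    # image: (H, W, 3) uint8 array; random stand-in for a real photo
    image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)

    pixels = image.reshape(-1, 3).astype(float)   # one row per pixel
    km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(pixels)

    # Replace each pixel with its cluster's centroid color -> 8-color image
    compressed = km.cluster_centers_[km.labels_].round().astype(np.uint8)
    compressed = compressed.reshape(image.shape)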

References

  • MacQueen, J. "Some Methods for Classification and Analysis of Multivariate Observations." Proc. 5th Berkeley Symp. Math. Statist. Prob., 1967.
  • Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. "A Density-Based Algorithm for Discovering Clusters." KDD, 1996.
  • Jolliffe, I. T. "Principal Component Analysis." Springer, 2nd edition, 2002.
  • van der Maaten, L., and Hinton, G. E. "Visualizing Data using t-SNE." JMLR, 2008.
  • McInnes, L., Healy, J., and Melville, J. "UMAP: Uniform Manifold Approximation and Projection." JOSS, 2018.