Introduction
Deep learning refers to neural networks with multiple hidden layers. The "deep" aspect—multiple layers—enables learning hierarchical representations: early layers learn simple patterns (edges, textures), middle layers combine these into complex patterns (shapes, objects), and deep layers learn semantic concepts (categories, meanings).
The deep learning revolution began around 2012 with AlexNet winning the ImageNet challenge. Since then, the field has seen unprecedented progress in computer vision, natural language processing, game playing, and scientific discovery. Modern large language models (GPT-3, GPT-4) demonstrate capabilities that emerge from scaling.
Core insight: with sufficient data and computation, deep networks learn better representations than hand-crafted features. Backpropagation efficiently trains these networks. GPU acceleration makes training feasible.
"Deep learning is the most powerful tool we have for learning from data. Multiple layers enable learning of abstractions that no shallow method can capture." -- Yann LeCun, Facebook AI Research
Neural Network Foundations
Neuron Model
output = activation(sum(w_i * input_i) + b)
Weighted sum of inputs plus bias, passed through non-linearity
Mimics biological neuron's firing behavior
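The neuron model above can be sketched directly in NumPy. This is a minimal illustration; the function name `neuron` and the particular weights are ours, not a standard API.

```python
import numpy as np

def neuron(inputs, weights, bias, activation=np.tanh):
    """Weighted sum of inputs plus bias, passed through a non-linearity."""
    return activation(np.dot(weights, inputs) + bias)

x = np.array([1.0, 2.0])      # two input features
w = np.array([0.5, -0.25])    # one weight per input
b = 0.1
y = neuron(x, w, b)           # tanh(0.5*1 - 0.25*2 + 0.1) = tanh(0.1)
```

A layer is just many such neurons evaluated at once: stacking the weight vectors into a matrix turns the dot product into a matrix multiplication.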
Layer Structure
Input Layer: raw features
Hidden Layers: learned representations
Output Layer: predictions
Each layer: linear transformation (matrix multiplication) + non-linearity
Activation Functions
| Function | Formula | Properties |
|---|---|---|
| ReLU | max(0, x) | Fast, popular, enables deep networks |
| Sigmoid | 1/(1+e^-x) | Smooth, probabilistic output, vanishing gradient |
| Tanh | (e^x - e^-x)/(e^x + e^-x) | Smooth, zero-centered, vanishing gradient |
| Softmax | e^x_i / sum(e^x_j) | Output probability distribution, classification |
Universal Approximation
Theorem: a feedforward network with a single hidden layer and sufficiently many hidden units can approximate any continuous function on a compact domain. Multi-layer networks have the same theoretical capacity but learn more efficiently: hierarchical structure can represent many functions with far fewer parameters.
Backpropagation and Training
Core Algorithm
Forward Pass: compute predictions through all layers
Loss Computation: measure error
Backward Pass: compute gradients of loss w.r.t. all weights (chain rule)
Update: weights -= learning_rate * gradients
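The four steps can be demonstrated end-to-end on the simplest possible model, a one-parameter linear fit. A minimal sketch, not a real training pipeline; the data and learning rate are ours.

```python
import numpy as np

# Fit y = w*x + b to noiseless data generated with w=3, b=1,
# so gradient descent should drive the loss toward zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 1.0

w, b, lr = 0.0, 0.0, 0.1
for step in range(200):
    pred = w * X[:, 0] + b                        # 1. forward pass
    loss = np.mean((pred - y) ** 2)               # 2. loss computation
    grad_w = 2 * np.mean((pred - y) * X[:, 0])    # 3. backward pass (chain rule)
    grad_b = 2 * np.mean(pred - y)
    w -= lr * grad_w                              # 4. update
    b -= lr * grad_b
```

In a deep network, step 3 is what backpropagation automates: the same chain-rule computation, applied layer by layer from the output back to the input.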
Chain Rule for Gradients
Backpropagation applies chain rule from output layer to input layer. Each layer's gradient depends on upstream (toward output) gradient multiplied by local gradient.
Computational Efficiency
Backpropagation: O(N) computation for N weights (one forward pass + one backward pass). Naive approach: O(N^2) computing each gradient separately. Dynamic programming reuses intermediate values.
Gradient Flow
Vanishing Gradients: In deep networks, gradients flowing to early layers become exponentially small (a product of many numbers smaller than one). This limits how deep a network can be trained.
Exploding Gradients: Conversely, gradients can grow exponentially. Solution: gradient clipping.
Solutions: ReLU activation (avoids saturation), batch normalization, skip connections (ResNets).
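Gradient clipping, mentioned above for exploding gradients, is simple to sketch. This implements clipping by global norm (one common scheme); the function name is ours.

```python
import numpy as np

def clip_by_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their combined L2 norm
    does not exceed max_norm; gradients below the threshold pass through."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0])]              # global norm 5
clipped = clip_by_norm(grads, max_norm=1.0) # rescaled to norm ~1
```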
Deep Network Architectures
Architecture Components
- Dense/Fully-Connected Layers: All inputs connected to all outputs. High parameter count.
- Convolutional Layers: Local connectivity, weight sharing. Efficient for spatial data (images).
- Recurrent Layers: Connections across time. For sequential data.
- Attention Mechanisms: Weight important features. Enables long-range dependencies.
- Normalization: Batch norm, layer norm. Stabilizes training.
Modern Architectures
ResNet: Skip connections. Enables very deep networks (50-152 layers).
Inception/GoogLeNet: Multi-scale parallel branches. Efficient feature extraction.
DenseNet: Dense connections (each layer receives inputs from all previous layers). Feature reuse, efficient gradient flow.
EfficientNet: Compound scaling of depth, width, resolution. Optimal accuracy-efficiency trade-off.
Vision Transformer: Applies transformer architecture to images. Patches as sequence. State-of-the-art on large datasets.
Convolutional Neural Networks (CNNs)
Convolutional Layer
Applies learnable filters to local patches of input. Filters detect patterns (edges, corners, textures). Parameter sharing across spatial positions.
3x3 filter applied to every 3x3 patch in input
Same filter detects patterns regardless of position
Output: feature map showing where patterns occur
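Sliding one shared filter over every patch can be written as a few explicit loops. A minimal "valid" cross-correlation sketch (real libraries use far faster implementations); the hand-built vertical-edge filter is ours.

```python
import numpy as np

def conv2d(image, kernel):
    """Apply one filter to every patch of the input (no padding, stride 1).
    The same weights are reused at every position (weight sharing)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A vertical-edge filter responds where intensity changes left-to-right.
img = np.zeros((5, 5))
img[:, 2:] = 1.0   # dark left half, bright right half
edge = np.array([[-1., 0., 1.],
                 [-1., 0., 1.],
                 [-1., 0., 1.]])
fmap = conv2d(img, edge)   # large values where the edge sits, zero elsewhere
```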
Pooling
Downsampling: max pooling takes the maximum value in each local window. Reduces dimensionality and provides a degree of translation invariance.
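Max pooling is a few lines on top of the same patch-sliding idea. A sketch assuming non-overlapping windows (stride equal to the window size):

```python
import numpy as np

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: keep the maximum of each window."""
    out = np.zeros((fmap.shape[0] // size, fmap.shape[1] // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 0., 1., 0.],
              [0., 9., 0., 0.]])
pooled = max_pool(x)   # each 2x2 window reduced to its maximum
```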
CNN Architecture Flow
Input Image
-> Conv layer (detect features)
-> ReLU (non-linearity)
-> Pool (downsample)
-> Conv layer (detect combinations)
-> Pool
-> ...
-> Fully-connected layers (classification)
Why CNNs Work for Images
- Local connectivity matches image locality (nearby pixels related)
- Weight sharing captures translation invariance (same object anywhere)
- Hierarchy: low layers detect primitives, high layers detect objects
Recurrent Neural Networks (RNNs)
Sequence Processing
Process sequences (text, audio, time series) one element at a time. Hidden state carries information from previous elements.
h_t = activation(W_h * h_{t-1} + W_x * x_t + b)
output_t = W_o * h_t + b_o
Hidden state h_t summarizes past sequence
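The recurrence above can be unrolled over a short random sequence. A minimal sketch; the tanh non-linearity, the small random weights, and the function name `rnn_step` are our choices.

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One recurrent step: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

rng = np.random.default_rng(0)
hidden_dim, input_dim = 4, 3
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                 # initial hidden state
sequence = [rng.normal(size=input_dim) for _ in range(5)]
for x_t in sequence:                     # hidden state carries context forward
    h = rnn_step(h, x_t, W_h, W_x, b)
```

Note that the same weights `W_h` and `W_x` are applied at every time step, which is what lets the network handle sequences of any length.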
LSTM and GRU
LSTM (Long Short-Term Memory): Gating mechanisms control information flow. Solves vanishing gradients. Can remember long-range dependencies.
GRU (Gated Recurrent Unit): Simplified LSTM. Same benefits, fewer parameters.
Bidirectional Processing
Process sequence forward and backward, combine outputs. Captures full context (both past and future).
Transformer Architecture
Self-Attention
Core mechanism: each element attends to all other elements. Learns which elements to focus on.
Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V
Q (Query): what am I looking for?
K (Key): what information do I have?
V (Value): what information to use?
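The attention formula maps almost line-for-line into NumPy. A single-head sketch without masking or batching; real transformers add both.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights          # values mixed by attention weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 queries of dimension 8
K = rng.normal(size=(5, 8))   # 5 keys
V = rng.normal(size=(5, 8))   # 5 values
out, w = attention(Q, K, V)   # one output vector per query
```

The sqrt(d) scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into near-one-hot saturation.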
Transformer Layer
Multi-head self-attention: parallel attention mechanisms
Feed-forward network: two dense layers with non-linearity
Residual connections: skip connections enable deep stacking
Layer normalization: stabilizes training
Why Transformers Dominate
- Parallelizable training (unlike RNNs which are sequential)
- Captures long-range dependencies well (no vanishing gradient)
- Flexible to different modalities (text, images, audio)
Optimization Techniques
Learning Rate Scheduling
Start with high learning rate (fast initial progress), decay during training (fine-tuning). Cosine annealing, step decay common.
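Cosine annealing is one closed-form schedule. A sketch; the particular `lr_max`/`lr_min` values are illustrative.

```python
import math

def cosine_lr(step, total_steps, lr_max=0.1, lr_min=0.001):
    """Cosine annealing: decay smoothly from lr_max to lr_min."""
    t = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

schedule = [cosine_lr(s, 100) for s in range(101)]
# starts at lr_max, ends at lr_min, decreasing throughout
```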
Batch Normalization
Normalize layer inputs to zero-mean, unit-variance. Stabilizes training, allows higher learning rates, acts as regularizer.
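The normalization step is straightforward to sketch for the training case (at inference, running statistics replace the batch statistics, which this sketch omits). `gamma` and `beta` are the learnable scale and shift.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature to zero mean / unit variance over the batch,
    then apply a learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 10))  # shifted, scaled batch
y = batch_norm(x)                                  # zero-mean, unit-variance
```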
Gradient Accumulation
Simulate larger batch size with limited memory: backprop through mini-batches without updating, then update after K mini-batches.
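The accumulate-then-update pattern can be shown on the same toy linear model used for plain gradient descent. A sketch under our own setup (one parameter, K=4 mini-batches of 8); a real implementation would accumulate per-weight gradient tensors.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 1))
y = 2.0 * X[:, 0]            # ground truth: w = 2
w, lr, K = 0.0, 0.1, 4

def grad_fn(w, xb, yb):
    """Gradient of mean squared error for the model pred = w * x."""
    pred = w * xb
    return 2 * np.mean((pred - yb) * xb)

for epoch in range(100):
    accum, count = 0.0, 0
    for i in range(0, len(X), 8):                  # mini-batches of 8
        accum += grad_fn(w, X[i:i+8, 0], y[i:i+8])  # backprop, no update
        count += 1
        if count == K:                             # update after K batches
            w -= lr * (accum / K)                  # averaged gradient ~ large batch
            accum, count = 0.0, 0
```

Only one mini-batch of activations is in memory at a time, yet each update uses gradient information from K * 8 = 32 examples.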
Mixed Precision Training
Use float16 for forward/backward passes (faster, less memory), float32 for weight updates (more stable). Speedup 2-3x with minimal accuracy loss.
Distributed Training
Train on multiple GPUs/TPUs. Data parallelism: different GPUs process different data, synchronize gradients. Enables larger batch sizes, faster training.
Regularization in Deep Learning
Dropout
Randomly zero activations during training (probability p, typically 0.5). Prevents co-adaptation. At test time, scale activations by (1-p).
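A sketch of the classic formulation described above (zero during training, scale by 1-p at test time); many libraries instead use "inverted" dropout, which scales by 1/(1-p) during training and leaves inference untouched.

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Classic dropout: zero each activation with probability p while
    training; at test time scale by (1 - p) to match expected magnitude."""
    if not training:
        return x * (1.0 - p)
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p   # keep each unit with probability 1 - p
    return x * mask

rng = np.random.default_rng(0)
x = np.ones(10_000)
train_out = dropout(x, p=0.5, rng=rng)        # roughly half the units zeroed
test_out = dropout(x, p=0.5, training=False)  # all units scaled to 0.5
```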
Data Augmentation
Transform training data (rotations, crops, color shifts). Increases effective dataset size. Critical for limited data.
Early Stopping
Monitor validation loss. Stop training when validation loss plateaus/increases. Prevents overfitting without explicit regularizer.
Weight Decay
Add L2 penalty to loss (or equivalent: decay weights each update). Prevents large weights.
Label Smoothing
Soften hard targets (1, 0) to soft targets (0.9, 0.1). Prevents overconfidence, improves generalization.
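One common formulation mixes the one-hot target with the uniform distribution over classes; the smoothing amount `eps=0.1` below is a typical but arbitrary choice.

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: target = (1 - eps) * one_hot + eps / num_classes."""
    k = one_hot.shape[-1]
    return (1.0 - eps) * one_hot + eps / k

y = np.array([0.0, 0.0, 1.0, 0.0])
y_smooth = smooth_labels(y)   # [0.025, 0.025, 0.925, 0.025], still sums to 1
```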
Applications and Breakthroughs
Computer Vision
Image classification, object detection, semantic segmentation. Accuracy exceeds human performance on many tasks.
Natural Language Processing
Machine translation, question answering, text generation. Large language models (GPT) achieve remarkable language understanding.
Game Playing
AlphaGo (CNNs + MCTS) defeated world Go champion. AlphaZero (self-play RL) learned chess, shogi, Go from scratch.
Scientific Discovery
AlphaFold predicts 3D protein structures from amino acid sequences. Drug discovery, materials design, physics simulations.
Autonomous Systems
Self-driving cars, robotics. Deep learning processes sensor data; reinforcement learning learns control policies.
Challenges and Limitations
Data Efficiency
Deep networks need massive training data. Hundreds of thousands to millions of examples. Limited by annotation cost, privacy, availability.
Interpretability
Deep networks are "black boxes"—difficult to understand why predictions made. Attention mechanisms and saliency maps provide some interpretability.
Robustness
Adversarial examples: tiny perturbations fool networks. Adversarial training, certified defenses help but incomplete.
Computational Cost
Training large models requires massive GPU/TPU resources, energy consumption. Environmental cost, accessibility barrier.
Generalization
Models often don't transfer well across domains. Distribution shift causes performance degradation.
Overfitting
Many parameters, complex models easily overfit. Requires careful regularization, validation.
References
- Goodfellow, I., Bengio, Y., and Courville, A. "Deep Learning." MIT Press, 2016.
- Krizhevsky, A., Sutskever, I., and Hinton, G. E. "ImageNet Classification with Deep Convolutional Neural Networks." NIPS, 2012.
- He, K., Zhang, X., Ren, S., and Sun, J. "Deep Residual Learning for Image Recognition." IEEE CVPR, 2016.
- Vaswani, A., Shazeer, N., Parmar, N., et al. "Attention Is All You Need." NIPS, 2017.
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. "An Image is Worth 16x16 Words." ICLR, 2021.