Introduction

Deep learning refers to neural networks with multiple hidden layers. The "deep" aspect—multiple layers—enables learning hierarchical representations: early layers learn simple patterns (edges, textures), middle layers combine these into complex patterns (shapes, objects), and deep layers learn semantic concepts (categories, meanings).

The deep learning revolution began around 2012 with AlexNet winning the ImageNet challenge. Since then: unprecedented progress in computer vision, natural language processing, game playing, scientific discovery. Modern large language models (GPT-3, GPT-4) demonstrate emergent capabilities from scaling.

Core insight: with sufficient data and computation, deep networks learn better representations than hand-crafted features. Backpropagation efficiently trains these networks. GPU acceleration makes training feasible.

"Deep learning is the most powerful tool we have for learning from data. Multiple layers enable learning of abstractions that no shallow method can capture." -- Yann LeCun, Facebook AI Research

Neural Network Foundations

Neuron Model

output = activation(sum(w_i * input_i) + b)

Weighted sum of inputs plus bias, passed through non-linearity
Mimics biological neuron's firing behavior
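The neuron model above can be sketched directly in NumPy; the weights and inputs here are illustrative values, not learned ones:

```python
import numpy as np

def neuron(inputs, weights, bias, activation):
    """Weighted sum of inputs plus bias, passed through a non-linearity."""
    return activation(np.dot(weights, inputs) + bias)

relu = lambda z: np.maximum(0.0, z)

# Example: 3 inputs with hand-picked (illustrative) weights
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 0.25])
b = 0.5
y = neuron(x, w, b, relu)   # weighted sum is -0.25, ReLU clamps it to 0.0
```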

Layer Structure

Input Layer: raw features
Hidden Layers: learned representations
Output Layer: predictions

Each layer: linear transformation (matrix multiplication) + non-linearity

Activation Functions

Function   Formula                      Properties
ReLU       max(0, x)                    Fast, popular, enables deep networks
Sigmoid    1/(1+e^-x)                   Smooth, probabilistic output, vanishing gradient
Tanh       (e^x - e^-x)/(e^x + e^-x)    Smooth, zero-centered, vanishing gradient
Softmax    e^x_i / sum_j(e^x_j)         Outputs probability distribution, classification
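The four activations in the table are one-liners in NumPy (the softmax subtracts the max before exponentiating, a standard numerical-stability trick):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / e.sum()

x = np.array([-2.0, 0.0, 2.0])
```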

Universal Approximation

Theorem: a feed-forward network with a single hidden layer and sufficiently many hidden units can approximate any continuous function on a compact domain to arbitrary accuracy. Multi-layer networks have the same theoretical capacity but learn more efficiently in practice (hierarchical structure reduces the parameters needed).

Backpropagation and Training

Core Algorithm

Forward Pass: compute predictions through all layers
Loss Computation: measure error
Backward Pass: compute gradients of loss w.r.t. all weights (chain rule)
Update: weights -= learning_rate * gradients
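The four steps above can be sketched on the simplest possible model, 1-D linear regression, where the gradients are easy to derive by hand (the data and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 1.0                    # true weight 3, bias 1 (illustrative)

w, b, lr = 0.0, 0.0, 0.1
for _ in range(200):
    pred = w * X[:, 0] + b                 # forward pass
    err = pred - y
    loss = np.mean(err ** 2)               # loss computation (MSE)
    grad_w = 2.0 * np.mean(err * X[:, 0])  # backward pass (chain rule)
    grad_b = 2.0 * np.mean(err)
    w -= lr * grad_w                       # update
    b -= lr * grad_b
```

After 200 updates, w and b recover the true values 3 and 1.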

Chain Rule for Gradients

Backpropagation applies chain rule from output layer to input layer. Each layer's gradient depends on upstream (toward output) gradient multiplied by local gradient.

Computational Efficiency

Backpropagation: O(N) computation for N weights (one forward pass + one backward pass). Naive approach: O(N^2) computing each gradient separately. Dynamic programming reuses intermediate values.

Gradient Flow

Vanishing Gradients: In deep networks, gradients reaching early layers become exponentially small (product of many small numbers). Limits deep network training.

Exploding Gradients: Conversely, gradients can grow exponentially. Solution: gradient clipping.

Solutions: ReLU activation (avoids saturation), batch normalization, skip connections (ResNets).
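Why ReLU helps can be seen numerically: the backward pass multiplies one activation derivative per layer. The sigmoid's derivative is at most 0.25, so 30 such factors shrink toward zero, while ReLU's derivative of 1 (on positive inputs) passes gradients through unchanged:

```python
import numpy as np

depth = 30
# Best case for sigmoid: sigmoid'(0) = 0.25 at every layer
sigmoid_grad = np.prod([0.25] * depth)
# ReLU derivative is 1 wherever the unit is active
relu_grad = np.prod([1.0] * depth)
# sigmoid_grad is ~1e-18: the signal effectively vanishes
```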

Deep Network Architectures

Architecture Components

  • Dense/Fully-Connected Layers: All inputs connected to all outputs. High parameter count.
  • Convolutional Layers: Local connectivity, weight sharing. Efficient for spatial data (images).
  • Recurrent Layers: Connections across time. For sequential data.
  • Attention Mechanisms: Weight important features. Enables long-range dependencies.
  • Normalization: Batch norm, layer norm. Stabilizes training.

Modern Architectures

ResNet: Skip connections. Enables very deep networks (50-152 layers).
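A residual block is just "identity plus learned correction"; because of the addition, gradients flow through the skip path unattenuated. A minimal sketch (weights here are placeholders, not learned values):

```python
import numpy as np

def residual_block(x, W1, W2):
    h = np.maximum(0.0, W1 @ x)   # inner transform + ReLU
    return x + W2 @ h             # skip connection: identity + residual

x = np.ones(4)
W1 = np.zeros((4, 4))
W2 = np.zeros((4, 4))
y = residual_block(x, W1, W2)     # with zero weights, output equals input
```

With all-zero weights the block is exactly the identity, which is why very deep ResNets remain trainable: an unneeded block can do nothing without harming the signal.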

Inception/GoogLeNet: Multi-scale parallel branches. Efficient feature extraction.

DenseNet: Dense connections (each layer inputs from all previous). Feature reuse, efficient gradient flow.

EfficientNet: Compound scaling of depth, width, resolution. Optimal accuracy-efficiency trade-off.

Vision Transformer: Applies transformer architecture to images. Patches as sequence. State-of-the-art on large datasets.

Convolutional Neural Networks (CNNs)

Convolutional Layer

Applies learnable filters to local patches of input. Filters detect patterns (edges, corners, textures). Parameter sharing across spatial positions.

3x3 filter applied to every 3x3 patch in input
Same filter detects patterns regardless of position
Output: feature map showing where patterns occur
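A naive 2-D convolution (strictly, cross-correlation, as in most deep learning libraries) makes the sliding-filter idea concrete; the edge-detecting kernel below is an illustrative hand-crafted example:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over every patch; output is the feature map."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# Image that is dark on the left, bright on the right
image = np.zeros((5, 5))
image[:, 3:] = 1.0
# Vertical-edge detector: responds to a left-to-right increase
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)
fmap = conv2d(image, kernel)   # strong response where the edge sits
```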

Pooling

Downsampling: max pooling takes maximum in local window. Reduces dimensionality, provides translation invariance.
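Max pooling with a 2x2 window and stride 2 halves each spatial dimension while keeping the strongest response in each window:

```python
import numpy as np

def max_pool2d(x, size=2):
    """Keep the maximum in each non-overlapping size x size window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = x[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 0., 1., 0.],
              [0., 9., 0., 1.]])
pooled = max_pool2d(x)   # 4x4 input -> 2x2 output
```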

CNN Architecture Flow

Input Image
 -> Conv layer (detect features)
 -> ReLU (non-linearity)
 -> Pool (downsample)
 -> Conv layer (detect combinations)
 -> Pool
 -> ...
 -> Fully-connected layers (classification)

Why CNNs Work for Images

  • Local connectivity matches image locality (nearby pixels related)
  • Weight sharing captures translation invariance (same object anywhere)
  • Hierarchy: low layers detect primitives, high layers detect objects

Recurrent Neural Networks (RNNs)

Sequence Processing

Process sequences (text, audio, time series) one element at a time. Hidden state carries information from previous elements.

h_t = activation(W_h * h_{t-1} + W_x * x_t + b)
output_t = W_o * h_t + b_o

Hidden state h_t summarizes past sequence
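The recurrence above translates directly into a loop; the random weights here are illustrative stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inp = 4, 3
W_h = rng.normal(scale=0.1, size=(hidden, hidden))
W_x = rng.normal(scale=0.1, size=(hidden, inp))
b = np.zeros(hidden)

def rnn(sequence):
    h = np.zeros(hidden)                      # initial hidden state
    for x_t in sequence:
        h = np.tanh(W_h @ h + W_x @ x_t + b)  # h_t carries past context
    return h                                  # summary of the whole sequence

seq = rng.normal(size=(5, inp))               # sequence of 5 input vectors
h_final = rnn(seq)
```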

LSTM and GRU

LSTM (Long Short-Term Memory): Gating mechanisms control information flow. Solves vanishing gradients. Can remember long-range dependencies.

GRU (Gated Recurrent Unit): Simplified LSTM. Same benefits, fewer parameters.

Bidirectional Processing

Process sequence forward and backward, combine outputs. Captures full context (both past and future).

Transformer Architecture

Self-Attention

Core mechanism: each element attends to all other elements. Learns which elements to focus on.

Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V

Q (Query): what am I looking for?
K (Key): what information do I have?
V (Value): what information to use?
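The attention formula above, implemented directly (the query/key/value matrices are random placeholders; in a transformer they are linear projections of the input):

```python
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)    # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V               # weighted sum of values per query

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 queries, dimension 8
K = rng.normal(size=(5, 8))   # 5 keys
V = rng.normal(size=(5, 8))   # 5 values
out = attention(Q, K, V)      # one output vector per query
```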

Transformer Layer

Multi-head self-attention: parallel attention mechanisms
Feed-forward network: two dense layers with non-linearity
Residual connections: skip connections enable deep stacking
Layer normalization: stabilizes training

Why Transformers Dominate

  • Parallelizable training (unlike RNNs which are sequential)
  • Captures long-range dependencies well (no vanishing gradient)
  • Flexible to different modalities (text, images, audio)

Optimization Techniques

Learning Rate Scheduling

Start with high learning rate (fast initial progress), decay during training (fine-tuning). Cosine annealing, step decay common.
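Cosine annealing follows half a cosine period from a high to a low rate (the 0.1 and 0.001 endpoints below are illustrative defaults):

```python
import math

def cosine_lr(step, total_steps, lr_max=0.1, lr_min=0.001):
    """Decay from lr_max to lr_min along half a cosine curve."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))

# High at the start (fast progress), low at the end (fine-tuning)
lrs = [cosine_lr(s, 100) for s in range(101)]
```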

Batch Normalization

Normalize layer inputs to zero-mean, unit-variance. Stabilizes training, allows higher learning rates, acts as regularizer.
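Training-time batch norm in a few lines: normalize each feature over the batch, then apply learnable scale gamma and shift beta (initialized to 1 and 0 here):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch dimension, then scale/shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # eps guards against var == 0
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 10))   # batch of 64, 10 features
out = batch_norm(x)   # per-feature mean ~0, std ~1
```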

Gradient Accumulation

Simulate larger batch size with limited memory: backprop through mini-batches without updating, then update after K mini-batches.
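A minimal sketch of the accumulate-then-update pattern, using 1-D linear regression with hand-derived gradients so no framework is needed (data and K are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32,))
y = 2.0 * X                            # true weight 2 (illustrative)

w, lr, K = 0.0, 0.1, 4
accum = 0.0
for k, (xb, yb) in enumerate(zip(X.reshape(K, -1), y.reshape(K, -1)), start=1):
    err = w * xb - yb
    accum += 2.0 * np.mean(err * xb)   # accumulate gradient, do not update yet
    if k % K == 0:
        w -= lr * (accum / K)          # one averaged update per K mini-batches
        accum = 0.0
```

The averaged update matches what a single K-times-larger batch would produce, at the memory cost of one mini-batch.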

Mixed Precision Training

Use float16 for forward/backward passes (faster, less memory), float32 for weight updates (more stable). Speedup 2-3x with minimal accuracy loss.

Distributed Training

Train on multiple GPUs/TPUs. Data parallelism: different GPUs process different data, synchronize gradients. Enables larger batch sizes, faster training.

Regularization in Deep Learning

Dropout

Randomly zero activations during training (probability p, typically 0.5). Prevents co-adaptation. At test time, scale activations by (1-p).
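Classic dropout as described above: zero each activation with probability p while training, and scale by the keep probability (1 - p) at test time so expected activations match:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(x, p=0.5):
    mask = rng.random(x.shape) >= p   # keep each unit with probability 1 - p
    return x * mask

def dropout_test(x, p=0.5):
    return x * (1.0 - p)              # match the training-time expectation

x = np.ones(10000)
kept = dropout_train(x)               # roughly half the activations survive
```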

Data Augmentation

Transform training data (rotations, crops, color shifts). Increases effective dataset size. Critical for limited data.

Early Stopping

Monitor validation loss. Stop training when validation loss plateaus/increases. Prevents overfitting without explicit regularizer.

Weight Decay

Add L2 penalty to loss (or equivalent: decay weights each update). Prevents large weights.

Label Smoothing

Soften hard targets (1, 0) to soft targets (0.9, 0.1). Prevents overconfidence, improves generalization.
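One common formulation: with smoothing factor eps, the true class gets 1 - eps and eps is spread uniformly over all classes:

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """True class -> 1 - eps (+ its uniform share); eps spread over all classes."""
    n_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / n_classes

hard = np.array([0.0, 0.0, 1.0, 0.0])
soft = smooth_labels(hard)   # true class 0.925, others 0.025 each
```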

Applications and Breakthroughs

Computer Vision

Image classification, object detection, semantic segmentation. Accuracy matches or exceeds human performance on several benchmark tasks.

Natural Language Processing

Machine translation, question answering, text generation. Large language models (GPT) achieve remarkable language understanding.

Game Playing

AlphaGo (CNNs + MCTS) defeated world Go champion. AlphaZero (self-play RL) learned chess, shogi, Go from scratch.

Scientific Discovery

AlphaFold predicts 3D protein structures from amino acid sequences. Drug discovery, materials design, physics simulations.

Autonomous Systems

Self-driving cars, robotics. Deep learning processes sensor data; reinforcement learning learns control policies.

Challenges and Limitations

Data Efficiency

Deep networks need massive training data. Hundreds of thousands to millions of examples. Limited by annotation cost, privacy, availability.

Interpretability

Deep networks are "black boxes"—it is difficult to understand why a given prediction was made. Attention mechanisms and saliency maps provide some interpretability.

Robustness

Adversarial examples: tiny perturbations fool networks. Adversarial training and certified defenses help but remain incomplete.

Computational Cost

Training large models requires massive GPU/TPU resources, energy consumption. Environmental cost, accessibility barrier.

Generalization

Models often don't transfer well across domains. Distribution shift causes performance degradation.

Overfitting

Many parameters, complex models easily overfit. Requires careful regularization, validation.

References

  • Goodfellow, I., Bengio, Y., and Courville, A. "Deep Learning." MIT Press, 2016.
  • Krizhevsky, A., Sutskever, I., and Hinton, G. E. "ImageNet Classification with Deep Convolutional Neural Networks." NIPS, 2012.
  • He, K., Zhang, X., Ren, S., and Sun, J. "Deep Residual Learning for Image Recognition." IEEE CVPR, 2016.
  • Vaswani, A., Shazeer, N., Parmar, N., et al. "Attention Is All You Need." NIPS, 2017.
  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. "An Image is Worth 16x16 Words." ICLR, 2021.