Introduction
A language model assigns probability to sequences of words. Core task: given words w1, w2, ..., w_{n-1}, predict w_n, i.e. estimate P(w_n | w1, w2, ..., w_{n-1}). Enables text generation, evaluation, machine translation, speech recognition.
Historical evolution: n-gram models (simple, efficient), neural models (more expressive), RNNs/LSTMs (handle longer context), transformers (parallel training, long-range dependencies). Modern LLMs (GPT-3, GPT-4, LLaMA): hundreds of billions parameters, trained on trillions of tokens.
Language models capture linguistic structure: syntax, semantics, pragmatics, world knowledge. Emergent capabilities: when sufficiently large, models demonstrate reasoning, few-shot learning, instruction-following without explicit training.
"Language modeling is perhaps the most important task in NLP. Nearly all modern NLP systems build on language models as foundation." -- Yoshua Bengio, Machine Learning pioneer
Problem Definition and Foundations
Probabilistic Formulation
Language model: probability distribution P(w_1, w_2, ..., w_n) over word sequences. Using chain rule:
P(w1, w2, ..., wn) = P(w1) * P(w2|w1) * P(w3|w1,w2) * ... * P(wn|w1,...,w_{n-1})
Each factor P(w_i | w1,...,w_{i-1}) is conditional probability of word i given history.
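A toy numeric sketch of the chain-rule factorization; the conditional probabilities below are invented purely for illustration:

```python
import math

# Invented conditionals for the sequence "the cat sat":
# P(the) = 0.2, P(cat | the) = 0.1, P(sat | the, cat) = 0.3
cond_probs = [0.2, 0.1, 0.3]

# Chain rule: the joint probability is the product of the conditionals.
joint = math.prod(cond_probs)
print(joint)  # 0.006 (up to floating point)
```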
Objective: Minimizing Perplexity
Perplexity measures model performance: PP = exp(-1/N * sum log P(w_i | history)). Lower is better. Intuition: the average branching factor, i.e. how surprised the model is by held-out text.
PP = exp(L), where L = average per-token cross-entropy loss (in nats)
If model perfectly predicts: perplexity = 1
If model random (uniform): perplexity = vocabulary size
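Both boundary cases can be checked directly from the definition (a small sketch; the log-probabilities are invented):

```python
import math

def perplexity(token_log_probs):
    """PP = exp(-(1/N) * sum of log P(w_i | history))."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Perfect prediction: every token had probability 1 -> PP = 1.
assert perplexity([math.log(1.0)] * 5) == 1.0

# Uniform over a 10,000-word vocabulary -> PP = |V| = 10,000.
uniform = [math.log(1 / 10_000)] * 5
print(round(perplexity(uniform)))  # 10000
```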
Markov Assumption
Conditioning on the full history is intractable. Markov assumption: P(w_i | w1,...,w_{i-1}) ≈ P(w_i | w_{i-k}, ..., w_{i-1}). Only last k words matter (k-th order Markov). Enables tractable models.
N-Gram Language Models
Bigram Models (n=2, first-order Markov)
P(w_i | w_{i-1}). Model assumes each word depends only on previous word. Count occurrences in training data:
P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
Example: "the cat sat"
P(cat | the) = count("the cat") / count("the")
P(sat | cat) = count("cat sat") / count("cat")
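A minimal sketch of bigram estimation by counting, using a toy corpus:

```python
from collections import Counter

def bigram_probs(corpus_tokens):
    """Maximum-likelihood bigram estimates from raw counts."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    return {(prev, w): c / unigrams[prev] for (prev, w), c in bigrams.items()}

tokens = "the cat sat on the mat".split()
probs = bigram_probs(tokens)
print(probs[("the", "cat")])  # count("the cat") / count("the") = 1/2 = 0.5
```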
Higher-Order N-Grams
Trigram (n=3): P(w_i | w_{i-2}, w_{i-1}), a second-order Markov model. 4-gram, 5-gram capture longer dependencies. Tradeoff: higher n improves quality (longer context) but requires exponentially more data; counts become sparse.
Sparsity Problem
Rare n-grams have count=0 → probability 0 (incorrect). Solution: smoothing. Laplace smoothing (add-one): add one to every n-gram count so no probability is zero. Backoff: if trigram unseen, use bigram probability. Kneser-Ney: sophisticated smoothing using lower-order statistics.
Laplace smoothing:
P(w_i | context) = (count(context, w_i) + 1) / (count(context) + |V|)
Backoff (trigram unseen):
P(w_i | w_{i-2}, w_{i-1}) ≈ P(w_i | w_{i-1})
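A small sketch of add-one smoothing; the toy corpus and vocabulary are invented for illustration:

```python
from collections import Counter

def laplace_bigram(prev, w, bigrams, unigrams, vocab_size):
    """Add-one smoothing: (count(prev, w) + 1) / (count(prev) + |V|)."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size)

tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)  # 5 word types

# Seen bigram: (1 + 1) / (2 + 5)
print(laplace_bigram("the", "cat", bigrams, unigrams, V))
# Unseen bigram now gets nonzero mass: (0 + 1) / (2 + 5)
print(laplace_bigram("the", "sat", bigrams, unigrams, V))
```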
Advantages and Limitations
| Aspect | N-Gram | Neural |
|---|---|---|
| Training Speed | Fast (count) | Slow (backprop) |
| Context Length | Fixed, short (2-5) | Flexible (100s-1000s) |
| Parameter Sharing | None | Yes (embeddings, weights) |
| Memory Usage | Huge (all n-grams) | Moderate (parameters) |
| Perplexity (modern text) | 50-100 | 20-40 |
Neural Language Models
Distributed Representations
Key insight: embed words in continuous space. Semantically similar words = nearby embeddings. Neural network learns embeddings jointly with language modeling.
Architecture:
Input: word IDs [w_{i-k}, ..., w_{i-1}]
Embedding: dense vectors [e_{w_{i-k}}, ..., e_{w_{i-1}}]
Hidden: h = tanh(W * concat([e_{w_{i-k}}, ..., e_{w_{i-1}}]) + b)
Output: P(w_i | context) = softmax(V * h)
Learns embeddings + hidden weights jointly
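The forward pass above can be sketched with random, untrained weights (dimensions chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k, h = 10, 4, 2, 8   # vocab size, embedding dim, context length, hidden size

E = rng.normal(size=(V, d))          # embedding table
W = rng.normal(size=(h, k * d))      # hidden-layer weights
b = np.zeros(h)
U = rng.normal(size=(V, h))          # output weights ("V" in the text)

def forward(context_ids):
    """P(w_i | context) for a fixed-window feedforward LM (untrained)."""
    x = np.concatenate([E[i] for i in context_ids])  # look up and concat embeddings
    hidden = np.tanh(W @ x + b)
    logits = U @ hidden
    p = np.exp(logits - logits.max())
    return p / p.sum()                               # softmax over the vocabulary

probs = forward([3, 7])
assert probs.shape == (V,) and abs(probs.sum() - 1) < 1e-9
```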
Word Embeddings
Embeddings capture semantic relationships. Word2vec (Skip-gram, CBOW): explicit embedding training. Modern: embeddings learned as a byproduct of LM training. Nearby embeddings: "king" near "queen", "prince"; analogy arithmetic: king − man + woman ≈ queen.
Advantages Over N-Gram
- Parameter Sharing: Similar words share embeddings. Rare words benefit from similar word statistics.
- Longer Context: Hidden layer captures context dependency. Works with longer contexts than is feasible for n-grams.
- Generalization: Learned representations generalize. Contexts never seen in training still transfer via embeddings of similar words.
Recurrent Neural Network Models
RNN-Based Language Model
Process sequences of arbitrary length. Hidden state h_t carries information from previous time steps:
h_t = tanh(W_h * h_{t-1} + W_x * x_t + b)
P(w_t | history) = softmax(W_o * h_t + b_o)
Processes word by word, hidden state accumulates context
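A minimal sketch of the recurrence with random, untrained weights (toy dimensions and token IDs):

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, h = 10, 4, 6  # vocab size, embedding dim, hidden size

E = rng.normal(size=(V, d))
W_h = rng.normal(size=(h, h)) * 0.1
W_x = rng.normal(size=(h, d)) * 0.1
b = np.zeros(h)
W_o = rng.normal(size=(V, h))
b_o = np.zeros(V)

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

# Process a sequence word by word; the hidden state accumulates context.
h_t = np.zeros(h)
for w in [2, 5, 1]:                       # toy token IDs
    h_t = np.tanh(W_h @ h_t + W_x @ E[w] + b)
    p_next = softmax(W_o @ h_t + b_o)     # P(w_t | history)

assert abs(p_next.sum() - 1) < 1e-9
```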
LSTM and GRU
Vanishing gradient problem: gradients through many time steps exponentially small. LSTM (Long Short-Term Memory) uses gating mechanisms (input, forget, output gates) to preserve gradients over long sequences. GRU (Gated Recurrent Unit): simplified LSTM, similar performance, fewer parameters.
Training RNN Language Models
Backpropagation through time (BPTT): unroll RNN for T steps, backprop through all steps. Truncated BPTT: limit to 50-100 steps (reduce computation). Gradient clipping: prevent exploding gradients.
State-of-the-Art RNN Models
AWD-LSTM (regularized LSTM): achieves strong results. Weight-dropped recurrent connections (DropConnect), variational dropout, and weight tying improve performance. Slower training than transformers but works well for smaller models.
Transformer-Based Language Models
Motivation
RNNs sequential: can't parallelize across time steps. Transformers replace recurrence with attention: all positions attend to all other positions in parallel. Enables massive parallelization, faster training, longer effective context.
Attention Mechanism
Given query Q, key K, value V, attention computes weighted sum of values:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
Softmax weights each position by relevance to query. Each position attends to relevant context.
Self-Attention in Language Models
Query, key, value all from same sequence. Position i attends to positions j ≤ i (causal attention for autoregressive LM). Prevents cheating (looking at future).
Causal mask: positions only attend to past
Self-attention(i) = weighted sum of positions 1..i, where weights learned
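A sketch of causal scaled dot-product self-attention (toy dimensions, random inputs):

```python
import numpy as np

def causal_attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V with a causal mask: position i
    only attends to positions j <= i (no looking at the future)."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores[mask] = -np.inf                            # future positions masked out
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(2)
x = rng.normal(size=(5, 8))          # 5 positions, d_k = 8; self-attention: Q = K = V = x
out, weights = causal_attention(x, x, x)

# Each row sums to 1 and puts zero weight on any future position j > i.
assert np.allclose(weights.sum(axis=-1), 1.0)
assert np.allclose(np.triu(weights, k=1), 0.0)
```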
Multi-Head Attention
Multiple attention heads in parallel (8-12 typically). Each head learns different attention pattern. Allows focusing on different aspects simultaneously.
Transformer Layer
Stack: multi-head self-attention → residual connection + layer norm → feed-forward network → residual connection + layer norm. Modern architecture: pre-norm (layer norm before each sublayer, not after). Enables deeper models.
Self-Attention and Multi-Head Attention
Scaled Dot-Product Attention
Scale factor 1/sqrt(d_k) prevents attention weights saturating when d_k large. Without scale: large dot products → softmax near 0 or 1 (extreme), small gradients.
Multi-Head Details
H heads, each projects to d_k-dimensional subspace. Compute attention separately, concatenate results. Enables modeling different semantic relationships in parallel.
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_o
head_i = Attention(Q*W_i^Q, K*W_i^K, V*W_i^V)
Each head learns separate projection and attention pattern
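A compact sketch of the multi-head computation with random projections and toy dimensions:

```python
import numpy as np

def softmax(z, axis=-1):
    z = np.exp(z - z.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def multi_head(x, heads, W_q, W_k, W_v, W_o):
    """Concat(head_1, ..., head_h) @ W_o, each head with its own projections."""
    n, d = x.shape
    d_k = d // heads
    outs = []
    for i in range(heads):
        Q, K, V = x @ W_q[i], x @ W_k[i], x @ W_v[i]  # (n, d_k) each
        A = softmax(Q @ K.T / np.sqrt(d_k))           # per-head attention pattern
        outs.append(A @ V)
    return np.concatenate(outs, axis=-1) @ W_o        # back to (n, d)

rng = np.random.default_rng(3)
n, d, h = 4, 16, 4
x = rng.normal(size=(n, d))
W_q = rng.normal(size=(h, d, d // h)); W_k = rng.normal(size=(h, d, d // h))
W_v = rng.normal(size=(h, d, d // h)); W_o = rng.normal(size=(d, d))
out = multi_head(x, h, W_q, W_k, W_v, W_o)
assert out.shape == (n, d)
```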
Positional Encoding
Attention mechanism permutation-invariant: position-agnostic. Add positional encodings to input embeddings. Sinusoidal positional encoding: PE_{pos, 2i} = sin(pos / 10000^{2i/d}), PE_{pos, 2i+1} = cos(pos / 10000^{2i/d}). Enables learning position-dependent relationships.
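The sinusoidal scheme can be written directly from the formulas:

```python
import numpy as np

def positional_encoding(max_len, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = positional_encoding(50, 16)
assert pe.shape == (50, 16)
# Position 0: sin(0) = 0 on even dims, cos(0) = 1 on odd dims.
assert np.allclose(pe[0, 0::2], 0.0) and np.allclose(pe[0, 1::2], 1.0)
```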
GPT Architecture and Autoregressive Generation
GPT Design
Decoder-only transformer. Autoregressive: predict next token given previous. Causal self-attention masks future (can't cheat). Fine-tune on downstream tasks by adding task-specific head.
Generation Process
Given prompt, generate token-by-token. At each step, output probability distribution over vocabulary. Select next token (greedy, sampling, top-k, top-p). Append to context, repeat.
Prompt: "The capital of France is"
Step 1: P(w | "The capital of France is") → sample/argmax
Step 2: P(w | "The capital of France is Paris") → sample/argmax
Step 3: P(w | "The capital of France is Paris,") → sample/argmax
...continues until stop token or max length
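The loop above can be sketched with a stand-in model; `next_token_probs` and its toy vocabulary are hypothetical placeholders for a real transformer forward pass:

```python
import random

random.seed(0)

def next_token_probs(context):
    """Stand-in for a trained LM: returns P(w | context) over a toy vocabulary.
    (Hypothetical; a real model would run a forward pass over the context.)"""
    vocab = ["Paris", ".", "<eos>", "the"]
    return dict(zip(vocab, [0.7, 0.2, 0.05, 0.05]))

def generate(prompt, max_len=5, eos="<eos>"):
    tokens = prompt.split()
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        words, weights = zip(*probs.items())
        tok = random.choices(words, weights=weights)[0]  # sample next token
        if tok == eos:                                   # stop on end-of-sequence
            break
        tokens.append(tok)                               # append to context, repeat
    return " ".join(tokens)

print(generate("The capital of France is"))
```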
Scaling Laws
Model performance improves predictably with scale: parameters, data, compute. GPT-3 (175B params) vs. GPT-2 (1.5B): massive improvement. Emergent abilities appear: few-shot learning, instruction following emerge at scale, not present in smaller models.
In-Context Learning
GPT-3: few-shot learning via examples in prompt. No gradient update, only forward pass. "Show me 3 examples, now do this task." Enables rapid adaptation.
BERT Architecture and Bidirectional Context
Bidirectional Context
BERT unlike GPT: no causal masking. Each position attends to all other positions. Bidirectional context helps understanding (classification). Sacrifices autoregressive generation (can't sample sequentially).
Pre-Training Objectives
Masked Language Modeling (MLM): Randomly mask 15% of tokens. Predict masked words from surrounding context. Forces learning bidirectional representations.
Next Sentence Prediction (NSP): Predict if two sentences consecutive in corpus. Helps with understanding relationships between sentences.
Input: [CLS] The cat sat [MASK] the mat [SEP] The dog [MASK] [SEP]
Sentence A: "The cat sat [MASK] the mat"; sentence B: "The dog [MASK]"
MLM: Predict "on" at position 5 and "ran" at position 11 (counting [CLS] as position 1)
NSP: Predict if B follows A in corpus
Fine-Tuning for Tasks
Add task-specific head on [CLS] token. Classification: linear layer → softmax. NER: linear layer per token. Question answering: predict start/end token positions. Minimal architecture changes, pre-trained weights do heavy lifting.
Pre-Training and Fine-Tuning
Pre-Training Scale
BERT: trained on 3.3B words (Books + Wikipedia). GPT-3: trained on 300B tokens. Massive unlabeled data availability. Pre-training expensive (days to weeks on large TPU/GPU clusters). But amortized over thousands of fine-tuning tasks.
Data Efficiency
Fine-tuning requires modest labeled data (100s to 10,000s examples). Without pre-training: need 100,000+ examples for comparable performance. Pre-training massive efficiency gain.
Fine-Tuning Strategies
- Feature Extraction: Freeze pre-trained weights, train task head. Fast, works when task similar to pre-training objective.
- Full Fine-tuning: Update all weights. Better final performance, requires more data and computation.
- Adapter Modules: Insert small trainable modules, freeze backbone. Parameter-efficient, avoids catastrophic forgetting.
- LoRA (Low-Rank Adaptation): Learn low-rank updates to weights. Minimal parameters, efficient, enables fast task switching.
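A sketch of the LoRA idea: freeze the pre-trained weight, learn only a low-rank update. Matrix names and dimensions are chosen for illustration; initializing one factor to zero means the adapted model starts out identical to the pre-trained one:

```python
import numpy as np

rng = np.random.default_rng(4)
d_out, d_in, r = 64, 64, 4                # r << d: the low-rank bottleneck

W = rng.normal(size=(d_out, d_in))        # frozen pre-trained weight
A = rng.normal(size=(d_out, r)) * 0.01    # trainable low-rank factor
B = np.zeros((r, d_in))                   # trainable, zero-init -> no change at start

def lora_forward(x):
    """W stays frozen; only the low-rank update A @ B is learned."""
    return (W + A @ B) @ x

x = rng.normal(size=d_in)
# With B = 0 the adapted model matches the pre-trained one exactly.
assert np.allclose(lora_forward(x), W @ x)
# Trainable parameters: r*(d_in + d_out) vs d_in*d_out for full fine-tuning.
print(r * (d_in + d_out), "vs", d_in * d_out)  # 512 vs 4096
```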
Catastrophic Forgetting
Fine-tuning on new task erases pre-training knowledge. Mitigated: small learning rate, early stopping, regularization (L2 penalty on weights).
Decoding Strategies and Text Generation
Greedy Decoding
Select highest-probability token at each step. Fast, deterministic. Often poor quality (misses better sequences, gets stuck in loops).
Beam Search
Keep K best hypotheses (beams). Expand, prune to top K. Trade-off: K=1 is greedy, K=∞ is exhaustive. K=3-5 typical. Better quality than greedy, still tractable.
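A compact sketch of beam search over a hypothetical toy model, showing a case where greedy decoding commits to a locally likely first token while the beam recovers a higher-probability full sequence:

```python
import math

def beam_search(next_log_probs, k=3, steps=4, eos="<eos>"):
    """Keep the K best partial hypotheses; expand each, prune back to K."""
    beams = [([], 0.0)]                              # (tokens, cumulative log prob)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))   # finished beam kept as-is
                continue
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
    return beams

def toy_model(tokens):
    """Hypothetical LM: "a" looks best first, but "b" leads to a better sequence."""
    if not tokens:
        return {"a": math.log(0.55), "b": math.log(0.45)}
    if tokens[-1] == "b":
        return {"c": math.log(0.9), "<eos>": math.log(0.1)}
    if tokens[-1] == "a":
        return {"c": math.log(0.6), "<eos>": math.log(0.4)}
    return {"c": math.log(0.3), "<eos>": math.log(0.7)}

best = beam_search(toy_model, k=2, steps=3)[0]
print(best[0])  # greedy would start with "a"; the beam keeps "b" alive and wins
```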
Sampling and Temperature
Sample from distribution (not argmax). Temperature T controls sharpness: high T (>1) smooth distribution, low T (<1) sharp. T=1 true distribution, T→0 greedy, T→∞ uniform.
P_T(w) = P(w)^{1/T} / sum_w' P(w')^{1/T}
High temperature: more diversity, risk incoherence
Low temperature: more confident, risk repetition
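The temperature formula applied to a toy distribution (tokens and probabilities invented for illustration):

```python
def apply_temperature(probs, T):
    """P_T(w) proportional to P(w)^(1/T); T < 1 sharpens, T > 1 flattens."""
    scaled = {w: p ** (1 / T) for w, p in probs.items()}
    z = sum(scaled.values())
    return {w: s / z for w, s in scaled.items()}

p = {"the": 0.5, "a": 0.3, "cat": 0.2}
sharp = apply_temperature(p, 0.5)   # low T: the top token gains mass
flat = apply_temperature(p, 2.0)    # high T: closer to uniform
assert sharp["the"] > p["the"] > flat["the"]
```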
Top-K and Top-P (Nucleus Sampling)
Top-K: sample only from K highest-probability tokens. Top-P: sample from smallest set of tokens with cumulative probability ≥ P. Avoids extremely unlikely tokens. Better quality than raw sampling.
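Both filters sketched over a toy next-token distribution (tokens invented for illustration):

```python
def top_k(probs, k):
    """Keep the k highest-probability tokens, renormalize."""
    kept = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    z = sum(kept.values())
    return {w: p / z for w, p in kept.items()}

def top_p(probs, p_threshold):
    """Nucleus: smallest set of tokens with cumulative probability >= p_threshold."""
    kept, cum = {}, 0.0
    for w, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[w] = p
        cum += p
        if cum >= p_threshold:
            break
    z = sum(kept.values())
    return {w: p / z for w, p in kept.items()}

probs = {"Paris": 0.6, "Lyon": 0.25, "a": 0.1, "zebra": 0.05}
assert set(top_k(probs, 2)) == {"Paris", "Lyon"}
assert set(top_p(probs, 0.8)) == {"Paris", "Lyon"}   # 0.6 + 0.25 >= 0.8
```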
Stopping Criteria
Stop on end-of-sequence token, max length, or custom criteria. Avoid repetition: penalize repeated n-grams. Length penalty: normalize score by sequence length.
Applications and Impact
Machine Translation
Sequence-to-sequence: encoder reads source, decoder generates target. Attention attends to relevant source words. Transformers state-of-the-art. BLEU scores improved from 20s (SMT) to 30s+ (neural).
Text Generation
Story generation, summarization, paraphrasing. GPT-2 impressive generation despite simple pre-training objective. GPT-3 few-shot: generate by example. Applications: code generation, dialogue, creative writing.
Question Answering
SQuAD benchmark: machine reading comprehension. BERT fine-tuning achieves human-level performance. Downstream: customer service chatbots, FAQ systems.
Semantic Understanding
BERT embeddings used for sentence similarity, clustering, semantic search. Better than TF-IDF, word2vec for semantic tasks. Enable retrieval-augmented generation (find relevant context, generate answer).
Instruction Following
GPT-3 follows natural language instructions: "summarize the following in 3 bullet points", "write a poem about cats". Emergent ability from scale. InstructGPT fine-tuning improves alignment with user intent.
References
- Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. "A Neural Probabilistic Language Model." JMLR, vol. 3, 2003.
- Mikolov, T., Karafiát, M., Burget, L., et al. "Recurrent Neural Network based Language Model." INTERSPEECH, 2010.
- Vaswani, A., Shazeer, N., Parmar, N., et al. "Attention Is All You Need." NeurIPS, 2017.
- Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL, 2019.
- Brown, T. B., Mann, B., Ryder, N., et al. "Language Models are Few-Shot Learners." NeurIPS, 2020.