Introduction

A language model assigns probabilities to sequences of words. Core task: given words w_1, w_2, ..., w_{n-1}, predict w_n, i.e. estimate P(w_n | w_1, w_2, ..., w_{n-1}). Enables text generation, evaluation, machine translation, speech recognition.

Historical evolution: n-gram models (simple, efficient), neural models (more expressive), RNNs/LSTMs (handle longer context), transformers (parallel training, long-range dependencies). Modern LLMs (GPT-3, GPT-4, LLaMA): hundreds of billions of parameters, trained on trillions of tokens.

Language models capture linguistic structure: syntax, semantics, pragmatics, world knowledge. Emergent capabilities: when sufficiently large, models demonstrate reasoning, few-shot learning, instruction-following without explicit training.

"Language modeling is perhaps the most important task in NLP. Nearly all modern NLP systems build on language models as foundation." -- Yoshua Bengio, Machine Learning pioneer

Problem Definition and Foundations

Probabilistic Formulation

Language model: probability distribution P(w_1, w_2, ..., w_n) over word sequences. Using chain rule:

P(w_1, w_2, ..., w_n) = P(w_1) * P(w_2 | w_1) * P(w_3 | w_1, w_2) * ... * P(w_n | w_1, ..., w_{n-1})

Each factor P(w_i | w1,...,w_{i-1}) is conditional probability of word i given history.
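The factorization can be sketched directly; a toy Python example with hypothetical conditional probabilities (the cond_p values are made up for illustration):

```python
import math

# Toy conditional probabilities (hypothetical values for illustration).
# Keys are histories ending in the word being scored.
cond_p = {
    ("the",): 0.1,                  # P(the)
    ("the", "cat"): 0.02,           # P(cat | the)
    ("the", "cat", "sat"): 0.3,     # P(sat | the, cat)
}

def sequence_prob(words):
    """Chain rule: multiply P(w_i | w_1..w_{i-1}) over the sequence."""
    p = 1.0
    for i in range(len(words)):
        p *= cond_p[tuple(words[: i + 1])]
    return p

# P(the, cat, sat) = 0.1 * 0.02 * 0.3
assert math.isclose(sequence_prob(["the", "cat", "sat"]), 0.0006)
```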

Objective: Minimizing Perplexity

Perplexity measures model performance: PP = exp(-(1/N) * sum_i log P(w_i | history)). Lower is better. Intuition: effective average branching factor; how surprised the model is by held-out text.

PP = exp(L) where L = cross-entropy loss
If model perfectly predicts: perplexity = 1
If model random (uniform): perplexity = vocabulary size
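A minimal sketch of the perplexity computation from per-token log-probabilities, checking the two boundary cases above:

```python
import math

def perplexity(log_probs):
    """PP = exp(-(1/N) * sum_i log P(w_i | history)), natural log."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Perfect prediction: every token gets probability 1 -> PP = 1.
assert perplexity([math.log(1.0)] * 5) == 1.0

# Uniform model over a vocabulary of size V -> PP = V.
V = 10_000
assert math.isclose(perplexity([math.log(1 / V)] * 5), V)
```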

Markov Assumption

Exact history intractable. Markov assumption: P(w_i | w1,...,w_{i-1}) ≈ P(w_i | w_{i-k}, ..., w_{i-1}). Only last k words matter (k-th order Markov). Enables tractable models.

N-Gram Language Models

Bigram Models (n = 2)

P(w_i | w_{i-1}). Model assumes each word depends only on previous word. Count occurrences in training data:

P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})

Example: "the cat sat"
P(cat | the) = count("the cat") / count("the")
P(sat | cat) = count("cat sat") / count("cat")
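The counting estimate above is a few lines of Python (toy corpus, MLE estimates only, no smoothing):

```python
from collections import Counter

corpus = "the cat sat on the mat".split()

unigram = Counter(corpus)                     # count(w)
bigram = Counter(zip(corpus, corpus[1:]))     # count(w_prev, w)

def p_bigram(w_prev, w):
    """MLE bigram estimate: count(w_prev, w) / count(w_prev)."""
    return bigram[(w_prev, w)] / unigram[w_prev]

assert p_bigram("the", "cat") == 0.5   # "the" occurs twice, "the cat" once
assert p_bigram("cat", "sat") == 1.0   # "cat" is always followed by "sat"
```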

Higher-Order N-Grams

Trigram (n = 3): P(w_i | w_{i-2}, w_{i-1}). 4-gram, 5-gram capture longer dependencies. Tradeoff: higher n improves quality (longer context) but requires exponentially more data; counts become sparse.

Sparsity Problem

Rare n-grams have count = 0 → probability 0 (incorrect). Solution: smoothing. Laplace smoothing (add-one): add one to every n-gram count. Backoff: if trigram unseen, use bigram probability. Kneser-Ney: sophisticated smoothing using lower-order statistics.

Laplace smoothing:
P(w_i | context) = (count(context, w_i) + 1) / (count(context) + |V|)

Backoff (trigram unseen):
P(w_i | w_{i-2}, w_{i-1}) ≈ P(w_i | w_{i-1})
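A sketch of add-one smoothing over the same kind of toy corpus; note the unseen bigram now gets non-zero probability:

```python
from collections import Counter

corpus = "the cat sat on the mat".split()
V = len(set(corpus))                 # vocabulary size |V| = 5

unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))

def p_laplace(w_prev, w):
    """Add-one smoothing: (count(context, w) + 1) / (count(context) + |V|)."""
    return (bigram[(w_prev, w)] + 1) / (unigram[w_prev] + V)

# Unseen bigram gets non-zero probability instead of 0.
assert p_laplace("mat", "cat") > 0
# Seen bigram still more probable than unseen one with the same context.
assert p_laplace("the", "cat") > p_laplace("the", "sat")
```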

Advantages and Limitations

Aspect                   | N-Gram                    | Neural
Training speed           | Fast (counting)           | Slow (backprop)
Context length           | Fixed, short (2-5)        | Flexible (100s-1000s)
Parameter sharing        | None                      | Yes (embeddings, weights)
Memory usage             | Huge (stores all n-grams) | Moderate (parameters)
Perplexity (modern text) | ~50-100                   | ~20-40

Neural Language Models

Distributed Representations

Key insight: embed words in continuous space. Semantically similar words = nearby embeddings. Neural network learns embeddings jointly with language modeling.

Architecture:
Input: word IDs [w_{i-k}, ..., w_{i-1}]
Embedding: dense vectors [e_{w_{i-k}}, ..., e_{w_{i-1}}]
Hidden: h = tanh(W * concat([e_{w_{i-k}}, ..., e_{w_{i-1}}]) + b)
Output: P(w_i | context) = softmax(V * h)

Learns embeddings + hidden weights jointly
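A minimal numpy sketch of this forward pass (random untrained weights; the dimensions V, d, k, H are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k, H = 100, 16, 3, 32          # vocab size, embed dim, context, hidden

E = rng.normal(size=(V, d))          # embedding table
W = rng.normal(size=(H, k * d))      # hidden-layer weights
b = np.zeros(H)
U = rng.normal(size=(V, H))          # output projection
c = np.zeros(V)

def next_word_probs(context_ids):
    """Forward pass: embed, concat, tanh hidden layer, softmax over vocab."""
    x = np.concatenate([E[i] for i in context_ids])   # (k*d,)
    h = np.tanh(W @ x + b)                            # (H,)
    logits = U @ h + c                                # (V,)
    z = np.exp(logits - logits.max())                 # stable softmax
    return z / z.sum()

p = next_word_probs([4, 17, 42])
assert p.shape == (V,) and np.isclose(p.sum(), 1.0) and (p >= 0).all()
```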

Word Embeddings

Embeddings capture semantic relationships. Word2vec (Skip-gram, CBOW): explicit embedding training. Modern: embeddings learned as a byproduct of LM training. Nearby embeddings: "king" near "queen", "prince"; analogies: king - man + woman ≈ queen.

Advantages Over N-Gram

  • Parameter Sharing: Similar words share embedding structure. Rare words benefit from the statistics of similar words.
  • Longer Context: Hidden layer captures context dependency. Feasible with longer contexts than n-gram models allow.
  • Generalization: Learned representations generalize. Contexts never seen in training transfer via embeddings.

Recurrent Neural Network Models

RNN-Based Language Model

Process sequences of arbitrary length. Hidden state h_t carries information from previous time steps:

h_t = tanh(W_h * h_{t-1} + W_x * x_t + b)
P(w_t | history) = softmax(W_o * h_t + b_o)

Processes word by word, hidden state accumulates context
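A minimal numpy sketch of the recurrence (random untrained weights, hypothetical dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, H = 50, 8, 16                   # vocab size, embed dim, hidden size

E = rng.normal(size=(V, d)) * 0.1     # embeddings
W_h = rng.normal(size=(H, H)) * 0.1   # hidden-to-hidden
W_x = rng.normal(size=(H, d)) * 0.1   # input-to-hidden
W_o = rng.normal(size=(V, H)) * 0.1   # hidden-to-output
b, b_o = np.zeros(H), np.zeros(V)

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

def rnn_lm(token_ids):
    """Run the RNN over a sequence; return P(next word) at each step."""
    h = np.zeros(H)
    probs = []
    for t in token_ids:
        h = np.tanh(W_h @ h + W_x @ E[t] + b)   # hidden state carries context
        probs.append(softmax(W_o @ h + b_o))
    return probs

out = rnn_lm([3, 7, 7, 1])
assert len(out) == 4 and all(np.isclose(p.sum(), 1.0) for p in out)
```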

LSTM and GRU

Vanishing gradient problem: gradients through many time steps exponentially small. LSTM (Long Short-Term Memory) uses gating mechanisms (input, forget, output gates) to preserve gradients over long sequences. GRU (Gated Recurrent Unit): simplified LSTM, similar performance, fewer parameters.

Training RNN Language Models

Backpropagation through time (BPTT): unroll RNN for T steps, backprop through all steps. Truncated BPTT: limit to 50-100 steps (reduce computation). Gradient clipping: prevent exploding gradients.

State-of-the-Art RNN Models

AWD-LSTM (regularized LSTM): achieves strong results. DropConnect on recurrent weights, variational dropout, and weight tying improve performance. Slower training than transformers but works well for smaller models.

Transformer-Based Language Models

Motivation

RNNs sequential: can't parallelize across time steps. Transformers replace recurrence with attention: all positions attend to all other positions in parallel. Enables massive parallelization, faster training, longer effective context.

Attention Mechanism

Given query Q, key K, value V, attention computes weighted sum of values:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

Softmax weights each position by relevance to query. Each position attends to relevant context.

Self-Attention in Language Models

Query, key, value all from same sequence. Position i attends to positions j ≤ i (causal attention for autoregressive LM). Prevents cheating (looking at future tokens).

Causal mask: positions only attend to past
Self-attention(i) = weighted sum of positions 1..i, where weights learned
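Scaled dot-product attention with a causal mask can be sketched in numpy (self-attention: Q = K = V = the input sequence):

```python
import numpy as np

def causal_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k) + causal mask) V for one sequence."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)              # (n, n) relevance scores
    mask = np.triu(np.ones((n, n), bool), k=1)   # True above diagonal = future
    scores[mask] = -np.inf                       # block attention to future
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)           # softmax over past positions
    return w @ V, w

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                      # 5 positions, d_k = 8
out, weights = causal_attention(X, X, X)         # self-attention: Q = K = V
assert out.shape == (5, 8)
# First position can only attend to itself; no position sees the future.
assert weights[0, 0] == 1.0 and np.allclose(np.triu(weights, k=1), 0)
```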

Multi-Head Attention

Multiple attention heads in parallel (8-12 typically). Each head learns different attention pattern. Allows focusing on different aspects simultaneously.

Transformer Layer

Stack per layer: multi-head self-attention + residual connection + layer norm → feed-forward network + residual connection + layer norm (post-norm). Modern architecture: pre-norm (norm before each sublayer, not after). Enables deeper models.

Self-Attention and Multi-Head Attention

Scaled Dot-Product Attention

Scale factor 1/sqrt(d_k) prevents attention weights saturating when d_k is large. Without scaling: large dot products push softmax outputs near 0 or 1 (saturated), yielding small gradients.

Multi-Head Details

H heads, each projects to d_k-dimensional subspace. Compute attention separately, concatenate results. Enables modeling different semantic relationships in parallel.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_o
head_i = Attention(Q*W_i^Q, K*W_i^K, V*W_i^V)

Each head learns separate projection and attention pattern
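A numpy sketch of the multi-head computation (random untrained projections, no causal mask for brevity; dimensions hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, h = 6, 16, 4
d_k = d_model // h                    # each head works in a d_k-dim subspace

X = rng.normal(size=(n, d_model))
Wq = rng.normal(size=(h, d_model, d_k)) * 0.1   # per-head query projections
Wk = rng.normal(size=(h, d_model, d_k)) * 0.1   # per-head key projections
Wv = rng.normal(size=(h, d_model, d_k)) * 0.1   # per-head value projections
Wo = rng.normal(size=(d_model, d_model)) * 0.1  # output mixing W_o

def attn(Q, K, V):
    """Scaled dot-product attention for one head."""
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

# Each head projects, attends, then results are concatenated and mixed.
heads = [attn(X @ Wq[i], X @ Wk[i], X @ Wv[i]) for i in range(h)]
out = np.concatenate(heads, axis=-1) @ Wo
assert out.shape == (n, d_model)
```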

Positional Encoding

Attention mechanism permutation-invariant: position-agnostic. Add positional encodings to input embeddings. Sinusoidal positional encoding: PE_{pos, 2i} = sin(pos / 10000^{2i/d}), PE_{pos, 2i+1} = cos(pos / 10000^{2i/d}). Enables learning position-dependent relationships.
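The sinusoidal scheme in numpy:

```python
import numpy as np

def sinusoidal_pe(max_len, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d, 2)[None, :]                # even dimensions
    angles = pos / (10000 ** (i / d))
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(128, 64)
assert pe.shape == (128, 64)
# Position 0: all sines are 0, all cosines are 1.
assert np.allclose(pe[0, 0::2], 0.0) and np.allclose(pe[0, 1::2], 1.0)
```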

GPT Architecture and Autoregressive Generation

GPT Design

Decoder-only transformer. Autoregressive: predict next token given previous. Causal self-attention masks future (can't cheat). Fine-tune on downstream tasks by adding task-specific head.

Generation Process

Given prompt, generate token-by-token. At each step, output probability distribution over vocabulary. Select next token (greedy, sampling, top-k, top-p). Append to context, repeat.

Prompt: "The capital of France is"
Step 1: P(w | "The capital of France is") → sample/argmax → "Paris"
Step 2: P(w | "The capital of France is Paris") → sample/argmax → "."
Step 3: P(w | "The capital of France is Paris.") → sample/argmax
...continues until stop token or max length
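The loop itself is simple; the toy sketch below uses a random stand-in for the model's next-token distribution (the vocabulary and next_token_probs are hypothetical, not a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["Paris", "is", "the", "capital", ".", "<eos>"]

def next_token_probs(context):
    """Stand-in for a real LM: ignores context, returns a random
    distribution over the vocabulary (hypothetical)."""
    logits = rng.normal(size=len(vocab))
    z = np.exp(logits - logits.max())
    return z / z.sum()

def generate(prompt, max_len=10):
    """Token-by-token loop: predict, pick, append, repeat."""
    tokens = prompt.split()
    for _ in range(max_len):
        p = next_token_probs(tokens)
        tok = vocab[int(np.argmax(p))]        # greedy; could sample instead
        if tok == "<eos>":                    # stopping criterion
            break
        tokens.append(tok)
    return " ".join(tokens)

out = generate("The capital of France is")
assert out.startswith("The capital of France is")
```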

Scaling Laws

Model performance improves predictably with scale: parameters, data, compute. GPT-3 (175B params) vs. GPT-2 (1.5B): massive improvement. Emergent abilities appear: few-shot learning, instruction following emerge at scale, not present in smaller models.

In-Context Learning

GPT-3: few-shot learning via examples in prompt. No gradient update, only forward pass. "Show me 3 examples, now do this task." Enables rapid adaptation.

BERT Architecture and Bidirectional Context

Bidirectional Context

BERT unlike GPT: no causal masking. Each position attends to all other positions. Bidirectional context helps understanding (classification). Sacrifices autoregressive generation (can't sample sequentially).

Pre-Training Objectives

Masked Language Modeling (MLM): Randomly mask 15% of tokens. Predict masked words from surrounding context. Forces learning bidirectional representations.

Next Sentence Prediction (NSP): Predict if two sentences consecutive in corpus. Helps with understanding relationships between sentences.

Input: [CLS] The cat sat [MASK] the mat [SEP] The dog [MASK] [SEP]
       (sentence A: "The cat sat [MASK] the mat"; sentence B: "The dog [MASK]")
MLM: predict "on" at the first [MASK], "ran" at the second
NSP: predict whether sentence B follows sentence A in the corpus

Fine-Tuning for Tasks

Add task-specific head on [CLS] token. Classification: linear layer → softmax. NER: linear layer per token. Question answering: predict start/end token positions. Minimal architecture changes, pre-trained weights do heavy lifting.

Pre-Training and Fine-Tuning

Pre-Training Scale

BERT: trained on 3.3B words (Books + Wikipedia). GPT-3: trained on 300B tokens. Massive unlabeled data availability. Pre-training expensive (days to weeks on large TPU/GPU clusters). But amortized over thousands of fine-tuning tasks.

Data Efficiency

Fine-tuning requires modest labeled data (100s to 10,000s examples). Without pre-training: need 100,000+ examples for comparable performance. Pre-training massive efficiency gain.

Fine-Tuning Strategies

  • Feature Extraction: Freeze pre-trained weights, train task head. Fast, works when task similar to pre-training objective.
  • Full Fine-tuning: Update all weights. Better final performance, requires more data and computation.
  • Adapter Modules: Insert small trainable modules, freeze backbone. Parameter-efficient, avoids catastrophic forgetting.
  • LoRA (Low-Rank Adaptation): Learn low-rank updates to weights. Minimal parameters, efficient, enables fast task switching.
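The LoRA idea can be sketched in numpy: the frozen weight W is untouched, and only two small low-rank factors are trained (dimensions and the scaling convention y = Wx + (alpha/r)·ABx follow the common formulation; one factor starts at zero so training begins exactly from the frozen model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 512, 512, 8          # rank r much smaller than d

W = rng.normal(size=(d_out, d_in))    # frozen pre-trained weight
A = rng.normal(size=(d_out, r)) * 0.01
B = np.zeros((r, d_in))               # zero init: update starts as a no-op

def lora_forward(x, alpha=16):
    """y = W x + (alpha/r) * A B x; only A and B are trained."""
    return W @ x + (alpha / r) * (A @ (B @ x))

x = rng.normal(size=d_in)
# With B = 0 the adapted model matches the frozen model exactly.
assert np.allclose(lora_forward(x), W @ x)
# Trainable parameters: r*(d_in + d_out) vs d_in*d_out for full fine-tuning.
assert A.size + B.size < 0.04 * W.size
```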

Catastrophic Forgetting

Fine-tuning on a new task can overwrite pre-training knowledge. Mitigations: small learning rate, early stopping, regularization (e.g., L2 penalty on weight changes).

Decoding Strategies and Text Generation

Greedy Decoding

Select highest-probability token at each step. Fast, deterministic. Often poor quality (misses better sequences, gets stuck in loops).

Beam Search

Keep K best hypotheses (beams). Expand, prune to top K. Trade-off: K=1 is greedy, K=∞ is exhaustive. K=3-5 typical. Better quality than greedy, still tractable.
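A compact sketch of beam search over a toy bigram model (the log_p table is hypothetical):

```python
import math

# Toy next-token log-probabilities (hypothetical values for illustration).
log_p = {
    "<s>": {"the": math.log(0.6), "a": math.log(0.4)},
    "the": {"cat": math.log(0.5), "dog": math.log(0.5)},
    "a":   {"cat": math.log(0.9), "dog": math.log(0.1)},
    "cat": {"</s>": 0.0},
    "dog": {"</s>": 0.0},
}

def beam_search(k=2, max_len=5):
    """Keep the k best partial hypotheses, expand each, prune back to k."""
    beams = [(["<s>"], 0.0)]                       # (tokens, total log-prob)
    for _ in range(max_len):
        candidates = []
        for toks, score in beams:
            if toks[-1] == "</s>":
                candidates.append((toks, score))   # finished: carry forward
                continue
            for w, lp in log_p[toks[-1]].items():
                candidates.append((toks + [w], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

# Greedy would commit to "the" (0.6), but "a cat" (0.4 * 0.9 = 0.36)
# beats "the cat" (0.6 * 0.5 = 0.3); beam search finds it.
best, score = beam_search(k=2)[0]
assert best == ["<s>", "a", "cat", "</s>"]
```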

Sampling and Temperature

Sample from distribution (not argmax). Temperature T controls sharpness: T > 1 flattens the distribution, T < 1 sharpens it. T = 1: unmodified distribution; T → 0: greedy; T → ∞: uniform.

P_T(w) = P(w)^{1/T} / sum_w' P(w')^{1/T}

High temperature: more diversity, risk incoherence
Low temperature: more confident, risk repetition
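Temperature scaling applied to probabilities (equivalent to dividing logits by T before the softmax):

```python
import numpy as np

def apply_temperature(p, T):
    """P_T(w) ∝ P(w)^(1/T): T > 1 flattens, T < 1 sharpens the distribution."""
    q = p ** (1.0 / T)
    return q / q.sum()

p = np.array([0.7, 0.2, 0.1])
assert np.argmax(apply_temperature(p, 0.5)) == 0       # ranking preserved
assert apply_temperature(p, 0.5)[0] > p[0]             # sharper: top grows
assert apply_temperature(p, 2.0)[0] < p[0]             # flatter: top shrinks
assert np.allclose(apply_temperature(p, 1.0), p)       # T = 1: unchanged
```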

Top-K and Top-P (Nucleus Sampling)

Top-K: sample only from K highest-probability tokens. Top-P: sample from smallest set of tokens with cumulative probability ≥ P. Avoids extremely unlikely tokens. Better quality than raw sampling.
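Both filters are short in numpy (toy distribution; threshold values chosen for illustration):

```python
import numpy as np

def top_k_filter(p, k):
    """Zero out all but the k most probable tokens, then renormalize."""
    q = np.zeros_like(p)
    keep = np.argsort(p)[-k:]                   # indices of k largest probs
    q[keep] = p[keep]
    return q / q.sum()

def top_p_filter(p, top_p):
    """Keep the smallest set of tokens whose cumulative probability >= top_p."""
    order = np.argsort(p)[::-1]                 # most probable first
    cum = np.cumsum(p[order])
    cutoff = int(np.searchsorted(cum, top_p)) + 1
    q = np.zeros_like(p)
    q[order[:cutoff]] = p[order[:cutoff]]
    return q / q.sum()

p = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
assert np.count_nonzero(top_k_filter(p, 2)) == 2
assert np.count_nonzero(top_p_filter(p, 0.75)) == 2   # {0.5, 0.3} covers 0.75
```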

Stopping Criteria

Stop on end-of-sequence token, max length, or custom criteria. Avoid repetition: penalize repeated n-grams. Length penalty: normalize score by sequence length.

Applications and Impact

Machine Translation

Sequence-to-sequence: encoder reads source, decoder generates target. Attention attends to relevant source words. Transformers state-of-the-art. BLEU scores improved from 20s (SMT) to 30s+ (neural).

Text Generation

Story generation, summarization, paraphrasing. GPT-2 impressive generation despite simple pre-training objective. GPT-3 few-shot: generate by example. Applications: code generation, dialogue, creative writing.

Question Answering

SQuAD benchmark: machine reading comprehension. BERT fine-tuning achieves human-level performance. Downstream: customer service chatbots, FAQ systems.

Semantic Understanding

BERT embeddings used for sentence similarity, clustering, semantic search. Better than TF-IDF, word2vec for semantic tasks. Enable retrieval-augmented generation (find relevant context, generate answer).

Instruction Following

GPT-3 follows natural language instructions: "summarize the following in 3 bullet points", "write a poem about cats". Emergent ability from scale. InstructGPT fine-tuning improves alignment with user intent.

References

  • Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. "A Neural Probabilistic Language Model." JMLR, vol. 3, 2003.
  • Mikolov, T., Karafiát, M., Burget, L., et al. "Recurrent Neural Network based Language Model." INTERSPEECH, 2010.
  • Vaswani, A., Shazeer, N., Parmar, N., et al. "Attention Is All You Need." NIPS, 2017.
  • Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL, 2019.
  • Brown, T. B., Mann, B., Ryder, N., et al. "Language Models are Few-Shot Learners." NeurIPS, 2020.