Introduction
Transformers: neural architectures introduced in 2017, designed for sequence-to-sequence tasks. Core innovation: attention mechanism replaces recurrence and convolution. Benefits: parallelizable training, long-range dependency modeling, state-of-the-art results in NLP and beyond.
"The Transformer allows for significantly more parallelization and thus reduced training times." -- Ashish Vaswani et al., 2017
Background and Motivation
Limitations of RNNs and CNNs
Recurrent Neural Networks (RNNs): sequential processing, vanishing gradients, slow training. Convolutional Neural Networks (CNNs): local receptive fields, limited context. Need: model long-range dependencies with efficient parallelism.
Emergence of Attention
Attention: dynamic weighting of input elements, enables model to focus on relevant parts of sequences. Initially used in encoder-decoder RNNs for machine translation. Foundation for Transformer design.
Sequence-to-Sequence Tasks
Tasks: machine translation, summarization, text generation. Challenge: capture context beyond fixed windows. Transformers address these effectively with attention-based representations.
Architecture Overview
Encoder-Decoder Structure
Encoder: multiple identical layers, each with self-attention and feed-forward sublayers. Decoder: similar layers plus encoder-decoder attention over the encoder output. Positional encodings added to the input embeddings before the first layer.
Layer Components
Multi-head attention: parallel attention mechanisms. Position-wise feed-forward networks: fully connected layers applied identically at each position. Residual connections and layer normalization stabilize training.
Data Flow
Input tokens → embeddings → positional encodings → encoder layers → decoder layers with masked self-attention → output predictions.
| Component | Function |
|---|---|
| Multi-head Attention | Captures different representation subspaces by parallel attention heads |
| Feed-Forward Network | Applies position-wise nonlinear transformation to encoder/decoder outputs |
| Positional Encoding | Injects sequence order information into embeddings |
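The layer components above can be sketched as a single encoder layer in NumPy. This is a minimal illustration, not a full implementation: `self_attn` and `ffn` are assumed to be caller-supplied callables, and the post-norm residual arrangement follows the original paper.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's feature vector to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_layer(x, self_attn, ffn):
    """Post-norm residual arrangement:
    x -> LayerNorm(x + SelfAttn(x)) -> LayerNorm(... + FFN(...))."""
    x = layer_norm(x + self_attn(x))   # attention sublayer with residual
    x = layer_norm(x + ffn(x))         # feed-forward sublayer with residual
    return x
```

Each sublayer keeps the model dimension unchanged, which is what makes the residual additions well-defined.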
Attention Mechanisms
Scaled Dot-Product Attention
Inputs: query (Q), key (K), and value (V) matrices. Computation: attention weights via softmax(QK^T / √d_k), then a weighted sum over V. Dividing by √d_k keeps dot products from growing with d_k, which would otherwise push the softmax into regions with vanishingly small gradients.
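The computation can be written directly in NumPy. A minimal sketch, assuming 2-D input matrices; the function name is illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V and the attention weights."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) scaled similarities
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V, weights
```

Each output row is a convex combination of the value rows, with mixing weights determined by query-key similarity.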
Multi-Head Attention
Multiple attention heads learn diverse feature representations. Each head projects Q, K, V to lower-dimensional spaces. Outputs concatenated and linearly transformed.
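The projection-concatenate-mix pattern can be sketched as follows. This is an assumption-laden illustration: projection matrices are passed in explicitly rather than learned, and the function names are hypothetical.

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(Q, K, V, head_projs, W_o):
    """head_projs: one (W_q, W_k, W_v) triple per head.
    Each head attends in its own lower-dimensional subspace;
    the head outputs are concatenated and mixed by W_o."""
    heads = [attention(Q @ Wq, K @ Wk, V @ Wv) for Wq, Wk, Wv in head_projs]
    return np.concatenate(heads, axis=-1) @ W_o
```

With h heads of dimension d_k = d_model / h, the total computation is comparable to a single full-dimensional head, but each head can specialize to a different representation subspace.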
Mathematical Formulation
Attention(Q, K, V) = softmax(QK^T / √d_k) V

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

Self-Attention
Definition and Role
Self-attention: attention where Q, K, V come from the same source sequence. Enables each token to attend to all tokens including itself, capturing context globally.
Computational Complexity
Complexity: O(n^2 * d), where n = sequence length, d = embedding dimension. Bottleneck for very long sequences. Variants proposed for efficiency.
Benefits
Parallelizable across sequence positions. Models dependencies regardless of distance. Avoids recurrence’s sequential constraints.
Positional Encoding
Need for Position Information
Transformers lack inherent sequence-order awareness. Positional encoding adds explicit position information to the embeddings.
Sine and Cosine Functions
Original method: fixed sinusoidal functions with different frequencies. Formula:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Learned Embeddings
Alternative: position embeddings learned during training. Pros: adapt to the data. Cons: generalize less well to sequences longer than those seen in training.
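The fixed sinusoidal scheme is straightforward to implement. A sketch assuming an even model dimension; the function name is illustrative:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)).
    Assumes d_model is even."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)  # one frequency per pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even dims get sine
    pe[:, 1::2] = np.cos(angle)                  # odd dims get cosine
    return pe
```

Each dimension pair oscillates at a different frequency, so nearby positions get similar encodings while distant positions remain distinguishable.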
Training Paradigms
Masked Language Modeling (MLM)
Randomly mask tokens, predict masked tokens given context. Used in BERT, enables bidirectional context understanding.
Autoregressive Training
Predict next token given previous tokens. Used in GPT models. Enables generative capabilities.
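Autoregressive training is enforced with a causal mask: each position may attend only to itself and earlier positions. A minimal sketch, with illustrative function names:

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular boolean mask: position t may attend to positions <= t."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_scores(scores):
    """Set disallowed (future) positions to -inf before the softmax,
    so they receive zero attention weight."""
    n = scores.shape[-1]
    return np.where(causal_mask(n), scores, -np.inf)
```

Applying the softmax after this masking zeroes out all attention to future tokens, so the model cannot cheat when predicting the next token.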
Sequence-to-Sequence Learning
Trains encoder-decoder on input-output pairs. Tasks: translation, summarization. Loss: cross-entropy on output tokens.
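The per-token cross-entropy loss can be written compactly over model logits. A sketch assuming a 2-D logits array of shape (num_tokens, vocab_size); the function name is illustrative:

```python
import numpy as np

def cross_entropy(logits, target_ids):
    """Mean negative log-likelihood of the target token at each position."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()
```

As a sanity check, uniform logits over a vocabulary of size V give a loss of log V, the entropy of a uniform guess.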
Notable Transformer Models
BERT (Bidirectional Encoder Representations from Transformers)
Bidirectional encoder, trained with MLM and next sentence prediction. Excels in understanding tasks.
GPT (Generative Pre-trained Transformer)
Unidirectional decoder, trained autoregressively. Strong in text generation and few-shot learning.
Transformer-XL and Longformer
Address long-range dependency limits via segment recurrence and sparse attention respectively.
| Model | Key Feature | Primary Use |
|---|---|---|
| BERT | Bidirectional encoding, MLM pretraining | Natural language understanding |
| GPT | Unidirectional decoding, autoregressive | Text generation |
| Transformer-XL | Segment-level recurrence, relative positional encoding | Long context modeling |
| Longformer | Sparse attention for efficiency on long sequences | Efficient long document processing |
Applications
Natural Language Processing (NLP)
Tasks: machine translation, sentiment analysis, question answering, summarization, language modeling.
Computer Vision
Vision Transformers (ViT): apply transformer architecture to image patches. Used for classification, detection, segmentation.
Speech and Audio Processing
Speech recognition, synthesis, and audio classification. Transformers model temporal dependencies in signals.
Other Domains
Protein folding, reinforcement learning, graph representation learning. Transformers increasingly cross-disciplinary.
Advantages and Limitations
Advantages
Parallelizable training, captures global context, flexible input-output lengths, strong transfer learning performance.
Limitations
Quadratic complexity in sequence length limits scalability. Requires large datasets and compute resources. Interpretability challenges.
Mitigation Strategies
Sparse attention, linearized attention approximations, memory-augmented models, distillation for smaller models.
Future Directions
Efficient Attention Models
Exploration of linear and sparse attention to reduce computational load and memory usage.
Multimodal Transformers
Integrate vision, language, audio inputs for unified representations and cross-modal generation.
Explainability and Robustness
Develop methods to interpret attention distributions and improve model stability against adversarial inputs.
Scaling and Adaptation
Scaling models with billions of parameters; adapting transformers for low-resource languages and domains.
References
- Vaswani, A., et al. "Attention is All You Need." Advances in Neural Information Processing Systems, vol. 30, 2017, pp. 5998–6008.
- Devlin, J., et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv preprint arXiv:1810.04805, 2018.
- Radford, A., et al. "Language Models are Unsupervised Multitask Learners." OpenAI Blog, 2019.
- Dai, Z., et al. "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context." ACL, 2019, pp. 2978–2988.
- Beltagy, I., Peters, M. E., and Cohan, A. "Longformer: The Long-Document Transformer." arXiv preprint arXiv:2004.05150, 2020.