Introduction

Transformers: neural architectures introduced in 2017, designed for sequence-to-sequence tasks. Core innovation: attention mechanism replaces recurrence and convolution. Benefits: parallelizable training, long-range dependency modeling, state-of-the-art results in NLP and beyond.

"The Transformer allows for significantly more parallelization and thus reduced training times." -- Ashish Vaswani et al., 2017

Background and Motivation

Limitations of RNNs and CNNs

Recurrent Neural Networks (RNNs): sequential processing, vanishing gradients, slow training. Convolutional Neural Networks (CNNs): local receptive fields, limited context. Need: model long-range dependencies with efficient parallelism.

Emergence of Attention

Attention: dynamic weighting of input elements, enables model to focus on relevant parts of sequences. Initially used in encoder-decoder RNNs for machine translation. Foundation for Transformer design.

Sequence-to-Sequence Tasks

Tasks: machine translation, summarization, text generation. Challenge: capture context beyond fixed windows. Transformers address these effectively with attention-based representations.

Architecture Overview

Encoder-Decoder Structure

Encoder: multiple identical layers, each with self-attention and feed-forward sublayers. Decoder: similar layers plus encoder-decoder attention. Input embeddings passed through positional encoding.

Layer Components

Multi-head attention: parallel attention mechanisms. Position-wise feed-forward networks: fully connected layers applied identically at each position. Residual connections and layer normalization stabilize training.

Data Flow

Input tokens → embeddings → positional encodings → encoder layers → decoder layers with masked self-attention → output predictions.
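
The data flow above can be sketched for a single encoder layer. This is a minimal illustrative implementation, not the full architecture: it uses one attention head with identity projections and omits learned layer-norm scale/shift parameters, but it shows the sublayer → residual → normalization pattern described here.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector (no learned scale/shift here).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x):
    # Single-head self-attention with identity projections, for illustration.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ x

def encoder_layer(x, w1, w2):
    # Sublayer 1: self-attention with residual connection and layer norm.
    x = layer_norm(x + self_attention(x))
    # Sublayer 2: position-wise feed-forward network (ReLU), residual + norm.
    ffn = np.maximum(x @ w1, 0) @ w2
    return layer_norm(x + ffn)

rng = np.random.default_rng(0)
n, d, d_ff = 5, 8, 16                # toy sizes: 5 tokens, model dimension 8
x = rng.normal(size=(n, d))          # token embeddings + positional encodings
out = encoder_layer(x, rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d)))
print(out.shape)  # (5, 8): same shape as the input, ready for the next layer
```

Because each layer maps (n, d) to (n, d), identical layers can be stacked, which is exactly the encoder structure described above.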

Component: function
  • Multi-head Attention: captures different representation subspaces via parallel attention heads.
  • Feed-Forward Network: applies a position-wise nonlinear transformation to encoder/decoder outputs.
  • Positional Encoding: injects sequence order information into embeddings.

Attention Mechanisms

Scaled Dot-Product Attention

Inputs: query matrix Q, key matrix K, value matrix V. Computation: attention weights via softmax(QK^T / √d_k), then a weighted sum over V. Dividing by √d_k keeps the dot products from growing with the key dimension d_k; without it, large scores push the softmax into saturated regions with vanishingly small gradients.
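
A minimal NumPy sketch of this computation, with a numerically stable softmax:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(1)
Q = rng.normal(size=(4, 8))   # 4 queries, d_k = 8
K = rng.normal(size=(6, 8))   # 6 keys
V = rng.normal(size=(6, 8))   # 6 values
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)              # (4, 8): one weighted value vector per query
print(w.sum(axis=-1))         # each row of attention weights sums to 1
```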

Multi-Head Attention

Multiple attention heads learn diverse feature representations. Each head projects Q, K, V to lower-dimensional spaces. Outputs concatenated and linearly transformed.

Mathematical Formulation

Attention(Q, K, V) = softmax(QK^T / √d_k) V
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
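
The formulation maps directly to code. A sketch with h = 4 heads, each projecting into a d_model / h dimensional subspace (projection matrices are random here purely for illustration; in a trained model they are learned):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention for one head.
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    # Wq, Wk, Wv: per-head projection matrices W_i^Q, W_i^K, W_i^V;
    # Wo: output projection W^O.
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo  # Concat(head_1..head_h) W^O

rng = np.random.default_rng(2)
n, d_model, h = 5, 16, 4
d_k = d_model // h                   # each head works in a smaller subspace
x = rng.normal(size=(n, d_model))
Wq = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wo = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(x, x, x, Wq, Wk, Wv, Wo)  # self-attention: Q = K = V
print(out.shape)  # (5, 16)
```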

Self-Attention

Definition and Role

Self-attention: attention where Q, K, V come from the same source sequence. Enables each token to attend to all tokens including itself, capturing context globally.

Computational Complexity

Complexity: O(n^2 * d), where n = sequence length, d = embedding dimension. Bottleneck for very long sequences. Variants proposed for efficiency.
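
The quadratic term comes from the n × n attention matrix each head materializes. A quick illustration of how that cost grows with sequence length:

```python
def attention_matrix_entries(n, heads=8):
    # Each head scores every position against every other: n * n entries.
    return heads * n * n

for n in (1024, 2048, 4096):
    print(n, attention_matrix_entries(n))
# Doubling the sequence length quadruples the attention cost,
# which is why very long sequences become a bottleneck.
```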

Benefits

Parallelizable across sequence positions. Models dependencies regardless of distance. Avoids recurrence’s sequential constraints.

Positional Encoding

Need for Position Information

Transformers lack inherent awareness of sequence order. Positional encoding injects explicit position information into the embeddings.

Sine and Cosine Functions

Original method: fixed sinusoidal functions with different frequencies. Formula:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
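
A vectorized NumPy sketch of these formulas; even-indexed dimensions get the sine, odd-indexed dimensions the cosine:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    pos = np.arange(max_len)[:, None]          # (max_len, 1) positions
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2) frequency indices
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # even dimensions
    pe[:, 1::2] = np.cos(angle)                # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # (50, 16): one encoding vector per position, added to embeddings
print(pe[0, :4])  # position 0: sin(0) = 0 and cos(0) = 1 alternate -> [0, 1, 0, 1]
```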

Learned Embeddings

Alternative: position embeddings learned during training. Pros: adapt to data. Cons: less generalization to longer sequences.

Training Paradigms

Masked Language Modeling (MLM)

Randomly mask tokens, predict masked tokens given context. Used in BERT, enables bidirectional context understanding.
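
A simplified sketch of the masking step (BERT's actual recipe also sometimes keeps the original token or substitutes a random one; here every selected token is replaced with [MASK]):

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", p=0.15, seed=1):
    # Illustrative MLM corruption: replace roughly p of the tokens with [MASK].
    # The model's training objective is to predict the originals at those positions.
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            masked.append(mask_token)
            targets[i] = tok        # position -> original token to predict
        else:
            masked.append(tok)
    return masked, targets

sentence = "the transformer relies entirely on attention mechanisms".split()
masked, targets = mask_tokens(sentence)
print(masked)   # sentence with some tokens replaced by [MASK]
print(targets)  # the held-out originals the model must recover
```

Because the masked positions can draw on tokens both before and after them, the objective forces bidirectional context understanding.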

Autoregressive Training

Predict next token given previous tokens. Used in GPT models. Enables generative capabilities.
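
The "given previous tokens" constraint is enforced with a causal (lower-triangular) mask on the attention scores, a sketch of which follows: future positions are set to -inf before the softmax, so they receive exactly zero weight.

```python
import numpy as np

def causal_mask(n):
    # Lower-triangular mask: position i may attend only to positions <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def masked_attention_weights(scores):
    # Set disallowed (future) positions to -inf before the softmax.
    n = scores.shape[0]
    scores = np.where(causal_mask(n), scores, -np.inf)
    e = np.exp(scores - scores.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(3)
w = masked_attention_weights(rng.normal(size=(4, 4)))
print(np.triu(w, k=1))  # strictly upper triangle is all zeros: no peeking ahead
```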

Sequence-to-Sequence Learning

Trains encoder-decoder on input-output pairs. Tasks: translation, summarization. Loss: cross entropy on output tokens.
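
The loss term can be sketched directly: cross entropy is the average negative log-likelihood the decoder assigns to the reference output tokens (logits here are random stand-ins for decoder outputs):

```python
import numpy as np

def cross_entropy(logits, target_ids):
    # Average negative log-likelihood of the reference tokens, via log-softmax.
    logits = logits - logits.max(-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

rng = np.random.default_rng(4)
logits = rng.normal(size=(3, 10))   # decoder outputs: 3 positions, vocab of 10
targets = np.array([2, 7, 1])       # reference token ids
print(cross_entropy(logits, targets))
```

The loss approaches zero as the model concentrates probability mass on the correct token at each position.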

Notable Transformer Models

BERT (Bidirectional Encoder Representations from Transformers)

Bidirectional encoder, trained with MLM and next sentence prediction. Excels in understanding tasks.

GPT (Generative Pre-trained Transformer)

Unidirectional decoder, trained autoregressively. Strong in text generation and few-shot learning.

Transformer-XL and Longformer

Address long-range dependency limits via segment recurrence and sparse attention respectively.

Model: key feature and primary use
  • BERT: bidirectional encoding, MLM pretraining. Primary use: natural language understanding.
  • GPT: unidirectional decoding, autoregressive training. Primary use: text generation.
  • Transformer-XL: segment-level recurrence, relative positional encoding. Primary use: long-context modeling.
  • Longformer: sparse attention for efficiency on long sequences. Primary use: efficient long-document processing.

Applications

Natural Language Processing (NLP)

Tasks: machine translation, sentiment analysis, question answering, summarization, language modeling.

Computer Vision

Vision Transformers (ViT): apply transformer architecture to image patches. Used for classification, detection, segmentation.

Speech and Audio Processing

Speech recognition, synthesis, and audio classification. Transformers model temporal dependencies in signals.

Other Domains

Protein folding, reinforcement learning, graph representation learning. Transformers increasingly cross-disciplinary.

Advantages and Limitations

Advantages

Parallelizable training, captures global context, flexible input-output lengths, strong transfer learning performance.

Limitations

Quadratic complexity in sequence length limits scalability. Requires large datasets and compute resources. Interpretability challenges.

Mitigation Strategies

Sparse attention, linearized attention approximations, memory-augmented models, distillation for smaller models.

Future Directions

Efficient Attention Models

Exploration of linear and sparse attention to reduce computational load and memory usage.

Multimodal Transformers

Integrate vision, language, audio inputs for unified representations and cross-modal generation.

Explainability and Robustness

Develop methods to interpret attention distributions and improve model stability against adversarial inputs.

Scaling and Adaptation

Scaling models with billions of parameters; adapting transformers for low-resource languages and domains.

References

  • Vaswani, A., et al. "Attention is All You Need." Advances in Neural Information Processing Systems, vol. 30, 2017, pp. 5998–6008.
  • Devlin, J., et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv preprint arXiv:1810.04805, 2018.
  • Radford, A., et al. "Language Models are Unsupervised Multitask Learners." OpenAI Blog, 2019.
  • Dai, Z., et al. "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context." ACL, 2019, pp. 2978–2988.
  • Beltagy, I., Peters, M. E., and Cohan, A. "Longformer: The Long-Document Transformer." arXiv preprint arXiv:2004.05150, 2020.