Introduction
Transfer learning reuses knowledge from one domain to improve learning in another. Core insight: representations learned for one task are often useful for related tasks. Rather than learning from scratch, leverage pre-trained models, features, or knowledge structures.
Motivation: labeled data is expensive and time-consuming to collect, and training from scratch is slow. Transfer learning enables learning with limited data: knowledge from a large source domain accelerates learning in a small target domain. ImageNet pre-training fundamentally changed computer vision; transformers pre-trained on massive text corpora dominate NLP.
Transfer mechanism varies: parameter sharing (shared weights), feature reuse (use learned representations), knowledge distillation (compress knowledge into a small model). Success depends on domain similarity: the more similar source and target, the more knowledge transfers.
"Transfer learning is one of the most important capabilities for building practical machine learning systems. Most real applications don't have millions of labeled examples." -- Andrew Ng, Stanford University
Motivation and Problem Definition
Traditional Learning Problem
Source and target domains identical: P(X)=P(X'), P(Y|X)=P(Y'|X'). Train on source, test on source. Millions of labeled examples available.
Transfer Learning Problem
Source and target differ: P(X)≠P(X') or P(Y|X)≠P(Y'|X'). Data distribution shift. Labeled target data scarce or nonexistent. Question: how to leverage source knowledge despite domain difference?
Key Challenges
- Domain Shift: Feature distributions differ. P(X_source) ≠ P(X_target). Input statistics changed.
- Label Shift: Class proportions differ. P(Y_source) ≠ P(Y_target). Imbalanced target classes.
- Concept Drift: Decision boundary changed. P(Y|X_source) ≠ P(Y|X_target). What defined class in source irrelevant in target.
- Negative Transfer: Source knowledge harmful. Target performance worse using source than learning from scratch.
Knowledge vs. Capacity vs. Data
Three drivers of learning: (1) model capacity (parameters), (2) training data quantity, (3) relevant prior knowledge. Transfer learning trades capacity/data for knowledge: use smaller model with less target data by incorporating source knowledge.
Transfer Learning Settings
Inductive Transfer Learning
Source and target tasks differ, target labels available (small amounts). Learn target-specific hypothesis using source knowledge. Example: medical image classification (source: natural images from ImageNet; target: X-rays).
Transductive Transfer Learning
Source and target tasks same, domains differ, target labels unavailable. Example: sentiment analysis trained on product reviews (source), applied to movie reviews (target). No labeled target data.
Unsupervised Transfer Learning
Source has labels, target has no labels, tasks may differ. Discover common structure without target supervision. Example: clustering documents from different corpora.
Domain Generalization
Train on multiple source domains, test on unseen target domain. Objective: learn domain-invariant representations working across diverse domains.
| Setting | Source Labels | Target Labels | Task Similarity |
|---|---|---|---|
| Inductive | Yes (abundant) | Yes (few) | Same or different |
| Transductive | Yes (abundant) | No | Same |
| Unsupervised | Yes (abundant) | No | Different |
| Domain Generalization | Yes (multiple domains) | No | Same across diverse domains |
Fine-Tuning and Pre-Training
Standard Fine-Tuning Approach
Pre-train model on large source dataset. Transfer learned weights to target. Fine-tune on small target dataset with lower learning rate (smaller updates preserve source knowledge).
1. Pre-training: train deep network on source (ImageNet, 1M+ examples, thousands of classes)
2. Transfer: copy weights to new network with target architecture
3. Fine-tuning: train on target with lower learning rate (0.1x to 0.01x of pre-training)
   - Early layers (features) often frozen; only fine-tune later layers (task-specific)
   - Or fine-tune all layers with a lower learning rate
Layer-Wise Transfer
Early layers learn generic low-level features (edges, textures). Late layers learn task-specific patterns. Strategy: freeze early layers, fine-tune late layers when target data limited. As target data increases, fine-tune more layers.
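The freeze-early, fine-tune-late strategy can be sketched in a few lines. This is a minimal numpy illustration, not a real pre-trained model: the "pre-trained" first layer `W1` is a hypothetical random projection kept frozen, and only the head `W2` is trained on a toy target task.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "pre-trained" first layer, kept frozen as a fixed feature extractor.
W1 = rng.normal(size=(8, 4))            # early-layer weights: never updated
W2 = rng.normal(size=(4, 1)) * 0.1      # late-layer head: fine-tuned on target

def features(x):
    return np.tanh(x @ W1)              # frozen early layer

# Toy target task with few examples: regress y = sum(x).
X = rng.normal(size=(32, 8))
y = X.sum(axis=1, keepdims=True)

initial_loss = float(np.mean((features(X) @ W2 - y) ** 2))

lr = 0.05                               # low fine-tuning learning rate
for _ in range(500):
    h = features(X)
    grad = h.T @ (h @ W2 - y) / len(X)  # gradient w.r.t. the head only
    W2 -= lr * grad                     # W1 is never touched

final_loss = float(np.mean((features(X) @ W2 - y) ** 2))
```

As target data grows, the same loop would unfreeze `W1` as well, typically with a smaller learning rate.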
Learning Rate Scheduling
Pre-training learning rate: 0.001 to 0.01 (large updates). Fine-tuning learning rate: 0.00001 to 0.001 (tiny updates to preserve knowledge). Discriminative learning rates: different learning rates per layer (earlier layers trained slower to preserve generic features).
Batch Normalization in Transfer
BN statistics (running mean, variance) estimated on source. Options: (1) freeze BN statistics, use source statistics, (2) update BN on target data. Freezing preserves source knowledge. Updating adapts to target distribution. Context-dependent choice.
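The effect of the two BN options can be seen with plain numpy (a sketch, not a BN layer: scalar statistics, illustrative distributions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Source statistics, as BN would store in its running averages
source_mean, source_var = 0.0, 1.0

# Target batch drawn from a shifted distribution (domain shift)
target_batch = rng.normal(loc=2.0, scale=0.5, size=1000)

# Option 1: freeze BN, normalize target with source statistics
frozen = (target_batch - source_mean) / np.sqrt(source_var + 1e-5)

# Option 2: update BN, normalize with target batch statistics
adapted = (target_batch - target_batch.mean()) / np.sqrt(target_batch.var() + 1e-5)
```

With frozen statistics the normalized output keeps the domain shift (mean near 2); with updated statistics the output is re-centered near 0, matching what downstream layers saw during pre-training.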
Success Factors
Transfer effectiveness depends on source-target similarity. Visual recognition: ImageNet pre-training universal, helps most vision tasks. Domain mismatch: medical imaging pre-training requires different source. Task mismatch: pre-training on classification less helpful for detection.
Domain Adaptation
Unsupervised Domain Adaptation
Source labeled, target unlabeled. Goal: learn to classify target despite domain shift. Key insight: reduce domain divergence by learning domain-invariant representations.
Approach: learn features such that source and target are indistinguishable to an adversary
- Adversarial loss: discriminator tries to tell source/target apart
- Feature loss: features from both domains similar (high-level structure preserved)
Domain Adversarial Training
Minimax game: feature extractor minimizes domain classifier loss (confuse classifier), domain classifier maximizes discrimination. Feature extractor learns domain-invariant features.
Feature extractor F: learn representations
Domain classifier D: classify source (0) vs. target (1)
Loss = L_task(F(x_s), y_s) - lambda * L_domain(D(F(x_t)), 1)
(first term: task loss; second term: adversarial loss)
Adversarial loss encourages F(x_t) to be indistinguishable from F(x_s)
Distribution Alignment Methods
Maximum Mean Discrepancy (MMD): Measure distance between distributions. Minimize ||E[F(x_s)] - E[F(x_t)]||. Simple, direct alignment.
CORAL (Correlation Alignment): Align second-order statistics. Match means and covariances. Preserves local geometry better than MMD.
Optimal Transport: Find optimal mapping transporting source distribution to target. Preserves geometric structure. Computationally expensive.
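The MMD and CORAL distances above are easy to compute directly. A numpy sketch with synthetic features (linear-kernel MMD and a simplified CORAL distance; the whitening-style alignment at the end is an illustrative per-dimension version, not the full CORAL transform):

```python
import numpy as np

rng = np.random.default_rng(2)
Xs = rng.normal(0.0, 1.0, size=(500, 3))   # source features
Xt = rng.normal(1.0, 2.0, size=(500, 3))   # shifted, rescaled target features

def linear_mmd(a, b):
    """Squared distance between feature means (linear-kernel MMD)."""
    diff = a.mean(axis=0) - b.mean(axis=0)
    return float(diff @ diff)

def coral_distance(a, b):
    """Squared Frobenius distance between feature covariances (CORAL)."""
    diff = np.cov(a, rowvar=False) - np.cov(b, rowvar=False)
    return float(np.sum(diff ** 2))

# Per-dimension alignment of target statistics to source statistics
Xt_aligned = (Xt - Xt.mean(axis=0)) / Xt.std(axis=0) * Xs.std(axis=0) + Xs.mean(axis=0)
```

After alignment both distances drop sharply, which is exactly what MMD- and CORAL-based methods minimize during training.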
Self-Training and Pseudo-Labeling
Train on source, apply to target, use high-confidence predictions as pseudo-labels. Retrain on source + pseudo-labeled target. Iterative refinement. Works when target distribution similar to source.
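One round of this loop can be sketched with a nearest-centroid classifier on synthetic 2-D data (a toy stand-in for a real model; the confidence rule "keep above-median margin" is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)

# Labeled source: two Gaussian classes; unlabeled target: same classes, shifted.
Xs = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
ys = np.array([0] * 100 + [1] * 100)
Xt = np.vstack([rng.normal(-1.5, 1, (100, 2)), rng.normal(2.5, 1, (100, 2))])

def centroids(X, y):
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(X, cents):
    d = ((X[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1), d

# Round 1: classify target with source-only centroids.
cents = centroids(Xs, ys)
pseudo, dists = predict(Xt, cents)

# Keep only confident pseudo-labels (large margin between class distances).
margin = np.abs(dists[:, 0] - dists[:, 1])
keep = margin > np.median(margin)

# Round 2: retrain on source + confident pseudo-labeled target.
X_aug = np.vstack([Xs, Xt[keep]])
y_aug = np.concatenate([ys, pseudo[keep]])
final_pred, _ = predict(Xt, centroids(X_aug, y_aug))
```

In practice the retrain step repeats for several rounds, lowering the confidence threshold as the model adapts.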
Few-Shot and Zero-Shot Learning
Few-Shot Learning Problem
Learn new classes from a handful of examples. N-way K-shot: N classes, K examples per class (total N*K labeled examples). Extreme transfer scenario: must generalize from minimal data.
Metric Learning Approach
Learn distance metric or similarity function. At test time, classify by similarity to few-shot examples. Prototypical networks: class prototype = mean of support examples. Query example classified to nearest prototype.
Support set: S = {(x_i, y_i)} : few labeled examples
Query example: x_q
Prototype: c_y = mean{ f(x_i) : (x_i, y_i) in S, y_i = y }
Prediction: argmin_y ||f(x_q) - c_y||^2
Meta-Learning (Learning to Learn)
Train on many few-shot tasks, learn optimization procedure or initialization. Model-Agnostic Meta-Learning (MAML): learn initial weights such that one gradient step on task achieves good accuracy.
MAML update:
For each task:
  1. Inner loop: compute loss on support set, take gradient step
  2. Evaluate on query set
Meta update: optimize initial weights to minimize query loss after one gradient step
Zero-Shot Learning
Classify classes never seen during training. Requires auxiliary information: semantic attributes, word embeddings, descriptions. Learn mapping from images to semantic space.
Example: classify animals using attributes (wings, fur, legs). For unseen bird, predict high values for "wings" and "feathers".
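The attribute-matching step can be made concrete. This sketch assumes a hypothetical attribute table (names and values are illustrative, not from any dataset) and an attribute predictor whose output is given directly:

```python
import numpy as np

# Hypothetical attribute table: rows are classes, columns are
# attributes (wings, fur, feathers).
attributes = {
    "dog":  np.array([0.0, 1.0, 0.0]),
    "bat":  np.array([1.0, 1.0, 0.0]),
    "bird": np.array([1.0, 0.0, 1.0]),   # class unseen during training
}

def zero_shot_classify(predicted_attrs):
    """Pick the class whose attribute vector is nearest to the prediction."""
    names = list(attributes)
    dists = [np.linalg.norm(predicted_attrs - attributes[n]) for n in names]
    return names[int(np.argmin(dists))]

# An attribute predictor trained on seen classes outputs high "wings" and
# "feathers" scores for a bird image, even though "bird" was never a label.
pred = zero_shot_classify(np.array([0.9, 0.1, 0.8]))
```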
Advantages and Limitations
Advantages: enable learning with minimal data, discover what model learns about common properties. Limitations: few-shot accuracy lower than supervised, requires good metric/embedding space, zero-shot requires auxiliary info.
Multi-Task Learning
Motivation
Learn multiple related tasks simultaneously. Tasks share representations but differ in specifics. Shared hidden layers capture common structure; task-specific layers capture differences.
Shared representation layer: h = f_shared(x)
Task 1 head: y_1 = W_1 * h + b_1
Task 2 head: y_2 = W_2 * h + b_2
...
Loss: L = w_1*L_1(y_1, y_1_true) + w_2*L_2(y_2, y_2_true)
Hard vs. Soft Parameter Sharing
| Approach | Structure | When to Use |
|---|---|---|
| Hard Parameter Sharing | Shared hidden layers, task-specific output layers | Similar tasks, input modality same |
| Soft Parameter Sharing | Task-specific networks, regularize weights to be similar | Diverse tasks, careful tuning needed |
| Cross-Stitch Networks | Learned mixture of task-specific representations | Moderate task similarity, automatic sharing level |
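The hard-sharing forward pass and weighted loss can be sketched in numpy (shapes, weights, and targets are illustrative placeholders):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hard parameter sharing: one shared layer, two task-specific heads.
W_shared = rng.normal(size=(5, 8)) * 0.1
W1, b1 = rng.normal(size=(8, 1)) * 0.1, np.zeros(1)   # task 1 head (regression)
W2, b2 = rng.normal(size=(8, 3)) * 0.1, np.zeros(3)   # task 2 head (3 outputs)

def forward(x):
    h = np.tanh(x @ W_shared)         # shared representation
    return h @ W1 + b1, h @ W2 + b2   # task-specific outputs

x = rng.normal(size=(16, 5))
y1, y2 = forward(x)

# Weighted multi-task loss (placeholder targets)
t1, t2 = np.zeros((16, 1)), np.zeros((16, 3))
w1, w2 = 1.0, 0.5
L = w1 * np.mean((y1 - t1) ** 2) + w2 * np.mean((y2 - t2) ** 2)
```

Gradients of `L` flow through both heads into `W_shared`, which is what couples the tasks.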
Balancing Task Losses
Tasks have different scales (e.g., Loss_1 in [0,1], Loss_2 in [0,100]); naive summation lets the large-scale task dominate. Solutions: (1) normalize by task-specific loss scales, (2) uncertainty weighting (learn task-specific weights), (3) dynamic weighting adjusting during training.
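Uncertainty weighting (in the style of Kendall et al.) replaces fixed weights with learned log-variances s_i, giving total = sum(exp(-s_i) * L_i + s_i); the regularizer s_i stops the weights from collapsing to zero. A small numpy sketch with illustrative loss values:

```python
import numpy as np

def weighted_total(losses, log_vars):
    """Uncertainty-weighted multi-task loss: sum(exp(-s_i)*L_i + s_i)."""
    losses = np.asarray(losses, dtype=float)
    log_vars = np.asarray(log_vars, dtype=float)
    return float(np.sum(np.exp(-log_vars) * losses + log_vars))

# Raw losses on very different scales, as in the text
L1, L2 = 0.5, 80.0

# For fixed L_i, the optimum of exp(-s)*L + s is s = log(L):
# each term then contributes 1 + log(L_i), so no task dominates.
balanced = weighted_total([L1, L2], [np.log(L1), np.log(L2)])
naive = L1 + L2
```

In training the `log_vars` are ordinary parameters updated by the same optimizer as the network weights.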
Benefits
Shared representations improve generalization. Beneficial for related tasks with limited individual data. Real-world scenario: learn joint model predicting temperature, humidity, pressure. Each task alone has little data, jointly have signal.
Meta-Learning (Learning to Learn)
Core Idea
Learn algorithm or procedure for learning new tasks from few examples. Explicitly optimize for fast adaptation. Meta-training: expose to many tasks, evaluate on held-out test tasks.
MAML Deep Dive
Model-Agnostic Meta-Learning. Learn initial parameters such that single gradient step (or few steps) on new task's support set improves significantly on query set.
Outer loop (meta-training):
  For each task i:
    Inner loop (task adaptation):
      Initialize: theta_i = theta (copy initial parameters)
      For 1-5 steps: theta_i -= alpha * grad L_task_i(theta_i)
    Evaluate: compute loss on query set with theta_i
  Meta-update: theta -= beta * grad_theta [query loss]
Key: the inner-loop loss gradient informs the outer loop. Learns an initialization enabling fast task adaptation.
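The loop can be run end to end on a toy problem. This sketch uses the first-order MAML approximation (the meta-update applies the query gradient at the adapted parameters, skipping second derivatives) on a one-parameter model y_hat = theta * x, with tasks being linear regressions of varying slope:

```python
import numpy as np

rng = np.random.default_rng(5)

def task_loss_grad(theta, x, y):
    """MSE loss and its gradient for the scalar model y_hat = theta * x."""
    err = theta * x - y
    return np.mean(err ** 2), np.mean(2 * err * x)

alpha, beta, theta = 0.1, 0.05, 0.0    # inner LR, meta LR, initial parameter
for _ in range(300):
    a = rng.uniform(0.5, 1.5)          # sample a task: slope of y = a*x
    xs, xq = rng.normal(size=10), rng.normal(size=10)
    # Inner loop: one gradient step on the support set
    _, g = task_loss_grad(theta, xs, a * xs)
    theta_i = theta - alpha * g
    # First-order meta-update: query-set gradient at the adapted parameters
    _, gq = task_loss_grad(theta_i, xq, a * xq)
    theta -= beta * gq
```

The learned initialization settles near the center of the task distribution (slope 1.0), from which one inner step reaches any sampled task quickly.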
Prototypical Networks
Learn embedding space where examples cluster by class. N-way K-shot: compute prototype (mean) per class, classify by nearest neighbor in embedding space.
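A minimal numerical episode, taking the embedding f as the identity for brevity (cluster locations and shot counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)

# 3-way 5-shot episode: class c clusters around (3c, 3c) in 2-D.
support = {c: rng.normal(loc=c * 3.0, scale=0.5, size=(5, 2)) for c in range(3)}
prototypes = np.stack([support[c].mean(axis=0) for c in range(3)])  # class means

def classify(query):
    """Assign the query to the nearest class prototype."""
    d = ((prototypes - query) ** 2).sum(axis=1)
    return int(d.argmin())

# A query drawn near class 1's cluster
query = rng.normal(loc=3.0, scale=0.5, size=2)
label = classify(query)
```

In the real method f is a learned network, trained so that this nearest-prototype rule works across many sampled episodes.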
Relation Networks
Learn similarity metric between query and support examples. Relation network outputs similarity score. Learns task-relevant distance metric (vs. fixed metric like Euclidean).
Applications
Few-shot classification, few-shot regression, rapid domain adaptation. Meta-learning lets a model treat each incoming new task as a learning problem in itself (learning to learn).
Feature Reuse and Representation Learning
Hierarchy of Features
Deep networks learn hierarchical representations. Early layers: edges, colors (pixel-level). Middle layers: textures, parts. Late layers: objects, concepts. Hierarchy useful across tasks: edge detection useful for most vision tasks.
What Transfers Across Tasks?
Early features (low-level) transfer universally. CNNs trained on ImageNet transfer to medical imaging, satellite imagery. Task-specific features transfer weakly: object recognition features are unhelpful for texture classification.
Universal Representations
BERT (natural language): pre-trained on massive text corpus, fine-tuned on downstream tasks (classification, NER, QA). Representations learned capture linguistic structure useful across NLP tasks.
ImageNet: 1.2M images, 1000 classes. Training time: weeks on high-end GPUs. Learned features proved so broadly useful that ImageNet became the standard transfer source across computer vision.
Feature Extraction vs. Fine-Tuning
Feature Extraction: Freeze pre-trained weights, use as fixed feature extractor. Train only final classification layer. Fast, requires little target data. Works when target distribution similar.
Fine-Tuning: Update all weights on target task. Slower, requires more target data. Better final accuracy when source/target very different or target large.
Mixing Strategies
Freeze early layers, fine-tune late layers. Trade-off between preserving source knowledge and adapting to target. Layer-wise learning rates: earlier layers smaller learning rates (preserve pre-trained generic features), later layers higher rates (adapt to the task faster).
Negative Transfer and Domain Mismatch
When Transfer Fails
Source knowledge sometimes harmful. Learning target domain alone outperforms with source transfer. Occurs when: (1) source and target very different, (2) source noisy/poor quality, (3) bad fine-tuning choices.
Quantifying Domain Distance
Theory quantifies domain distance via the H-divergence between source and target. Large divergence means harder transfer; negative transfer is likely when the divergence exceeds what the adaptation method can bridge.
Detecting and Preventing Negative Transfer
- Validation monitor: Track validation accuracy. If decreases during fine-tuning, reduce learning rate or freeze more layers.
- Source quality: Poor source model transfers poorly. Verify source performance on source domain first.
- Task relevance: Ensure source task conceptually related to target. Random source worse than useful source.
- Fine-tuning strategy: Conservative approach (freeze early layers, fine-tune late only) safer than aggressive updating.
Selective Transfer
Learn which source knowledge to transfer. Weight individual layers by domain similarity. Some layers useful, others harmful. Learned weighting discovers beneficial subsets.
Applications
Computer Vision
ImageNet pre-training nearly universal for vision tasks. Object detection, segmentation, instance segmentation all use pre-trained backbones. Fine-tuning on domain-specific data (medical, aerial, satellite) enables rapid model development.
Natural Language Processing
ELMo, BERT, GPT-2, GPT-3: pre-training on massive corpora. Fine-tuning achieves state-of-the-art on classification, named entity recognition, machine translation, question answering with modest labeled data.
Medical Imaging
Limited labeled medical images (expensive annotation by radiologists). Transfer from ImageNet, then fine-tune on medical images. Or: pre-train on unlabeled medical images (contrastive learning), fine-tune on labeled. Dramatically improves diagnostic accuracy.
Robotics
Learning manipulation tasks. Sim-to-real transfer: policies trained in simulation transferred to real robots. Addresses sample efficiency (simulation cheap, real expensive). Domain randomization bridges sim-real gap.
Domain-Specific Applications
Sentiment analysis: models pre-trained on reviews, fine-tuned on target domain (social media, product feedback). Fraud detection: transfer from public datasets, fine-tune on bank's proprietary transactions.
References
- Bengio, Y. "Deep Learning of Representations for Unsupervised and Transfer Learning." ICML Workshop, 2011.
- Finn, C., Abbeel, P., and Levine, S. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." ICML, 2017.
- Ganin, Y., Ustinova, E., Ajakan, H., et al. "Domain-Adversarial Training of Neural Networks." JMLR, vol. 17, 2016.
- Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL, 2019.
- Snell, J., Swersky, K., and Zemel, R. S. "Prototypical Networks for Few-shot Learning." NIPS, 2017.