Introduction
Reinforcement learning (RL) is learning to maximize cumulative reward through interaction with an environment. Unlike supervised learning, which has target labels, RL receives only scalar rewards and must discover which actions led to good outcomes. The agent takes an action, observes the new state and reward, and learns a policy mapping states to actions that maximizes long-term reward.
Core insight: the learning problem decomposes into a sequence of decisions. Each decision's value depends not just on immediate reward but on its consequences for future rewards. Temporal credit assignment: which actions deserve credit for distant future rewards?
Applications: game playing (AlphaGo defeated Lee Sedol using RL + MCTS), robot control (learning manipulation tasks), autonomous vehicles, resource optimization, recommendation systems. Challenges: sample efficiency (learning from limited experience), exploration (discovering effective strategies), stability (training can diverge).
"Reinforcement learning is learning what to do, how to map situations to actions, so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them." -- Richard Sutton, Reinforcement Learning pioneer
Problem Formulation: Markov Decision Processes
MDP Definition
Formalize the RL problem as a Markov Decision Process (MDP): a tuple (S, A, P, R, gamma). S = state space. A = action space. P = transition probability P(s'|s,a). R = reward function R(s,a,s'). gamma = discount factor in [0, 1] weighting future rewards.
Markov property: future depends only on current state, not history. State summarizes all relevant past information.
MDP Components:
- State s: representation of environment (position, velocity, board configuration)
- Action a: agent's choice (move, jump, push)
- Transition: P(s'|s,a) probability of reaching s' from s via a
- Reward: R(s,a,s') immediate numerical feedback
- Discount gamma: 0 < gamma < 1, future rewards worth less than immediate
Return and Objective
Return G_t = sum of discounted future rewards: R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ... Objective: find policy pi(a|s) maximizing expected return E[G_t].
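As a concrete check of the return definition, a minimal sketch (the reward sequence and gamma below are invented for illustration):

```python
def discounted_return(rewards, gamma):
    """G_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ..."""
    g = 0.0
    for r in reversed(rewards):   # fold from the last reward backward
        g = r + gamma * g
    return g

g0 = discounted_return([1.0, 0.0, 2.0], 0.9)   # 1 + 0.9*0 + 0.9**2 * 2 = 2.62
```

Folding backward avoids recomputing powers of gamma for each term.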
Episodic vs. Continuing Tasks
Episodic: interaction terminates (chess game ends, episode completes). Continuing: infinite horizon (robot walking). For continuing, discount factor prevents infinite return.
Value Functions and Bellman Equations
State Value Function
V(s) = expected return from state s under policy pi. Answers: "How good is this state?"
V_pi(s) = E[G_t | S_t = s] = E[R_t + gamma*V_pi(s') | S_t = s]
Can be computed recursively: the value of a state is the immediate reward plus the discounted value of the next state.
Action Value Function (Q-function)
Q(s,a) = expected return from taking action a in state s then following policy pi. Answers: "How good is this action in this state?"
Q_pi(s,a) = E[G_t | S_t = s, A_t = a] = E[R_t + gamma*V_pi(s') | S_t = s, A_t = a]
Key difference: V depends on the state only; Q depends on state AND action.
Bellman Equations
Recursive relationships enabling computation. State value decomposes into immediate reward plus discounted value of successor:
V_pi(s) = sum_a pi(a|s) * sum_{s',r} P(s',r|s,a) * [r + gamma*V_pi(s')]
Q_pi(s,a) = sum_{s',r} P(s',r|s,a) * [r + gamma*sum_{a'} pi(a'|s') * Q_pi(s',a')]
Optimal value: V*(s) = max_a E[R_t + gamma*V*(s') | S_t = s, A_t = a]
Policy Evaluation
Given policy, iteratively compute V: V_{k+1}(s) = sum_a pi(a|s) sum_{s',r} P(s',r|s,a) [r + gamma*V_k(s')]. Converges to true V_pi.
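The iterative scheme above is a few lines of code on a small model. A minimal sketch, on an invented two-state, two-action MDP:

```python
# Iterative policy evaluation on a tiny invented MDP, under the uniform
# random policy. P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
pi = {s: {0: 0.5, 1: 0.5} for s in P}   # pi(a|s), uniform

def policy_evaluation(P, pi, gamma, tol=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

V = policy_evaluation(P, pi, gamma=0.9)
# Expected reward is 0.5 per step from either state, so V_pi[s] = 0.5/(1-0.9) = 5
```

Sweeps repeat until the largest value change falls below the tolerance.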
Dynamic Programming Methods
Policy Iteration
Alternately evaluate and improve policy until convergence:
1. Policy Evaluation: compute V for current policy
2. Policy Improvement: pi'(s) = argmax_a Q(s,a)
3. Repeat until policy stable
Guaranteed convergence to the optimal policy.
Drawback: requires full model P(s',r|s,a). Most RL problems don't have an explicit model.
Value Iteration
Combine evaluation and improvement in single step:
V_{k+1}(s) = max_a sum_{s',r} P(s',r|s,a) * [r + gamma*V_k(s')]
Often more efficient than policy iteration. Converges to the optimal V*.
When DP Applies
DP requires: (1) full knowledge of environment model, (2) computational capacity to process all states. Neither typically true in practice. RL methods relax these requirements.
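When the model is known and the state space is small, the value iteration update above is straightforward. A minimal sketch on an invented two-state MDP:

```python
# Value iteration on a tiny invented MDP: two states, two actions.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}

def value_iteration(P, gamma, tol=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

V = value_iteration(P, 0.9)
# Optimal policy always takes action 1 (reward 1 per step): V*[s] = 1/(1-0.9) = 10
```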
Temporal Difference Learning
Core Idea
Learn from incomplete episodes. Don't wait for the episode to end; update values using bootstrapping: estimate future value from the current V estimate. Combines advantages of DP (bootstrapping) and Monte Carlo (model-free, learns from experience).
TD(0) Algorithm
For each step:
delta = R_t + gamma*V(S_{t+1}) - V(S_t)   [TD error]
V(S_t) <- V(S_t) + alpha*delta            [update toward bootstrapped target]
Delta measures how far the current estimate was from (reward + discounted next value).
Advantages Over Monte Carlo
MC must wait for the episode to end before updating. TD updates immediately using a bootstrapped value. Reduces variance (uses an estimate rather than the full return) at the cost of introducing bias. Can learn online, from incomplete episodes.
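A minimal TD(0) sketch: the chain environment, gamma, and alpha below are invented toy values.

```python
# TD(0) on a deterministic chain 0 -> 1 -> 2 -> terminal, with reward 1
# on the final transition. Undiscounted (gamma = 1) episodic task.
gamma, alpha = 1.0, 0.1
V = [0.0, 0.0, 0.0, 0.0]   # index 3 is the terminal state; its value stays 0

for episode in range(500):
    for s in range(3):
        s_next = s + 1
        r = 1.0 if s_next == 3 else 0.0
        delta = r + gamma * V[s_next] - V[s]   # TD error
        V[s] += alpha * delta                  # move toward bootstrapped target
# V[0], V[1], V[2] all approach 1: every state's eventual return is 1
```

Each state's estimate is pulled toward its successor's estimate, so the terminal reward propagates backward one state per round of updates.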
TD Lambda and Eligibility Traces
TD(lambda) blends n-step returns. Lambda=0 is TD(0); lambda=1 approaches Monte Carlo. Eligibility traces track which states were visited recently, useful for credit assignment in long episodes.
e(s) <- lambda*gamma*e(s) + 1   [eligibility decays each step, increments on visit]
V(s) <- V(s) + alpha*delta*e(s) [update proportional to eligibility]
Q-Learning and Deep Q-Networks
Q-Learning Algorithm
Off-policy method: learn optimal policy while following exploratory policy. Uses greedy bootstrap estimate:
Q(S_t,A_t) <- Q(S_t,A_t) + alpha*[R_t + gamma*max_a Q(S_{t+1},a) - Q(S_t,A_t)]
The bracketed target uses the greedy (optimal) action value. Off-policy: the agent explores (takes random actions), but the update target uses max Q (the optimal action). Decouples the behavior policy (exploration) from the target policy (greedy).
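A minimal tabular Q-learning sketch of this update; the chain environment and all constants (gamma, alpha, epsilon) are invented toy values.

```python
import random

# Tabular Q-learning with epsilon-greedy exploration on a 4-state chain:
# action 1 moves right (reward 1 on reaching terminal state 3), action 0 stays.
random.seed(0)
n_states, n_actions = 4, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
gamma, alpha, eps = 0.9, 0.1, 0.2

def step(s, a):
    """Environment: returns (next_state, reward, done)."""
    if a == 1:
        return s + 1, (1.0 if s + 1 == 3 else 0.0), s + 1 == 3
    return s, 0.0, False

for episode in range(2000):
    s, done, t = 0, False, 0
    while not done and t < 50:
        if random.random() < eps:                       # behavior policy explores
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda i: Q[s][i])
        s2, r, done = step(s, a)
        target = r + (0.0 if done else gamma * max(Q[s2]))  # greedy bootstrap
        Q[s][a] += alpha * (target - Q[s][a])
        s, t = s2, t + 1
# The greedy policy becomes "move right" in every state; Q[2][1] approaches 1.0
```

Note the decoupling: actions are chosen epsilon-greedily, but the target always takes max over Q(S_{t+1}, .).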
Convergence and Guarantees
Converges to optimal Q* under conditions: (1) all states visited infinitely often, (2) learning rate decays properly. For practical problems with function approximation, convergence not guaranteed but often works.
Deep Q-Networks (DQN)
Use neural network to approximate Q(s,a). Network takes state, outputs action values. Enables scaling to high-dimensional states (images, complex observations).
DQN Innovations
| Technique | Problem Solved | Mechanism |
|---|---|---|
| Experience Replay | Correlation between samples | Store transitions, sample minibatch from replay buffer |
| Target Network | Non-stationary targets (moving goalpost) | Separate network for bootstrap target, update periodically |
| Dueling DQN | Separation of value and advantage | Network outputs V(s) and A(s,a), combines as Q = V + (A - mean(A)) |
| Double DQN | Overestimation of Q-values | Decouple action selection from evaluation |
Limitations
DQN struggles with: continuous actions (Q(s,a) requires discrete action space), off-policy inefficiency (data from old policy), sample complexity (requires millions of interactions).
Policy Gradient Methods
Core Concept
Instead of learning V or Q, directly learn a parameterized policy pi_theta(a|s). Parameterize the policy with a neural network (parameters theta). Gradient ascent on expected return:
theta <- theta + alpha*nabla_theta E[G_t]
Policy gradient theorem: nabla_theta E[G_t] = E[nabla_theta log pi_theta(A_t|S_t) * Q(S_t,A_t)]
Actions with higher Q(s,a) have their gradients reinforced; lower-Q actions are suppressed.
REINFORCE Algorithm
For each trajectory:
  For each step t:
    theta <- theta + alpha*nabla_theta log pi_theta(A_t|S_t) * G_t
Uses the actual return G_t as the credit signal. High-variance but unbiased.
Advantages Over Q-Learning
Handles continuous actions naturally. On-policy (follows learned policy). More stable when trained properly. Directly optimizes objective (expected return).
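A minimal REINFORCE sketch, on a 2-armed bandit treated as a one-state, one-step MDP; the arm rewards and alpha are invented toy values.

```python
import math, random

# REINFORCE with a softmax policy on a 2-armed bandit: each "episode" is one
# action, so the return G is just the immediate reward.
random.seed(1)
theta = [0.0, 0.0]            # one preference per action
alpha = 0.1
arm_reward = [1.0, 0.0]       # arm 0 is strictly better

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

for episode in range(2000):
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]
    G = arm_reward[a]
    for i in range(2):
        # d/d theta_i of log softmax: indicator(i == a) - probs[i]
        theta[i] += alpha * ((1.0 if i == a else 0.0) - probs[i]) * G
# softmax(theta)[0] climbs toward 1: the policy concentrates on the better arm
```

Sampled actions with high return get their log-probability pushed up, exactly the update rule above.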
Actor-Critic Baseline
Reduce variance by subtracting baseline b(s) (typically V(s)):
theta <- theta + alpha*nabla_theta log pi_theta(A_t|S_t) * (G_t - V(S_t))
Advantage A(s,a) = G_t - V(s) shows whether the action was better or worse than average. Reduces variance without adding bias.
Actor-Critic Methods
Architecture
Two networks: Actor (policy pi) and Critic (value V). Actor selects actions, Critic evaluates. Critic's value estimates guide actor's learning.
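A minimal tabular sketch of this actor-critic loop (tables stand in for the two networks; the chain environment and all constants are invented toy values):

```python
import math, random

# One-step actor-critic on a 4-state chain: action 1 moves right (reward 1
# on reaching terminal state 3), action 0 stays in place.
random.seed(2)
n_states = 4
theta = [[0.0, 0.0] for _ in range(n_states)]   # actor: softmax preferences
V = [0.0] * n_states                            # critic: tabular state values
gamma, alpha_pi, alpha_v = 0.9, 0.1, 0.2

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

for episode in range(3000):
    s, done, t = 0, False, 0
    while not done and t < 50:
        probs = softmax(theta[s])
        a = random.choices([0, 1], weights=probs)[0]
        s2 = s + 1 if a == 1 else s
        r = 1.0 if s2 == 3 else 0.0
        done = s2 == 3
        delta = r + (0.0 if done else gamma * V[s2]) - V[s]  # critic's TD error
        V[s] += alpha_v * delta                              # critic update
        for i in range(2):                                   # actor update
            theta[s][i] += alpha_pi * ((1.0 if i == a else 0.0) - probs[i]) * delta
        s, t = s2, t + 1
# The actor learns to prefer action 1 (move right) in every nonterminal state
```

The critic's TD error serves as the advantage signal for the actor, matching the update rules stated next.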
Actor: parameters theta_pi, policy pi_theta(a|s)
Critic: parameters theta_v, value V_theta(s)
TD error: delta = R_t + gamma*V(S_{t+1}) - V(S_t)
Actor update: theta_pi <- theta_pi + alpha_pi * nabla_theta log pi(A_t|S_t) * delta
Critic update: theta_v <- theta_v + alpha_v * delta * nabla_theta_v V(S_t)  [gradient step on the squared TD error]
Advantage Algorithms (A3C, PPO)
Asynchronous Advantage Actor-Critic: parallel actors accumulate experience, periodically update shared network. Reduces correlation, stabilizes training.
Proximal Policy Optimization (PPO): clipped objective prevents large policy changes:
L(theta) = E[min(r_t * A_t, clip(r_t, 1-epsilon, 1+epsilon) * A_t)]
r_t = pi_new(a|s) / pi_old(a|s)   [probability ratio]
Prevents excessively large updates. Popular, sample-efficient, stable.
Trust Region Methods
TRPO (Trust Region Policy Optimization): constrain policy updates to trust region where old approximation valid. Guarantees monotonic improvement. Computationally expensive but theoretically sound.
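Returning to PPO's clipped objective, the clipping arithmetic for a single sample can be sketched; the ratio and advantage values below are invented to show both cases.

```python
# Clipped surrogate term for one (state, action) sample.
def ppo_clip_term(ratio, advantage, epsilon=0.2):
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1 + epsilon
good = ppo_clip_term(1.5, 2.0)    # min(1.5*2.0, 1.2*2.0) = 2.4
# Negative advantage: the pessimistic (more negative) term is kept
bad = ppo_clip_term(0.5, -1.0)    # min(0.5*-1.0, 0.8*-1.0) = -0.8
```

In both directions the min picks the more conservative term, so a single minibatch cannot move the policy far from pi_old.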
Exploration-Exploitation Trade-off
Core Dilemma
Exploitation: use current best-known action. Exploration: try unknown actions seeking better rewards. Pure exploitation misses better strategies. Pure exploration wastes time on suboptimal actions. Balance critical for learning.
Epsilon-Greedy
With probability epsilon, take random action (explore). With probability 1-epsilon, take greedy action (exploit). Simple, effective. Epsilon typically decays over time.
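The rule fits in a few lines; the Q-values passed in below are invented for illustration.

```python
import random

def epsilon_greedy(q_values, eps):
    """Pick a random action with probability eps, else the greedy one."""
    if random.random() < eps:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

action = epsilon_greedy([0.1, 0.9, 0.3], 0.1)   # usually returns 1, the argmax
```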
Upper Confidence Bound (UCB)
Explore actions with high uncertainty. Select action a = argmax_a [Q(a) + c*sqrt(ln(N)/N(a))]
First term: exploitation (high Q value). Second term: exploration (high uncertainty, N(a) = times action tried). Balances automatically based on value and uncertainty.
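A minimal UCB1 sketch on a 2-armed Bernoulli bandit; the arm means (0.8 vs 0.2) and the exploration constant c = 1 are invented toy values.

```python
import math, random

random.seed(3)
means = [0.8, 0.2]
counts = [0, 0]          # N(a): times each arm was tried
values = [0.0, 0.0]      # running mean reward per arm

for n in range(1, 2001):
    if 0 in counts:      # play each arm once before applying the UCB rule
        a = counts.index(0)
    else:
        a = max(range(2),
                key=lambda i: values[i] + 1.0 * math.sqrt(math.log(n) / counts[i]))
    r = 1.0 if random.random() < means[a] else 0.0
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]   # incremental mean update
# counts[0] ends far larger than counts[1]: play concentrates on the better arm
```

The bonus term shrinks as an arm's count grows, so under-tried arms keep getting revisited until their uncertainty is resolved.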
Thompson Sampling
Maintain a distribution over Q-values (or arm means). Sample from the distribution, act greedily w.r.t. the sample. Naturally balances exploration and exploitation, with strong theoretical regret guarantees.
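A minimal Thompson sampling sketch with Beta posteriors on a 2-armed Bernoulli bandit; the arm means (0.7 vs 0.3) are invented toy values.

```python
import random

random.seed(4)
means = [0.7, 0.3]
posterior = [[1.0, 1.0], [1.0, 1.0]]   # Beta(alpha, beta) per arm, uniform prior
counts = [0, 0]

for _ in range(2000):
    samples = [random.betavariate(a, b) for a, b in posterior]
    arm = samples.index(max(samples))   # act greedily w.r.t. the sampled values
    r = 1 if random.random() < means[arm] else 0
    posterior[arm][0] += r              # success -> increment alpha
    posterior[arm][1] += 1 - r          # failure -> increment beta
    counts[arm] += 1
# counts[0] dominates: the posterior concentrates play on the better arm
```

Exploration falls out automatically: a rarely tried arm has a wide posterior, so it occasionally produces the largest sample and gets played.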
Curiosity-Driven Exploration
Intrinsic motivation: reward prediction error. The agent receives bonuses for states where its model is uncertain (high prediction error). Discovers interesting states, accelerates learning in sparse-reward environments.
Applications and Breakthroughs
Game Playing
AlphaGo (2016): a combination of deep networks, Monte Carlo tree search, and RL beat Lee Sedol at Go, a game with roughly 10^170 legal positions. AlphaZero later learned chess, shogi, and Go from scratch, surpassing human champions.
Robot Control
Learn manipulation (reaching, grasping, in-hand reorientation). Learning from sim-to-real: policy trained in simulation transfers to real robots (sometimes). Challenges: sample efficiency, safety.
Autonomous Vehicles
RL for motion planning, trajectory optimization, interaction with other agents. Most systems use RL combined with classical planning.
Resource Optimization
Data center cooling, network routing, power systems. RL learns adaptive control policies, often outperforming hand-tuned heuristics.
Recommendation Systems
Bandit algorithms (contextual bandits) balance exploration and exploitation. Learn which items to recommend, treating user feedback as reward signal.
Challenges and Open Problems
Sample Efficiency
Deep RL typically requires millions of interactions. Humans learn from dozens. Challenges: high-dimensional action spaces, long horizons requiring planning, sparse rewards. Research: model-based RL, imitation learning, meta-learning.
Reward Specification
Designing reward function difficult. Rewards should incentivize desired behavior but often encode unintended solutions (reward hacking). Example: agent exploits glitch for high score rather than intended gameplay.
Non-Stationary Environments
Real environments change. Agent must adapt continually. Catastrophic forgetting: learning new task erases old knowledge. Research: multi-task learning, continual learning.
Exploration in Complex Spaces
Random exploration fails in high-dimensional spaces. Needles-in-haystack problem: good rewards extremely rare. Solutions: better intrinsic motivation, hierarchical RL, curriculum learning.
Partial Observability
Agent sees partial state (observations not full Markov state). Requires memory (RNNs). POMDPs harder than MDPs. Theoretical understanding incomplete.
Generalization
Policies trained on one domain often fail on similar variants. Transfer learning in RL underdeveloped. Real progress requires better abstraction, hierarchical representations.
References
- Sutton, R. S., and Barto, A. G. "Reinforcement Learning: An Introduction." MIT Press, 2nd edition, 2018.
- Mnih, V., Kavukcuoglu, K., Silver, D., et al. "Playing Atari with Deep Reinforcement Learning." NIPS Deep Learning Workshop, 2013.
- Silver, D., Schrittwieser, J., Simonyan, K., et al. "Mastering the Game of Go without Human Knowledge." Nature, vol. 550, 2017.
- Schulman, J., Wolski, F., Dhariwal, P., et al. "Proximal Policy Optimization Algorithms." arXiv:1707.06347, 2017.
- Lillicrap, T., Hunt, J. J., Pritzel, A., et al. "Continuous Control with Deep Reinforcement Learning." ICLR, 2016.