Definition and Scope
Overview
Reinforcement Learning (RL): area of machine learning focused on agents learning optimal actions by maximizing cumulative reward. Interaction-based: agent observes state, takes action, receives feedback. Model-free or model-based approaches. Goal: learn policy mapping states to actions for maximum return.
Core Elements
Agent: learner or decision maker. Environment: external system reacting to actions. State: environment snapshot at time t. Action: choice made by agent. Reward: scalar feedback signal. Policy: strategy used by agent. Value function: estimate of expected cumulative reward (return).
Scope
Applications: robotics, game playing, autonomous systems, finance. Relations: overlaps with control theory, supervised learning, unsupervised learning. Distinguished by its focus on sequential decision-making and optimization of delayed rewards.
"Reinforcement learning enables agents to achieve goals in complex, uncertain environments by learning from experience." -- Richard S. Sutton & Andrew G. Barto
Markov Decision Processes
Definition
MDP: formal framework modeling RL problems. Components: set of states S, set of actions A, transition probability P(s'|s,a), reward function R(s,a), discount factor γ ∈ [0,1]. Markov property: future state depends only on current state and action.
Mathematical Formulation
At time t: state s_t, action a_t, next state s_{t+1} ~ P(·|s_t,a_t), reward r_t = R(s_t,a_t). Objective: maximize expected discounted return G_t = ∑_{k=0}^∞ γ^k r_{t+k}.
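The return above can be computed backwards from the end of a finite reward sequence, since G_t = r_t + γ G_{t+1}. A minimal sketch (reward values and γ chosen arbitrarily for illustration):

```python
# Discounted return G_t = sum_k gamma^k * r_{t+k} for a finite reward sequence.
def discounted_return(rewards, gamma):
    g = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0]  # r_t, r_{t+1}, r_{t+2} (illustrative values)
gamma = 0.5
g = discounted_return(rewards, gamma)  # 1 + 0.5*0 + 0.25*2 = 1.5
```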
Types of MDPs
Finite vs. infinite state/action spaces. Episodic vs. continuing tasks. Deterministic or stochastic transitions. Partially observable MDPs (POMDPs): states not fully visible, require belief states.
Value Functions
State-Value Function (V)
Definition: expected return starting from state s following policy π. Expressed as V^π(s) = E_π[G_t | s_t = s]. Measures desirability of states under policy.
Action-Value Function (Q)
Definition: expected return starting from state s, taking action a, thereafter following π. Q^π(s,a) = E_π[G_t | s_t=s, a_t=a]. Core to many algorithms for policy improvement.
Bellman Equations
Recursive relationships defining value functions. For V^π: V^π(s) = ∑_a π(a|s) ∑_{s',r} P(s',r|s,a)[r + γ V^π(s')]. For optimal values V*, Q* use max over actions.
| Value Function | Definition |
|---|---|
| State-Value V^π(s) | Expected return starting at state s under policy π |
| Action-Value Q^π(s,a) | Expected return starting at state s, taking action a, then following π |
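The Bellman equation for V^π can be solved by repeated backups (iterative policy evaluation). A sketch on a hypothetical two-state MDP with made-up transitions and rewards, evaluating a uniform random policy:

```python
# Iterative policy evaluation: apply the Bellman expectation backup
# V(s) <- sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a) + gamma V(s')]
# until no state's value changes by more than a small tolerance.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
# P[s][a] = list of (prob, next_state); R[s][a] = reward. Numbers are made up.
P = {0: {0: [(1.0, 0)], 1: [(1.0, 1)]},
     1: {0: [(1.0, 0)], 1: [(1.0, 1)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}  # uniform random policy

def evaluate_policy(pi, tol=1e-8):
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v = sum(pi[s][a] * sum(p * (R[s][a] + GAMMA * V[s2])
                                   for p, s2 in P[s][a])
                    for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

V = evaluate_policy(pi)  # fixed point: V[0] = 7.25, V[1] = 7.75
```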
Policy and Policy Optimization
Policy Types
Deterministic: maps states to single actions. Stochastic: maps states to probability distributions over actions. Parametric policies often optimized directly.
Policy Evaluation
Process of computing value functions for a given policy. Methods: iterative policy evaluation, Monte Carlo estimation, temporal difference learning.
Policy Improvement
Using value functions to derive improved policies. Policy iteration alternates evaluation and improvement until convergence. Guarantees monotonic improvement under certain conditions.
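The improvement step is simple once action-values are available: act greedily with respect to Q. A sketch with hypothetical state/action names and hand-picked Q-values (not learned):

```python
# Policy improvement: make the policy greedy with respect to current
# action-values, pi'(s) = argmax_a Q(s, a).
Q = {
    "s0": {"left": 0.2, "right": 0.7},
    "s1": {"left": 0.9, "right": 0.1},
}

def greedy_policy(Q):
    # For each state, pick the action with the highest Q-value.
    return {s: max(actions, key=actions.get) for s, actions in Q.items()}

pi = greedy_policy(Q)  # {"s0": "right", "s1": "left"}
```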
Exploration vs. Exploitation
Trade-Off
Exploration: trying new actions to discover better rewards. Exploitation: selecting known best actions to maximize reward. Balancing critical for effective learning.
Strategies
Epsilon-greedy: selects random action with probability ε, best action otherwise. Softmax: probabilistic selection weighted by action values. Upper Confidence Bound (UCB): optimism in face of uncertainty.
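The first two strategies can be sketched in a few lines; the Q-values below are arbitrary illustrative numbers:

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    # With probability epsilon explore uniformly at random; otherwise exploit.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax_probs(q_values, temperature=1.0):
    # Boltzmann/softmax selection: higher temperature => closer to uniform.
    exps = [math.exp(q / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

q = [0.1, 0.5, 0.2]
best = epsilon_greedy(q, epsilon=0.0)  # epsilon 0 => pure exploitation: index 1
probs = softmax_probs(q)               # valid distribution favoring index 1
```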
Impact on Convergence
Insufficient exploration: suboptimal policies. Excessive exploration: slow convergence. Optimal schedules often decay exploration rate over time.
Core Learning Algorithms
Dynamic Programming (DP)
Requires full knowledge of MDP. Uses Bellman backups to compute value functions. Examples: value iteration, policy iteration. Complexity: scales poorly with large state spaces.
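Value iteration, as a concrete DP example, repeatedly applies the Bellman optimality backup. A sketch on a hypothetical two-state MDP with made-up dynamics:

```python
# Value iteration: V(s) <- max_a sum_s' P(s'|s,a) [R(s,a) + gamma V(s')],
# swept until the largest change falls below a tolerance.
GAMMA = 0.9
# P[s][a] = list of (prob, next_state); R[s][a] = reward. Numbers are made up.
P = {0: {0: [(1.0, 0)], 1: [(1.0, 1)]},
     1: {0: [(1.0, 0)], 1: [(1.0, 1)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}

def value_iteration(tol=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = max(sum(p * (R[s][a] + GAMMA * V[s2]) for p, s2 in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

V = value_iteration()  # optimal policy always picks action 1: V[0]=19, V[1]=20
```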
Monte Carlo Methods
Learn from complete episodes without model knowledge. Estimate returns by averaging sampled episodes. Suitable for episodic tasks. Requires episodes to terminate.
Temporal Difference (TD) Learning
Combines Monte Carlo and DP advantages. Updates estimates based on bootstrapping from current estimate. Examples: TD(0), SARSA, Q-learning. Model-free and online.
| Algorithm | Characteristics |
|---|---|
| Value Iteration | Dynamic Programming, requires MDP model, iterative Bellman updates |
| Monte Carlo | Model-free, episodic, averages sampled returns |
| Q-learning | Model-free, off-policy TD control algorithm |
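TD(0)'s bootstrapped update can be sketched on a hypothetical three-state chain (0 → 1 → 2-terminal, +1 reward on the final transition) under a fixed policy, so the target is V^π:

```python
# TD(0) prediction: V(s) <- V(s) + alpha [r + gamma V(s') - V(s)].
# The next-state estimate V(s') is the bootstrap; no model of P or R is used.
GAMMA = 1.0
ALPHA = 0.1

def episode():
    # (state, reward, next_state) transitions of one deterministic rollout.
    return [(0, 0.0, 1), (1, 1.0, 2)]

def td0(num_episodes=2000):
    V = {0: 0.0, 1: 0.0, 2: 0.0}  # terminal state's value stays 0
    for _ in range(num_episodes):
        for s, r, s2 in episode():
            V[s] += ALPHA * (r + GAMMA * V[s2] - V[s])
    return V

V = td0()  # both non-terminal values converge toward the true return 1.0
```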
Function Approximation
Motivation
Exact representation infeasible in large/continuous state or action spaces. Approximate value functions or policies using parametrized models.
Types of Approximators
Linear functions: weighted sums of features. Nonlinear functions: neural networks, decision trees, kernel methods. Trade-offs: bias-variance, interpretability.
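A linear approximator paired with a semi-gradient TD(0)-style update can be sketched as follows; the feature map and the single transition are hypothetical:

```python
# Linear value-function approximation: V_hat(s) = w . phi(s).
def phi(s):
    # Two illustrative features of a scalar state: a bias term and s itself.
    return [1.0, float(s)]

def v_hat(w, s):
    return sum(wi * xi for wi, xi in zip(w, phi(s)))

def semi_gradient_td0_update(w, s, r, s2, alpha=0.1, gamma=0.9):
    # w <- w + alpha [r + gamma V_hat(s') - V_hat(s)] phi(s),
    # since grad_w V_hat(s) = phi(s) for a linear approximator.
    td_error = r + gamma * v_hat(w, s2) - v_hat(w, s)
    return [wi + alpha * td_error * xi for wi, xi in zip(w, phi(s))]

w = [0.0, 0.0]
w = semi_gradient_td0_update(w, s=1, r=1.0, s2=2)  # td_error = 1.0 => w = [0.1, 0.1]
```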
Challenges
Stability: divergence risks with bootstrapping and nonlinear function approximators. Sample efficiency: more data needed. Techniques: experience replay, target networks.
Deep Reinforcement Learning
Introduction
Integration of deep neural networks with RL to handle high-dimensional inputs and complex policies. Enables end-to-end learning from raw sensory data.
Notable Algorithms
Deep Q-Networks (DQN): value-based, uses CNNs for visual inputs. Policy Gradient Methods: directly optimize parametrized policies. Actor-Critic: combines value and policy learning.
Techniques for Stability
Experience Replay: stores past transitions for randomized updates. Target Networks: stabilize Q-value targets. Reward clipping, normalization, and entropy regularization.
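A minimal replay buffer can be sketched with a bounded deque; the transition tuples here are placeholder values:

```python
import random
from collections import deque

# Experience replay: store (s, a, r, s', done) transitions and sample uniformly
# at random, breaking the temporal correlation of consecutive updates.
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.push((t, 0, 0.0, t + 1, False))
# Capacity is 3, so only the 3 most recent transitions remain.
batch = buf.sample(2)
```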
Applications
Robotics
Autonomous control, manipulation, locomotion. Learning motor skills through trial and error. Adaptation to dynamic environments.
Game Playing
Classic: chess, Go, backgammon. Recent: real-time strategy games, Atari. Achieved human-level or superhuman performance.
Finance and Operations
Portfolio optimization, algorithmic trading, resource allocation. Dynamic decision-making under uncertainty.
Challenges and Limitations
Sample Inefficiency
Large data requirements due to trial-and-error paradigm. Expensive in real-world systems where actions have cost or risk.
Stability and Convergence
Non-stationary targets, function approximation instability. Lack of guaranteed convergence in complex settings.
Exploration Difficulties
Sparse or deceptive rewards hinder effective exploration. Balancing exploration-exploitation remains an open problem.
Recent Advances
Meta Reinforcement Learning
Learning to learn: agents adapt quickly to new tasks by leveraging previous experience. Methods include model-agnostic meta-learning (MAML).
Multi-Agent Reinforcement Learning
Coordination, competition, communication among multiple agents. Applications in decentralized control and games.
Safe and Explainable RL
Incorporating constraints and interpretability into RL policies. Ensuring reliability in real-world deployment.
Key Formulas and Pseudocode
Bellman Optimality Equation
V*(s) = max_a ∑_{s',r} P(s',r|s,a) [r + γ V*(s')]
Q-learning Update Rule
Q(s_t,a_t) ← Q(s_t,a_t) + α [r_{t+1} + γ max_a Q(s_{t+1},a) - Q(s_t,a_t)]
Policy Gradient Theorem (simplified)
∇_θ J(θ) = E_π [∇_θ log π_θ(a|s) Q^π(s,a)]
Q-learning Algorithm (Pseudocode)
Initialize Q(s,a) arbitrarily
For each episode:
    Initialize state s
    Repeat until terminal:
        Choose action a from s using policy derived from Q (e.g., ε-greedy)
        Take action a, observe reward r and next state s'
        Q(s,a) ← Q(s,a) + α [r + γ max_{a'} Q(s',a') - Q(s,a)]
        s ← s'
References
- R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, 2018.
- V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, 2015, pp. 529-533.
- J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust Region Policy Optimization," ICML, 2015, pp. 1889-1897.
- L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement Learning: A Survey," J. Artificial Intelligence Research, vol. 4, 1996, pp. 237-285.
- M. Hausknecht and P. Stone, "Deep Reinforcement Learning in Parameterized Action Space," ICLR, 2016.