Definition and Scope
Overview
Reinforcement Learning (RL): area of machine learning focused on agents learning optimal actions by maximizing cumulative reward. Interaction-based: agent observes state, takes action, receives feedback. Model-free or model-based approaches. Goal: learn policy mapping states to actions for maximum return.
Core Elements
Agent: learner or decision maker. Environment: external system reacting to actions. State: environment snapshot at time t. Action: choice made by agent. Reward: scalar feedback signal. Policy: strategy used by agent. Value function: estimate of expected cumulative reward (return).
Scope
Applications: robotics, game playing, autonomous systems, finance. Relations: overlaps with control theory, supervised learning, unsupervised learning. Distinguished by its focus on sequential decision-making and optimization of delayed rewards.
"Reinforcement learning enables agents to achieve goals in complex, uncertain environments by learning from experience." -- Richard S. Sutton & Andrew G. Barto
Markov Decision Processes
Definition
MDP: formal framework modeling RL problems. Components: set of states S, set of actions A, transition probability P(s'|s,a), reward function R(s,a), discount factor γ ∈ [0,1]. Markov property: future state depends only on current state and action.
Mathematical Formulation
At time t: state s_t, action a_t, next state s_{t+1} ~ P(·|s_t,a_t), reward r_t = R(s_t,a_t). Objective: maximize expected discounted return G_t = ∑_{k=0}^∞ γ^k r_{t+k}.
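The return above can be computed backwards from the end of a finite reward sequence, since G_t = r_t + γ G_{t+1}. A minimal sketch (reward values and γ chosen arbitrarily for illustration):

```python
# Discounted return G_t = sum_k gamma^k * r_{t+k} for a finite reward sequence.
def discounted_return(rewards, gamma):
    g = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0]  # r_t, r_{t+1}, r_{t+2} (illustrative values)
gamma = 0.5
g = discounted_return(rewards, gamma)  # 1 + 0.5*0 + 0.25*2 = 1.5
```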
Types of MDPs
Finite vs. infinite state/action spaces. Episodic vs. continuing tasks. Deterministic or stochastic transitions. Partially observable MDPs (POMDPs): states not fully visible, require belief states.
Value Functions
State-Value Function (V)
Definition: expected return starting from state s following policy π. Expressed as V^π(s) = E_π[G_t | s_t = s]. Measures desirability of states under policy.
Action-Value Function (Q)
Definition: expected return starting from state s, taking action a, thereafter following π. Q^π(s,a) = E_π[G_t | s_t=s, a_t=a]. Core to many algorithms for policy improvement.
Bellman Equations
Recursive relationships defining value functions. For V^π: V^π(s) = ∑_a π(a|s) ∑_{s',r} P(s',r|s,a)[r + γ V^π(s')]. For optimal values V*, Q* use max over actions.
| Value Function | Definition |
|---|---|
| State-Value V^π(s) | Expected return starting at state s under policy π |
| Action-Value Q^π(s,a) | Expected return starting at state s, taking action a, then following π |
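The Bellman equation for V^π can be solved by repeated backups (iterative policy evaluation). A sketch on a hypothetical two-state MDP with made-up transitions and rewards, evaluating a uniform random policy:

```python
# Iterative policy evaluation: apply the Bellman expectation backup
# V(s) <- sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a) + gamma V(s')]
# until no state's value changes by more than a small tolerance.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
# P[s][a] = list of (prob, next_state); R[s][a] = reward. Numbers are made up.
P = {0: {0: [(1.0, 0)], 1: [(1.0, 1)]},
     1: {0: [(1.0, 0)], 1: [(1.0, 1)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}  # uniform random policy

def evaluate_policy(pi, tol=1e-8):
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v = sum(pi[s][a] * sum(p * (R[s][a] + GAMMA * V[s2])
                                   for p, s2 in P[s][a])
                    for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

V = evaluate_policy(pi)  # fixed point: V[0] = 7.25, V[1] = 7.75
```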
Policy and Policy Optimization
Policy Types
Deterministic: maps states to single actions. Stochastic: maps states to probability distributions over actions. Parametric policies often optimized directly.
Policy Evaluation
Process of computing value functions for a given policy. Methods: iterative policy evaluation, Monte Carlo estimation, temporal difference learning.
Policy Improvement
Using value functions to derive improved policies. Policy iteration alternates evaluation and improvement until convergence. Guarantees monotonic improvement under certain conditions.
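The improvement step is simple once action-values are available: act greedily with respect to Q. A sketch with hypothetical state/action names and hand-picked Q-values (not learned):

```python
# Policy improvement: make the policy greedy with respect to current
# action-values, pi'(s) = argmax_a Q(s, a).
Q = {
    "s0": {"left": 0.2, "right": 0.7},
    "s1": {"left": 0.9, "right": 0.1},
}

def greedy_policy(Q):
    # For each state, pick the action with the highest Q-value.
    return {s: max(actions, key=actions.get) for s, actions in Q.items()}

pi = greedy_policy(Q)  # {"s0": "right", "s1": "left"}
```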
Exploration vs. Exploitation
Trade-Off
Exploration: trying new actions to discover better rewards. Exploitation: selecting known best actions to maximize reward. Balancing critical for effective learning.
Strategies
Epsilon-greedy: selects random action with probability ε, best action otherwise. Softmax: probabilistic selection weighted by action values. Upper Confidence Bound (UCB): optimism in face of uncertainty.
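The first two strategies can be sketched in a few lines; the Q-values below are arbitrary illustrative numbers:

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    # With probability epsilon explore uniformly at random; otherwise exploit.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax_probs(q_values, temperature=1.0):
    # Boltzmann/softmax selection: higher temperature => closer to uniform.
    exps = [math.exp(q / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

q = [0.1, 0.5, 0.2]
best = epsilon_greedy(q, epsilon=0.0)  # epsilon 0 => pure exploitation: index 1
probs = softmax_probs(q)               # valid distribution favoring index 1
```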
Impact on Convergence
Insufficient exploration: suboptimal policies. Excessive exploration: slow convergence. Optimal schedules often decay exploration rate over time.
Core Learning Algorithms
Dynamic Programming (DP)
Requires full knowledge of MDP. Uses Bellman backups to compute value functions. Examples: value iteration, policy iteration. Complexity: scales poorly with large state spaces.
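Value iteration, as a concrete DP example, repeatedly applies the Bellman optimality backup. A sketch on a hypothetical two-state MDP with made-up dynamics:

```python
# Value iteration: V(s) <- max_a sum_s' P(s'|s,a) [R(s,a) + gamma V(s')],
# swept until the largest change falls below a tolerance.
GAMMA = 0.9
# P[s][a] = list of (prob, next_state); R[s][a] = reward. Numbers are made up.
P = {0: {0: [(1.0, 0)], 1: [(1.0, 1)]},
     1: {0: [(1.0, 0)], 1: [(1.0, 1)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}

def value_iteration(tol=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = max(sum(p * (R[s][a] + GAMMA * V[s2]) for p, s2 in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

V = value_iteration()  # optimal policy always picks action 1: V[0]=19, V[1]=20
```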
Monte Carlo Methods
Learn from complete episodes without model knowledge. Estimate returns by averaging sampled episodes. Suitable for episodic tasks. Requires episodes to terminate.
Temporal Difference (TD) Learning
Combines Monte Carlo and DP advantages. Updates estimates based on bootstrapping from current estimate. Examples: TD(0), SARSA, Q-learning. Model-free and online.
| Algorithm | Characteristics |
|---|---|
| Value Iteration | Dynamic Programming, requires MDP model, iterative Bellman updates |
| Monte Carlo | Model-free, episodic, averages sampled returns |
| Q-learning | Model-free, off-policy TD control algorithm |
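TD(0)'s bootstrapped update can be sketched on a hypothetical three-state chain (0 → 1 → 2-terminal, +1 reward on the final transition) under a fixed policy, so the target is V^π:

```python
# TD(0) prediction: V(s) <- V(s) + alpha [r + gamma V(s') - V(s)].
# The next-state estimate V(s') is the bootstrap; no model of P or R is used.
GAMMA = 1.0
ALPHA = 0.1

def episode():
    # (state, reward, next_state) transitions of one deterministic rollout.
    return [(0, 0.0, 1), (1, 1.0, 2)]

def td0(num_episodes=2000):
    V = {0: 0.0, 1: 0.0, 2: 0.0}  # terminal state's value stays 0
    for _ in range(num_episodes):
        for s, r, s2 in episode():
            V[s] += ALPHA * (r + GAMMA * V[s2] - V[s])
    return V

V = td0()  # both non-terminal values converge toward the true return 1.0
```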
Function Approximation
Motivation
Exact representation infeasible in large/continuous state or action spaces. Approximate value functions or policies using parametrized models.
Types of Approximators
Linear functions: weighted sums of features. Nonlinear functions: neural networks, decision trees, kernel methods. Trade-offs: bias-variance, interpretability.
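A linear approximator paired with a semi-gradient TD(0)-style update can be sketched as follows; the feature map and the single transition are hypothetical:

```python
# Linear value-function approximation: V_hat(s) = w . phi(s).
def phi(s):
    # Two illustrative features of a scalar state: a bias term and s itself.
    return [1.0, float(s)]

def v_hat(w, s):
    return sum(wi * xi for wi, xi in zip(w, phi(s)))

def semi_gradient_td0_update(w, s, r, s2, alpha=0.1, gamma=0.9):
    # w <- w + alpha [r + gamma V_hat(s') - V_hat(s)] phi(s),
    # since grad_w V_hat(s) = phi(s) for a linear approximator.
    td_error = r + gamma * v_hat(w, s2) - v_hat(w, s)
    return [wi + alpha * td_error * xi for wi, xi in zip(w, phi(s))]

w = [0.0, 0.0]
w = semi_gradient_td0_update(w, s=1, r=1.0, s2=2)  # td_error = 1.0 => w = [0.1, 0.1]
```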
Challenges
Stability: divergence risks with bootstrapping and nonlinear function approximators. Sample efficiency: more data needed. Techniques: experience replay, target networks.
Deep Reinforcement Learning
Introduction
Integration of deep neural networks with RL to handle high-dimensional inputs and complex policies. Enables end-to-end learning from raw sensory data.
Notable Algorithms
Deep Q-Networks (DQN): value-based, uses CNNs for visual inputs. Policy Gradient Methods: directly optimize parametrized policies. Actor-Critic: combines value and policy learning.
Techniques for Stability
Experience Replay: stores past transitions for randomized updates. Target Networks: stabilize Q-value targets. Reward clipping, normalization, and entropy regularization.
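A minimal replay buffer can be sketched with a bounded deque; the transition tuples here are placeholder values:

```python
import random
from collections import deque

# Experience replay: store (s, a, r, s', done) transitions and sample uniformly
# at random, breaking the temporal correlation of consecutive updates.
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.push((t, 0, 0.0, t + 1, False))
# Capacity is 3, so only the 3 most recent transitions remain.
batch = buf.sample(2)
```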
Applications
Robotics
Autonomous control, manipulation, locomotion. Learning motor skills through trial and error. Adaptation to dynamic environments.
Game Playing
Classic: chess, Go, backgammon. Recent: real-time strategy games, Atari. Achieved human-level or superhuman performance.
Finance and Operations
Portfolio optimization, algorithmic trading, resource allocation. Dynamic decision-making under uncertainty.
Challenges and Limitations
Sample Inefficiency
Large data requirements due to trial-and-error paradigm. Expensive in real-world systems where actions have cost or risk.
Stability and Convergence
Non-stationary targets, function approximation instability. Lack of guaranteed convergence in complex settings.
Exploration Difficulties
Sparse or deceptive rewards hinder effective exploration. Balancing exploration-exploitation remains an open problem.
Recent Advances
Meta Reinforcement Learning
Learning to learn: agents adapt quickly to new tasks by leveraging previous experience. Methods include model-agnostic meta-learning (MAML).
Multi-Agent Reinforcement Learning
Coordination, competition, communication among multiple agents. Applications in decentralized control and games.
Safe and Explainable RL
Incorporating constraints and interpretability into RL policies. Ensuring reliability in real-world deployment.
Key Formulas and Pseudocode
Bellman Optimality Equation
V*(s) = max_a ∑_{s',r} P(s',r|s,a) [r + γ V*(s')]
Q-learning Update Rule
Q(s_t,a_t) ← Q(s_t,a_t) + α [r_{t+1} + γ max_a Q(s_{t+1},a) - Q(s_t,a_t)]
Policy Gradient Theorem (simplified)
∇_θ J(θ) = E_π [∇_θ log π_θ(a|s) Q^π(s,a)]
Q-learning Algorithm (Pseudocode)
Initialize Q(s,a) arbitrarily
For each episode:
    Initialize state s
    Repeat until terminal:
        Choose action a from s using policy derived from Q (e.g., ε-greedy)
        Take action a, observe reward r and next state s'
        Q(s,a) ← Q(s,a) + α [r + γ max_{a'} Q(s',a') - Q(s,a)]
        s ← s'
References
- R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, 2018.
- V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, 2015, pp. 529-533.
- J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust Region Policy Optimization," ICML, 2015, pp. 1889-1897.
- L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement Learning: A Survey," J. Artificial Intelligence Research, vol. 4, 1996, pp. 237-285.
- M. Hausknecht and P. Stone, "Deep Reinforcement Learning in Parameterized Action Space," ICLR, 2016.