Introduction
Reinforcement learning (RL) is learning to maximize cumulative reward through interaction with an environment. Unlike supervised learning, which has target labels, RL receives only scalar rewards and must discover which actions led to good outcomes. The agent takes an action, observes the new state and reward, and learns a policy mapping states to actions that maximizes long-term reward.
Core insight: the learning problem decomposes into a sequence of decisions. Each decision's value depends not just on immediate reward but on its consequences for future rewards. Temporal credit assignment: which actions deserve credit for distant future rewards?
Applications: game playing (AlphaGo defeated Lee Sedol using RL + MCTS), robot control (learning manipulation tasks), autonomous vehicles, resource optimization, recommendation systems. Challenges: sample efficiency (learning from limited experience), exploration (discovering effective strategies), stability (training can diverge).
"Reinforcement learning is learning what to do, how to map situations to actions, so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them." -- Richard Sutton, Reinforcement Learning pioneer
Problem Formulation: Markov Decision Processes
MDP Definition
Formalize the RL problem as a Markov Decision Process (MDP): a tuple (S, A, P, R, gamma). S = state space. A = action space. P = transition probability P(s'|s,a). R = reward function R(s,a,s'). gamma = discount factor in [0, 1] weighting future rewards.
Markov property: future depends only on current state, not history. State summarizes all relevant past information.
MDP Components:
- State s: representation of environment (position, velocity, board configuration)
- Action a: agent's choice (move, jump, push)
- Transition: P(s'|s,a) probability of reaching s' from s via a
- Reward: R(s,a,s') immediate numerical feedback
- Discount gamma: 0 < gamma < 1, future rewards worth less than immediate
Return and Objective
Return G_t = sum of discounted future rewards: R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ... Objective: find policy pi(a|s) maximizing expected return E[G_t].
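As a concrete check of the return definition, a minimal sketch (the reward sequence and gamma below are invented for illustration):

```python
def discounted_return(rewards, gamma):
    """G_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ..."""
    g = 0.0
    for r in reversed(rewards):   # fold from the last reward backward
        g = r + gamma * g
    return g

g0 = discounted_return([1.0, 0.0, 2.0], 0.9)   # 1 + 0.9*0 + 0.9**2 * 2 = 2.62
```

Folding backward avoids recomputing powers of gamma for each term.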
Episodic vs. Continuing Tasks
Episodic: interaction terminates (chess game ends, episode completes). Continuing: infinite horizon (robot walking). For continuing, discount factor prevents infinite return.
Value Functions and Bellman Equations
State Value Function
V(s) = expected return from state s under policy pi. Answers: "How good is this state?"
V_pi(s) = E[G_t | S_t = s] = E[R_t + gamma*V_pi(s') | S_t = s]
Can be computed recursively: the value of a state is the immediate reward plus the discounted value of the next state.
Action Value Function (Q-function)
Q(s,a) = expected return from taking action a in state s then following policy pi. Answers: "How good is this action in this state?"
Q_pi(s,a) = E[G_t | S_t = s, A_t = a] = E[R_t + gamma*V_pi(s') | S_t = s, A_t = a]
Key difference: V depends on the state only; Q depends on state AND action.
Bellman Equations
Recursive relationships enabling computation. State value decomposes into immediate reward plus discounted value of successor:
V_pi(s) = sum_a pi(a|s) * sum_{s',r} P(s',r|s,a) * [r + gamma*V_pi(s')]
Q_pi(s,a) = sum_{s',r} P(s',r|s,a) * [r + gamma*sum_{a'} pi(a'|s') * Q_pi(s',a')]
Optimal value: V*(s) = max_a E[R_t + gamma*V*(s') | S_t = s, A_t = a]
Policy Evaluation
Given policy, iteratively compute V: V_{k+1}(s) = sum_a pi(a|s) sum_{s',r} P(s',r|s,a) [r + gamma*V_k(s')]. Converges to true V_pi.
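The iterative scheme above is a few lines of code on a small model. A minimal sketch, on an invented two-state, two-action MDP:

```python
# Iterative policy evaluation on a tiny invented MDP, under the uniform
# random policy. P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
pi = {s: {0: 0.5, 1: 0.5} for s in P}   # pi(a|s), uniform

def policy_evaluation(P, pi, gamma, tol=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

V = policy_evaluation(P, pi, gamma=0.9)
# Expected reward is 0.5 per step from either state, so V_pi[s] = 0.5/(1-0.9) = 5
```

Sweeps repeat until the largest value change falls below the tolerance.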
Dynamic Programming Methods
Policy Iteration
Alternately evaluate and improve policy until convergence:
1. Policy Evaluation: compute V for current policy
2. Policy Improvement: pi'(s) = argmax_a Q(s,a)
3. Repeat until policy stable
Guaranteed convergence to the optimal policy.
Drawback: requires full model P(s',r|s,a). Most RL problems don't have an explicit model.
Value Iteration
Combine evaluation and improvement in single step:
V_{k+1}(s) = max_a sum_{s',r} P(s',r|s,a) * [r + gamma*V_k(s')]
Often more efficient than policy iteration. Converges to the optimal V*.
When DP Applies
DP requires: (1) full knowledge of environment model, (2) computational capacity to process all states. Neither typically true in practice. RL methods relax these requirements.
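When the model is known and the state space is small, the value iteration update above is straightforward. A minimal sketch on an invented two-state MDP:

```python
# Value iteration on a tiny invented MDP: two states, two actions.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}

def value_iteration(P, gamma, tol=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

V = value_iteration(P, 0.9)
# Optimal policy always takes action 1 (reward 1 per step): V*[s] = 1/(1-0.9) = 10
```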
Temporal Difference Learning
Core Idea
Learn from incomplete episodes. Don't wait for the episode to end; update values using bootstrapping: estimate future value from the current V estimate. Combines advantages of DP (bootstrapping) and Monte Carlo (model-free, learns from experience).
TD(0) Algorithm
For each step:
delta = R_t + gamma*V(S_{t+1}) - V(S_t)   [TD error]
V(S_t) <- V(S_t) + alpha*delta            [update toward bootstrapped target]
Delta measures how far the current estimate was from (reward + discounted next value).
Advantages Over Monte Carlo
MC must wait for the episode to end before updating. TD updates immediately using a bootstrapped value. Reduces variance (uses an estimate rather than the full return) at the cost of introducing bias. Can learn online, from incomplete episodes.
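A minimal TD(0) sketch: the chain environment, gamma, and alpha below are invented toy values.

```python
# TD(0) on a deterministic chain 0 -> 1 -> 2 -> terminal, with reward 1
# on the final transition. Undiscounted (gamma = 1) episodic task.
gamma, alpha = 1.0, 0.1
V = [0.0, 0.0, 0.0, 0.0]   # index 3 is the terminal state; its value stays 0

for episode in range(500):
    for s in range(3):
        s_next = s + 1
        r = 1.0 if s_next == 3 else 0.0
        delta = r + gamma * V[s_next] - V[s]   # TD error
        V[s] += alpha * delta                  # move toward bootstrapped target
# V[0], V[1], V[2] all approach 1: every state's eventual return is 1
```

Each state's estimate is pulled toward its successor's estimate, so the terminal reward propagates backward one state per round of updates.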
TD Lambda and Eligibility Traces
TD(lambda) blends n-step returns. Lambda=0 is TD(0); lambda=1 approaches Monte Carlo. Eligibility traces track which states were visited recently, useful for credit assignment in long episodes.
e(s) <- lambda*gamma*e(s) + 1   [eligibility decays each step, increments on visit]
V(s) <- V(s) + alpha*delta*e(s) [update proportional to eligibility]
Q-Learning and Deep Q-Networks
Q-Learning Algorithm
Off-policy method: learn optimal policy while following exploratory policy. Uses greedy bootstrap estimate:
Q(S_t,A_t) <- Q(S_t,A_t) + alpha*[R_t + gamma*max_a Q(S_{t+1},a) - Q(S_t,A_t)]
The bracketed target uses the greedy (optimal) action value. Off-policy: the agent explores (takes random actions), but the update target uses max Q (the optimal action). Decouples the behavior policy (exploration) from the target policy (greedy).
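A minimal tabular Q-learning sketch of this update; the chain environment and all constants (gamma, alpha, epsilon) are invented toy values.

```python
import random

# Tabular Q-learning with epsilon-greedy exploration on a 4-state chain:
# action 1 moves right (reward 1 on reaching terminal state 3), action 0 stays.
random.seed(0)
n_states, n_actions = 4, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
gamma, alpha, eps = 0.9, 0.1, 0.2

def step(s, a):
    """Environment: returns (next_state, reward, done)."""
    if a == 1:
        return s + 1, (1.0 if s + 1 == 3 else 0.0), s + 1 == 3
    return s, 0.0, False

for episode in range(2000):
    s, done, t = 0, False, 0
    while not done and t < 50:
        if random.random() < eps:                       # behavior policy explores
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda i: Q[s][i])
        s2, r, done = step(s, a)
        target = r + (0.0 if done else gamma * max(Q[s2]))  # greedy bootstrap
        Q[s][a] += alpha * (target - Q[s][a])
        s, t = s2, t + 1
# The greedy policy becomes "move right" in every state; Q[2][1] approaches 1.0
```

Note the decoupling: actions are chosen epsilon-greedily, but the target always takes max over Q(S_{t+1}, .).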
Convergence and Guarantees
Converges to optimal Q* under conditions: (1) all states visited infinitely often, (2) learning rate decays properly. For practical problems with function approximation, convergence not guaranteed but often works.
Deep Q-Networks (DQN)
Use neural network to approximate Q(s,a). Network takes state, outputs action values. Enables scaling to high-dimensional states (images, complex observations).
DQN Innovations
| Technique | Problem Solved | Mechanism |
|---|---|---|
| Experience Replay | Correlation between samples | Store transitions, sample minibatch from replay buffer |
| Target Network | Non-stationary targets (moving goalpost) | Separate network for bootstrap target, update periodically |
| Dueling DQN | Separation of value and advantage | Network outputs V(s) and A(s,a), combines as Q = V + (A - mean(A)) |
| Double DQN | Overestimation of Q-values | Decouple action selection from evaluation |
Limitations
DQN struggles with: continuous actions (Q(s,a) requires discrete action space), off-policy inefficiency (data from old policy), sample complexity (requires millions of interactions).
Policy Gradient Methods
Core Concept
Instead of learning V or Q, directly learn a parameterized policy pi_theta(a|s). Parameterize the policy with a neural network (parameters theta). Gradient ascent on expected return:
theta <- theta + alpha*nabla_theta E[G_t]
Policy gradient theorem: nabla_theta E[G_t] = E[nabla_theta log pi_theta(A_t|S_t) * Q(S_t,A_t)]
Actions with higher Q(s,a) have their gradients reinforced; lower-Q actions are suppressed.
REINFORCE Algorithm
For each trajectory:
  For each step t:
    theta <- theta + alpha*nabla_theta log pi_theta(A_t|S_t) * G_t
Uses the actual return G_t as the credit signal. High-variance but unbiased.
Advantages Over Q-Learning
Handles continuous actions naturally. On-policy (follows learned policy). More stable when trained properly. Directly optimizes objective (expected return).
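A minimal REINFORCE sketch, on a 2-armed bandit treated as a one-state, one-step MDP; the arm rewards and alpha are invented toy values.

```python
import math, random

# REINFORCE with a softmax policy on a 2-armed bandit: each "episode" is one
# action, so the return G is just the immediate reward.
random.seed(1)
theta = [0.0, 0.0]            # one preference per action
alpha = 0.1
arm_reward = [1.0, 0.0]       # arm 0 is strictly better

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

for episode in range(2000):
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]
    G = arm_reward[a]
    for i in range(2):
        # d/d theta_i of log softmax: indicator(i == a) - probs[i]
        theta[i] += alpha * ((1.0 if i == a else 0.0) - probs[i]) * G
# softmax(theta)[0] climbs toward 1: the policy concentrates on the better arm
```

Sampled actions with high return get their log-probability pushed up, exactly the update rule above.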
Actor-Critic Baseline
Reduce variance by subtracting baseline b(s) (typically V(s)):
theta <- theta + alpha*nabla_theta log pi_theta(A_t|S_t) * (G_t - V(S_t))
Advantage A(s,a) = G_t - V(s) shows whether the action was better or worse than average. Reduces variance without adding bias.
Actor-Critic Methods
Architecture
Two networks: Actor (policy pi) and Critic (value V). Actor selects actions, Critic evaluates. Critic's value estimates guide actor's learning.
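A minimal tabular sketch of this actor-critic loop (tables stand in for the two networks; the chain environment and all constants are invented toy values):

```python
import math, random

# One-step actor-critic on a 4-state chain: action 1 moves right (reward 1
# on reaching terminal state 3), action 0 stays in place.
random.seed(2)
n_states = 4
theta = [[0.0, 0.0] for _ in range(n_states)]   # actor: softmax preferences
V = [0.0] * n_states                            # critic: tabular state values
gamma, alpha_pi, alpha_v = 0.9, 0.1, 0.2

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

for episode in range(3000):
    s, done, t = 0, False, 0
    while not done and t < 50:
        probs = softmax(theta[s])
        a = random.choices([0, 1], weights=probs)[0]
        s2 = s + 1 if a == 1 else s
        r = 1.0 if s2 == 3 else 0.0
        done = s2 == 3
        delta = r + (0.0 if done else gamma * V[s2]) - V[s]  # critic's TD error
        V[s] += alpha_v * delta                              # critic update
        for i in range(2):                                   # actor update
            theta[s][i] += alpha_pi * ((1.0 if i == a else 0.0) - probs[i]) * delta
        s, t = s2, t + 1
# The actor learns to prefer action 1 (move right) in every nonterminal state
```

The critic's TD error serves as the advantage signal for the actor, matching the update rules stated next.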
Actor: parameters theta_pi, policy pi_theta(a|s)
Critic: parameters theta_v, value V_theta(s)
TD error: delta = R_t + gamma*V(S_{t+1}) - V(S_t)
Actor update: theta_pi <- theta_pi + alpha_pi * nabla_theta log pi(A_t|S_t) * delta
Critic update: theta_v <- theta_v + alpha_v * delta * nabla_theta_v V(S_t)  [gradient step on the squared TD error]
Advantage Algorithms (A3C, PPO)
Asynchronous Advantage Actor-Critic: parallel actors accumulate experience, periodically update shared network. Reduces correlation, stabilizes training.
Proximal Policy Optimization (PPO): clipped objective prevents large policy changes:
L(theta) = E[min(r_t * A_t, clip(r_t, 1-epsilon, 1+epsilon) * A_t)]
r_t = pi_new(a|s) / pi_old(a|s)   [probability ratio]
Prevents excessively large updates. Popular, sample-efficient, stable.
Trust Region Methods
TRPO (Trust Region Policy Optimization): constrain policy updates to trust region where old approximation valid. Guarantees monotonic improvement. Computationally expensive but theoretically sound.
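Returning to PPO's clipped objective, the clipping arithmetic for a single sample can be sketched; the ratio and advantage values below are invented to show both cases.

```python
# Clipped surrogate term for one (state, action) sample.
def ppo_clip_term(ratio, advantage, epsilon=0.2):
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1 + epsilon
good = ppo_clip_term(1.5, 2.0)    # min(1.5*2.0, 1.2*2.0) = 2.4
# Negative advantage: the pessimistic (more negative) term is kept
bad = ppo_clip_term(0.5, -1.0)    # min(0.5*-1.0, 0.8*-1.0) = -0.8
```

In both directions the min picks the more conservative term, so a single minibatch cannot move the policy far from pi_old.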
Exploration-Exploitation Trade-off
Core Dilemma
Exploitation: use current best-known action. Exploration: try unknown actions seeking better rewards. Pure exploitation misses better strategies. Pure exploration wastes time on suboptimal actions. Balance critical for learning.
Epsilon-Greedy
With probability epsilon, take random action (explore). With probability 1-epsilon, take greedy action (exploit). Simple, effective. Epsilon typically decays over time.
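The rule fits in a few lines; the Q-values passed in below are invented for illustration.

```python
import random

def epsilon_greedy(q_values, eps):
    """Pick a random action with probability eps, else the greedy one."""
    if random.random() < eps:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

action = epsilon_greedy([0.1, 0.9, 0.3], 0.1)   # usually returns 1, the argmax
```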
Upper Confidence Bound (UCB)
Explore actions with high uncertainty. Select action a = argmax_a [Q(a) + c*sqrt(ln(N)/N(a))]
First term: exploitation (high Q value). Second term: exploration (high uncertainty, N(a) = times action tried). Balances automatically based on value and uncertainty.
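A minimal UCB1 sketch on a 2-armed Bernoulli bandit; the arm means (0.8 vs 0.2) and the exploration constant c = 1 are invented toy values.

```python
import math, random

random.seed(3)
means = [0.8, 0.2]
counts = [0, 0]          # N(a): times each arm was tried
values = [0.0, 0.0]      # running mean reward per arm

for n in range(1, 2001):
    if 0 in counts:      # play each arm once before applying the UCB rule
        a = counts.index(0)
    else:
        a = max(range(2),
                key=lambda i: values[i] + 1.0 * math.sqrt(math.log(n) / counts[i]))
    r = 1.0 if random.random() < means[a] else 0.0
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]   # incremental mean update
# counts[0] ends far larger than counts[1]: play concentrates on the better arm
```

The bonus term shrinks as an arm's count grows, so under-tried arms keep getting revisited until their uncertainty is resolved.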
Thompson Sampling
Maintain a distribution over Q-values (or arm means). Sample from the distribution, act greedily w.r.t. the sample. Naturally balances exploration and exploitation, with strong theoretical regret guarantees.
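A minimal Thompson sampling sketch with Beta posteriors on a 2-armed Bernoulli bandit; the arm means (0.7 vs 0.3) are invented toy values.

```python
import random

random.seed(4)
means = [0.7, 0.3]
posterior = [[1.0, 1.0], [1.0, 1.0]]   # Beta(alpha, beta) per arm, uniform prior
counts = [0, 0]

for _ in range(2000):
    samples = [random.betavariate(a, b) for a, b in posterior]
    arm = samples.index(max(samples))   # act greedily w.r.t. the sampled values
    r = 1 if random.random() < means[arm] else 0
    posterior[arm][0] += r              # success -> increment alpha
    posterior[arm][1] += 1 - r          # failure -> increment beta
    counts[arm] += 1
# counts[0] dominates: the posterior concentrates play on the better arm
```

Exploration falls out automatically: a rarely tried arm has a wide posterior, so it occasionally produces the largest sample and gets played.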
Curiosity-Driven Exploration
Intrinsic motivation: reward prediction error. The agent receives bonuses for states where its model is uncertain (high prediction error). Discovers interesting states, accelerates learning in sparse-reward environments.
Applications and Breakthroughs
Game Playing
AlphaGo (2016): a combination of deep networks, Monte Carlo tree search, and RL beat Lee Sedol at Go, a game with roughly 10^170 legal positions. AlphaZero later learned chess, shogi, and Go from scratch, surpassing human champions.
Robot Control
Learn manipulation (reaching, grasping, in-hand reorientation). Learning from sim-to-real: policy trained in simulation transfers to real robots (sometimes). Challenges: sample efficiency, safety.
Autonomous Vehicles
RL for motion planning, trajectory optimization, interaction with other agents. Most systems use RL combined with classical planning.
Resource Optimization
Data center cooling, network routing, power systems. RL learns adaptive control policies, often outperforming hand-tuned heuristics.
Recommendation Systems
Bandit algorithms (contextual bandits) balance exploration and exploitation. Learn which items to recommend, treating user feedback as reward signal.
Challenges and Open Problems
Sample Efficiency
Deep RL typically requires millions of interactions. Humans learn from dozens. Challenges: high-dimensional action spaces, long horizons requiring planning, sparse rewards. Research: model-based RL, imitation learning, meta-learning.
Reward Specification
Designing reward function difficult. Rewards should incentivize desired behavior but often encode unintended solutions (reward hacking). Example: agent exploits glitch for high score rather than intended gameplay.
Non-Stationary Environments
Real environments change. Agent must adapt continually. Catastrophic forgetting: learning new task erases old knowledge. Research: multi-task learning, continual learning.
Exploration in Complex Spaces
Random exploration fails in high-dimensional spaces. Needles-in-haystack problem: good rewards extremely rare. Solutions: better intrinsic motivation, hierarchical RL, curriculum learning.
Partial Observability
Agent sees partial state (observations not full Markov state). Requires memory (RNNs). POMDPs harder than MDPs. Theoretical understanding incomplete.
Generalization
Policies trained on one domain often fail on similar variants. Transfer learning in RL underdeveloped. Real progress requires better abstraction, hierarchical representations.
References
- Sutton, R. S., and Barto, A. G. "Reinforcement Learning: An Introduction." MIT Press, 2nd edition, 2018.
- Mnih, V., Kavukcuoglu, K., Silver, D., et al. "Playing Atari with Deep Reinforcement Learning." NIPS Deep Learning Workshop, 2013.
- Silver, D., Schrittwieser, J., Simonyan, K., et al. "Mastering the Game of Go without Human Knowledge." Nature, vol. 550, 2017.
- Schulman, J., Wolski, F., Dhariwal, P., et al. "Proximal Policy Optimization Algorithms." arXiv:1707.06347, 2017.
- Lillicrap, T., Hunt, J. J., Pritzel, A., et al. "Continuous Control with Deep Reinforcement Learning." ICLR, 2016.