REINFORCE

Definition

REINFORCE is a Monte Carlo policy gradient algorithm that directly implements the Policy Gradient Theorem. It updates policy parameters by sampling complete episodes (trajectories) and using the discounted return as a gradient weight.

The update rule:

$θ \leftarrow θ + α \sum_{t = 0}^{T - 1} \nabla_{θ} \log π_{θ} (a_{t} ∣ s_{t}) \cdot G (τ)$

where $G (τ) = \sum_{t = 0}^{T - 1} γ^{t} r_{t}$ is the return for the entire episode.
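As a small illustration of the return defined above, the sketch below computes $G (τ)$ from a list of rewards. The function name `discounted_return` and the sample rewards are assumptions made for this example, not part of the original notes.

```python
def discounted_return(rewards, gamma):
    """Return G(tau) = sum_t gamma**t * r_t for one episode's reward list."""
    g = 0.0
    # Work backwards so each step is a single multiply-add: G = r_t + gamma * G.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Illustrative rewards: 1 + 0.5 * 0 + 0.25 * 2 = 1.5
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))
```

The backward accumulation avoids computing powers of γ explicitly and gives the same value as the forward sum.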

Historical Significance

REINFORCE (Williams, 1992) was the first practical implementation of policy gradients. It:

Proved policy gradients were feasible
Avoided value function learning
Provided unbiased gradient estimates
Formed the basis for modern policy gradient methods (Actor-Critic, PPO, etc.)

Intuition

Core Idea

Sample trajectories from the current policy and update it to make high-return episodes more likely:

Roll out an episode from current policy $π_{θ}$
Compute the discounted return $G = \sum_{t} γ^{t} r_{t}$
For each step, increase log-probability of that action weighted by the episode return
Repeat

Why It Works

Gradient ascent: Each update increases expected return in the direction of the policy gradient
Unbiased: The gradient estimate has the correct expectation
Model-free: Doesn’t require knowing the environment dynamics
General: Works with any differentiable policy parameterization

Why It’s Limited

High variance: Every action is weighted by the full episode return, so reward noise from all time steps accumulates in each gradient term
Slow: Needs many episodes for reliable estimates
Episodic only: Requires complete episodes (continuing tasks must be truncated to a finite horizon)
Credit assignment: All actions share credit/blame for entire trajectory

Algorithm

Pseudocode

Initialize policy parameters θ
Repeat:
  τ ← sample episode from π_θ
  G ← return of τ = Σ γ^t r_t
  θ ← θ + α · G · Σ ∇_θ log π_θ(a_t|s_t)
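
The pseudocode above can be sketched in plain Python roughly as follows. The toy task (the state is the step index, and action 1 pays reward 1), the tabular softmax policy, and the names `sample_episode` and `reinforce_step` are all assumptions chosen to keep the example self-contained. They are not taken from the notes.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_episode(theta, rng, horizon=3):
    """Hypothetical toy task: the state is the step index; action 1 pays reward 1."""
    states, actions, rewards = [], [], []
    for s in range(horizon):
        p0 = softmax(theta[s])[0]
        a = 0 if rng.random() < p0 else 1
        states.append(s)
        actions.append(a)
        rewards.append(float(a))
    return states, actions, rewards


def reinforce_step(theta, episode, alpha, gamma):
    """One update: theta <- theta + alpha * G * sum_t grad log pi(a_t | s_t)."""
    states, actions, rewards = episode
    G = sum(gamma ** t * r for t, r in enumerate(rewards))
    # Accumulate the summed score first so every term uses the pre-update policy.
    grad = [[0.0, 0.0] for _ in theta]
    for s, a in zip(states, actions):
        probs = softmax(theta[s])
        for b in range(2):
            # Tabular softmax: d log pi(a|s) / d theta[s][b] = 1[b == a] - pi(b|s)
            grad[s][b] += (1.0 if b == a else 0.0) - probs[b]
    for s in range(len(theta)):
        for b in range(2):
            theta[s][b] += alpha * G * grad[s][b]
    return G


rng = random.Random(0)
theta = [[0.0, 0.0] for _ in range(3)]  # one row of two logits per state
for _ in range(5000):
    reinforce_step(theta, sample_episode(theta, rng), alpha=0.05, gamma=0.99)
print([round(softmax(row)[1], 3) for row in theta])  # probability of action 1 per state
```

On this toy task the probability of action 1 should drift toward 1 in every state, though the exact numbers depend on the seed and step size.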

Batch Version (Clearer for Implementation)

Initialize policy parameters θ
Repeat:
  D ← sample N episodes under π_θ
  For i = 1 to N:
    G_i ← return of episode i
    ∇_i ← Σ ∇_θ log π_θ(a_{i,t}|s_{i,t})
  θ ← θ + (α/N) Σ G_i · ∇_i
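
The batch update reduces to averaging $G_i \cdot \nabla_i$ over episodes and taking one ascent step. The sketch below assumes the per-episode summed score vectors $\nabla_i$ have already been computed. The function names and the numbers are made up for illustration.

```python
def batch_policy_gradient(score_sums, returns):
    """Average G_i * grad_i over N episodes.

    score_sums[i] is the vector sum_t grad log pi(a_t | s_t) for episode i,
    and returns[i] is that episode's return G_i.
    """
    n = len(returns)
    dim = len(score_sums[0])
    grad = [0.0] * dim
    for g_i, score in zip(returns, score_sums):
        for k in range(dim):
            grad[k] += g_i * score[k] / n
    return grad


def gradient_ascent_step(theta, grad, alpha):
    """theta <- theta + alpha * grad (ascent, because we maximize return)."""
    return [th + alpha * g for th, g in zip(theta, grad)]


# Two illustrative episodes with two policy parameters.
score_sums = [[1.0, -1.0], [-0.5, 0.5]]
returns = [4.0, 2.0]
grad = batch_policy_gradient(score_sums, returns)
print(grad)  # [(4*1 + 2*(-0.5)) / 2, (4*(-1) + 2*0.5) / 2] = [1.5, -1.5]
print(gradient_ascent_step([0.0, 0.0], grad, alpha=0.1))
```

Averaging over N episodes leaves the expected gradient unchanged and divides the variance of the estimate by N, which is why the batch form is usually preferred in implementations.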

Continuous Action Example

For a Gaussian policy $a \sim \mathcal{N} (μ_{θ} (s), σ^{2})$ with fixed standard deviation $σ$:

Sample action: $a_{t} \sim \mathcal{N} (μ_{θ} (s_{t}), σ^{2})$
Gradient: $\nabla_{θ} \log π_{θ} (a_{t} ∣ s_{t}) = \frac{1}{σ^{2}} (a_{t} - μ_{θ} (s_{t})) \nabla_{θ} μ_{θ} (s_{t})$
Update: $θ \leftarrow θ + α G \cdot \frac{1}{σ^{2}} (a_{t} - μ_{θ} (s_{t})) \nabla_{θ} μ_{θ} (s_{t})$
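
To make the Gaussian gradient concrete, the sketch below assumes the simplest possible mean, a scalar linear model $μ_{θ} (s) = θ s$, so that $\nabla_{θ} μ_{θ} (s) = s$. It compares the analytic score with a finite-difference estimate. The linear mean and all numeric values are illustrative assumptions.

```python
import math


def gaussian_log_prob(a, mean, sigma):
    """Log density of N(mean, sigma^2) evaluated at a."""
    return (-0.5 * ((a - mean) / sigma) ** 2
            - math.log(sigma)
            - 0.5 * math.log(2 * math.pi))


def score_linear_mean(theta, s, a, sigma):
    """grad_theta log pi(a | s) for the assumed mean mu_theta(s) = theta * s."""
    mean = theta * s
    return (a - mean) / sigma ** 2 * s  # the last factor is grad_theta mu_theta(s) = s


theta, s, a, sigma = 0.5, 2.0, 1.8, 0.4
analytic = score_linear_mean(theta, s, a, sigma)

# Central finite difference of the log-probability with respect to theta.
eps = 1e-6
numeric = (gaussian_log_prob(a, (theta + eps) * s, sigma)
           - gaussian_log_prob(a, (theta - eps) * s, sigma)) / (2 * eps)
print(analytic, numeric)  # the two values should agree closely
```

The sign of the score matches the intuition: when the sampled action lies above the mean, a positive return pushes the mean upward.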

Discrete Action Example

For a softmax policy $π_{θ} (a ∣ s) = \frac{\exp f_{θ} (s, a)}{\sum_{a^{'}} \exp f_{θ} (s, a^{'})}$:

Sample action from softmax
Gradient: $\nabla_{θ} \log π_{θ} (a ∣ s) = \nabla_{θ} f_{θ} (s, a) - E_{a^{'} \sim π_{θ} (\cdot ∣ s)} [\nabla_{θ} f_{θ} (s, a^{'})]$
Update: apply the REINFORCE step using this gradient, weighted by the return
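
The softmax gradient can be checked numerically. The sketch below assumes linear logits $f_{θ} (s, a) = θ \cdot φ (s, a)$, for which $\nabla_{θ} f_{θ} (s, a) = φ (s, a)$. The feature vectors and parameter values are invented for the example.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_probs(theta, features):
    """features[b] is phi(s, b); the logit of action b is theta . phi(s, b)."""
    logits = [sum(t * x for t, x in zip(theta, phi)) for phi in features]
    return softmax(logits)


def softmax_score(theta, features, a):
    """grad_theta log pi(a | s) = phi(s, a) - E_{a' ~ pi}[phi(s, a')]."""
    probs = action_probs(theta, features)
    dim = len(theta)
    expected_phi = [sum(p * phi[k] for p, phi in zip(probs, features))
                    for k in range(dim)]
    return [features[a][k] - expected_phi[k] for k in range(dim)]


theta = [0.2, -0.1]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # phi(s, a) for three actions
probs = action_probs(theta, features)
scores = [softmax_score(theta, features, a) for a in range(3)]

# The score has zero mean under the policy, which is what makes baselines unbiased.
mean_score = [sum(p * sc[k] for p, sc in zip(probs, scores)) for k in range(2)]
print(mean_score)  # each component should be zero up to rounding
```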

Variants and Improvements

REINFORCE with Baseline

Reduces variance by subtracting a learned value function:

$θ \leftarrow θ + α \sum_{t} \nabla_{θ} \log π_{θ} (a_{t} ∣ s_{t}) \cdot (G - V (s_{t}))$

Unbiased: The baseline depends only on the state, not on the action, so its expected contribution to the gradient is zero
Lower variance: Compares to expected return from state
Practical improvement: Typically needed for good performance
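
The effect of a baseline can be computed exactly on a tiny problem. The sketch below assumes a two-armed bandit with rewards 10 and 11 and a uniform softmax policy, and looks at the gradient with respect to the logit of action 1. The rewards and function name are illustrative assumptions.

```python
def estimator_stats(probs, rewards, baseline):
    """Exact mean and variance of score * (R - b) for the logit of action 1."""
    p1 = probs[1]
    # Score for the logit of action 1 under a two-action softmax: 1[a == 1] - p1.
    samples = [((1.0 if a == 1 else 0.0) - p1) * (rewards[a] - baseline)
               for a in range(2)]
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var


probs = [0.5, 0.5]
rewards = [10.0, 11.0]
value = sum(p * r for p, r in zip(probs, rewards))  # V = expected reward = 10.5

print(estimator_stats(probs, rewards, baseline=0.0))    # (0.25, 27.5625)
print(estimator_stats(probs, rewards, baseline=value))  # (0.25, 0.0)
```

Both estimators have the same mean, so the baseline introduces no bias, while the variance drops from 27.5625 to 0 in this particular case. In general the reduction is large but not total.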

REINFORCE v2 (Causality)

Only uses forward returns (return from time $t$ onward):

$θ \leftarrow θ + α \sum_{t = 0}^{T - 1} \nabla_{θ} \log π_{θ} (a_{t} ∣ s_{t}) \cdot G_{t}$

where $G_{t} = \sum_{t^{'} = t}^{T - 1} γ^{t^{'} - t} r_{t^{'}}$

Intuition: Action $a_{t}$ can’t affect rewards before time $t$
Variance reduction: Removes unnecessary noise from past
Still unbiased: Doesn’t change expectation
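
The forward returns $G_{t}$ for every step of an episode can be computed in a single backward pass. This is a minimal sketch. The name `rewards_to_go` and the sample rewards are assumptions for illustration.

```python
def rewards_to_go(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] with G_t = sum_{t' >= t} gamma**(t' - t) * r_{t'}."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Recurrence: G_t = r_t + gamma * G_{t+1}, with G_T = 0.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


print(rewards_to_go([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

Each entry weights only the rewards that the corresponding action could still influence, and the first entry $G_{0}$ equals the full-episode return.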

With Advantages

Use the advantage function $A (s, a) = Q (s, a) - V (s)$, estimated in a learned Actor-Critic setup:

$θ \leftarrow θ + α \sum_{t} \nabla_{θ} \log π_{θ} (a_{t} ∣ s_{t}) \cdot A (s_{t}, a_{t})$

Maximum flexibility: Can approximate $Q$ and $V$ separately
Modern form: Basis of A2C, PPO, etc.

Properties

Strengths

✓ Unbiased: Converges to stationary points of $J (θ)$
✓ Consistent: Sample average converges to true gradient
✓ General: Works with any differentiable policy
✓ Model-free: Needs only policy samples, not dynamics
✓ Simple: Easy to implement and understand
✓ Handles stochasticity: Works with stochastic optimal policies

Weaknesses

✗ High variance: Full return has high variance
✗ Sample inefficiency: Needs many episodes
✗ Slow convergence: Can be slower than value-based methods
✗ Episodic: Requires complete episodes
✗ Credit assignment: Actions credited equally for entire trajectory
✗ Deterministic policies: Can’t learn truly deterministic optimal policies

Practical Considerations

Gradient Magnitude Issues

Returns $G$ can vary wildly (positive/negative, large magnitudes):

Solution 1: Normalize returns: $(G - \bar{G}) / σ_{G}$
Solution 2: Use advantage: $G - V (s)$ (learned baseline)
Solution 3: Use a simpler baseline: $G - V (s_{0})$ (one constant per episode, based on the start state)
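
Return normalization (Solution 1) might look like the sketch below, applied to a batch of episode returns. The small `eps` term is an added assumption to avoid division by zero when all returns are equal. It is not part of the formula above.

```python
import statistics


def normalize_returns(returns, eps=1e-8):
    """Rescale a batch of returns to roughly zero mean and unit standard deviation."""
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)  # population standard deviation of the batch
    return [(g - mean) / (std + eps) for g in returns]


batch = [120.0, -30.0, 45.0, 60.0]  # illustrative episode returns
normalized = normalize_returns(batch)
print(normalized)
```

After normalization, roughly half of the episodes receive a negative weight, so below-average episodes are pushed down instead of merely being pushed up less.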

Step Size Tuning

Learning rate $α$ is critical (typically between 0.001 and 0.01)
Too high: Divergence
Too low: Very slow learning
Often use adaptive learning rates (Adam, RMSprop)

Variance Reduction in Practice

Order of importance:

Baseline (most important): Reduces variance dramatically
Causality (moderate): Removes the variance contributed by rewards received before each action
Normalization (helpful): Stabilizes learning
Entropy regularization (optional): Encourages exploration

Connections

Implements: Policy Gradient Theorem
Foundation for: Actor-Critic, A2C, A3C
Related to: Monte Carlo Methods, Gradient Ascent
Uses: Policy parameterization (softmax, Gaussian, etc.)
Requires: Differentiable policy

Modern Context

REINFORCE is largely superseded by more advanced algorithms (PPO, A3C), but:

Still the simplest policy gradient algorithm
Educational value: teaches core principles
Effective with proper baselines and variance reduction
Used in some simple domains

Appears In

Policy Gradient Methods — Foundational algorithm
Actor-Critic — Extensions with learned value
Advantage Actor-Critic (A2C) — Direct successor
Policy Gradient Methods — Core RL course topic
Deep Reinforcement Learning — When optimizing neural network policies
