REINFORCE

Definition

REINFORCE is a Monte Carlo policy gradient algorithm that directly implements the Policy Gradient Theorem. It updates policy parameters by sampling complete episodes (trajectories) and using the discounted return as a gradient weight.

The update rule:

where is the return for the entire episode.

Historical Significance

REINFORCE (Williams, 1992) was the first practical implementation of policy gradients. It:

  • Proved policy gradients were feasible
  • Avoided value function learning
  • Provided unbiased gradient estimates
  • Formed the basis for modern policy gradient methods (Actor-Critic, PPO, etc.)

Intuition

Core Idea

Sample trajectories from the current policy and update it to make high-return episodes more likely:

  1. Roll out an episode from current policy
  2. Compute total return
  3. For each step, increase log-probability of that action weighted by the episode return
  4. Repeat

Why It Works

  • Gradient ascent: Each update increases expected return in the direction of the policy gradient
  • Unbiased: The gradient estimate has the correct expectation
  • Model-free: Doesn’t require knowing the environment dynamics
  • General: Works with any differentiable policy parameterization

Why It’s Limited

  • High variance: Uses full episode return (compounds over time)
  • Slow: Needs many episodes for reliable estimates
  • Episodic only: Requires complete episodes (continuing tasks need horizons)
  • Credit assignment: All actions share credit/blame for entire trajectory

Algorithm

Pseudocode

Initialize policy parameters θ
Repeat:
  τ ← sample episode from π_θ
  G ← return of τ = Σ γ^t r_t
  θ ← θ + α · G · Σ ∇_θ log π_θ(a_t|s_t)

Batch Version (Clearer for Implementation)

Initialize policy parameters θ
Repeat:
  D ← sample N episodes under π_θ
  For i = 1 to N:
    G_i ← return of episode i
    ∇_i ← Σ ∇_θ log π_θ(a_{i,t}|s_{i,t})
  θ ← θ + (α/N) Σ G_i · ∇_i

Continuous Action Example

For a Gaussian policy a \sim \mathcal{N}}(\mu_\theta(s), \sigma):

  1. Sample action:
  2. Gradient:
  3. Update:

Discrete Action Example

For a softmax policy :

  1. Sample action from softmax
  2. Gradient:
  3. Update policy accordingly

Variants and Improvements

REINFORCE with Baseline

Reduces variance by subtracting a learned value function:

  • Unbiased: Value function doesn’t depend on actions
  • Lower variance: Compares to expected return from state
  • Practical improvement: Typically needed for good performance

REINFORCE v2 (Causality)

Only uses forward returns (return from time onward):

where

  • Intuition: Action can’t affect rewards before time
  • Variance reduction: Removes unnecessary noise from past
  • Still unbiased: Doesn’t change expectation

With Advantages

Use advantage function from learned Actor-Critic setup:

  • Maximum flexibility: Can approximate and separately
  • Modern form: Basis of A2C, PPO, etc.

Properties

Strengths

Unbiased: Converges to stationary points of Consistent: Sample average converges to true gradient ✓ General: Works with any differentiable policy ✓ Model-free: Needs only policy samples, not dynamics ✓ Simple: Easy to implement and understand ✓ Handles stochasticity: Works with stochastic optimal policies

Weaknesses

High variance: Full return has high variance ✗ Sample inefficiency: Needs many episodes ✗ Slow convergence: Can be slower than value-based methods ✗ Episodic: Requires complete episodes ✗ Credit assignment: Actions credited equally for entire trajectory ✗ Deterministic policies: Can’t learn truly deterministic optimal policies

Practical Considerations

Gradient Magnitude Issues

Returns can vary wildly (positive/negative, large magnitudes):

  • Solution 1: Normalize returns:
  • Solution 2: Use advantage: (learned baseline)
  • Solution 3: Reduce baseline per-state: (constant)

Step Size Tuning

  • Learning rate is critical (0.01 to 0.001 typical)
  • Too high: Divergence
  • Too low: Very slow learning
  • Often use adaptive learning rates (Adam, RMSprop)

Variance Reduction in Practice

Order of importance:

  1. Baseline (most important): Reduces variance dramatically
  2. Causality (moderate): Cuts one source of variance
  3. Normalization (helpful): Stabilizes learning
  4. Entropy regularization (optional): Encourages exploration

Connections

Modern Context

REINFORCE is largely superseded by more advanced algorithms (PPO, A3C), but:

  • Still the simplest policy gradient algorithm
  • Educational value: teaches core principles
  • Effective with proper baselines and variance reduction
  • Used in some simple domains

Appears In