Trust Region Policy Optimization (TRPO)

Definition

Trust Region Policy Optimization

TRPO is an on-policy policy gradient algorithm that improves the policy by maximizing a surrogate objective (the expected advantage under the new policy, weighted by an importance ratio) subject to a hard constraint that the new policy stays close to the old one in average KL divergence. The KL constraint defines a “trust region” inside which the surrogate is a reliable approximation of the true performance, guaranteeing monotonic (non-decreasing) policy improvement in the idealized case.

Intuition

Vanilla policy gradient takes a fixed-size step in parameter space (). The problem: a small change in can cause a huge change in the actual policy distribution, collapsing performance. Because policy gradient is on-policy, one bad update poisons all future data — there is no recovery from off-policy replay.

TRPO fixes this by measuring step size in policy space instead of parameter space. It asks: “How far can I move the policy and still trust my data-derived estimate of improvement?” The answer is bounded by the KL divergence between old and new policy. Stay inside the trust region , and the surrogate objective faithfully predicts the true return; the update is then guaranteed not to make things worse.

Why a Constraint Instead of a Penalty

Theory (the MM / minorize-maximize bound) actually suggests a KL penalty with a coefficient tied to the max advantage. But that coefficient forces tiny, conservative steps. TRPO instead uses a hard KL constraint as a robust, tunable proxy that allows much larger practical steps while keeping the trust-region guarantee.

Mathematical Formulation

The surrogate objective. TRPO maximizes the expected advantage of the new policy , with state visitation taken from the old policy and actions reweighted by an importance ratio:

The constrained optimization problem solved at each iteration:

where:

  • importance ratio correcting for the fact that data was collected under but we score
  • advantage under the old policy (estimated in practice by GAE)
  • — discounted state-visitation distribution under the old policy
  • average KL over visited states (the max-KL version is intractable)
  • — trust-region radius (typical: )

Solving it (the natural-gradient connection). Linearize the objective and quadratically approximate the constraint around :

where is the policy gradient and is the Fisher Information Matrix (the Hessian of the KL constraint). The closed-form solution is the natural gradient step scaled to fill the trust region:

where:

  • — the natural gradient direction (search direction obtained by solving with conjugate gradient, never forming explicitly)
  • maximal step size along that keeps the quadratic KL at exactly

Monotonic Improvement Bound

TRPO is derived from a guarantee on the true return : with . Maximizing the right-hand side (a minorant of the true return) cannot decrease — this is the monotonic-improvement property. TRPO replaces the penalty with the trust-region constraint for larger, practical steps.

Key Properties / Variants

  • On-policy. Fresh trajectories are collected from each iteration; the importance ratio is only a local correction valid near (inside the trust region).
  • Second-order / natural-gradient method. The Fisher matrix is the curvature of policy space; TRPO is essentially Natural Policy Gradient with an automatically chosen step size plus safeguards.
  • Hessian-free. Conjugate gradient solves using only Fisher-vector products ( via a double back-prop through the KL), avoiding the cost of inverting .
  • Backtracking line search enforces the true (non-approximated) constraint and a real improvement in the surrogate — protecting against errors from the quadratic approximation.
  • Monotonic improvement guarantee (in the exact setting), unlike vanilla policy gradient which can collapse.
  • Cost / drawbacks: the CG + line search machinery is complex and per-update expensive; it does not naturally share parameters between policy and value networks, and is awkward with architectures involving heavy noise/dropout.
  • Successor — Proximal Policy Optimization: replaces the hard KL constraint with a clipped surrogate ratio (or a soft KL penalty), recovering most of TRPO’s stability with first-order SGD and far simpler code. PPO is now the default; TRPO is the theoretical anchor.
Algorithm: Trust Region Policy Optimization (TRPO)
──────────────────────────────────────────────────
Input: initial policy θ, value/critic params w, trust radius δ
Loop for each iteration:
  1. Collect trajectories by running π_θ_old (θ_old ← θ) in the environment
  2. Estimate advantages  Â_t   (e.g., GAE with the critic V_w)
  3. Compute policy gradient   g = ∇_θ L(θ) |_θ_old
       L(θ) = mean_t [ (π_θ(a_t|s_t) / π_θ_old(a_t|s_t)) · Â_t ]
  4. Solve  F x = g  by conjugate gradient    (x ≈ F⁻¹ g, natural-grad direction)
       using Fisher-vector products  F v = ∇_θ ( (∇_θ KL)·v )   (no explicit F)
  5. Compute max step:  β = sqrt( 2δ / (xᵀ F x) )
  6. Backtracking line search over j = 0,1,2,...:
       θ_try = θ_old + (α^j) · β · x          (α ∈ (0,1), e.g. 0.5)
       accept first θ_try with  KL(π_θ_old ‖ π_θ_try) ≤ δ
                               and  L(θ_try) > L(θ_old)
       θ ← θ_try   (if none accepted, keep θ_old)
  7. Fit critic: minimize  Σ_t ( V_w(s_t) − R̂_t )²

Connections

Appears In