Return
Definition
Return
$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots$$
The return $G_t$ is the total accumulated reward from time step $t$ onward. It is the quantity that RL agents seek to maximize (in expectation).
Return (Discounted)
$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
Variants
Episodic (undiscounted or discounted):
$$G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$$
where $T$ is the terminal time step. With $\gamma = 1$, this is just the sum of all remaining rewards.
Continuing (must have $\gamma < 1$):
$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
Converges as long as $\gamma < 1$ and rewards are bounded.
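The two variants can be checked numerically. A minimal sketch (the reward list and discount value are made-up numbers) that computes the discounted return directly from the definition:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} from a list of rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Episodic, gamma = 1: just the sum of the remaining rewards.
print(discounted_return([1.0, 1.0, 1.0], gamma=1.0))  # 3.0

# With discounting, later rewards count for less:
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```

With $\gamma < 1$ each term is bounded by $\gamma^k \max|r|$, which is why the infinite continuing-case sum converges.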
Recursive Property
Recursive Return
$$G_t = R_{t+1} + \gamma G_{t+1}$$
This is the key recursive relationship that enables Bootstrapping and the Bellman Equation.
Why This Matters
You don’t need to compute the entire sum from scratch. The return at time $t$ equals the immediate reward plus the discounted return from the next step. This decomposition is the foundation of Dynamic Programming and Temporal Difference Learning.
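The recursion suggests an efficient way to compute every return in an episode: sweep backward from the terminal step (where $G_T = 0$) and apply $G_t = R_{t+1} + \gamma G_{t+1}$ once per step. A sketch with made-up rewards:

```python
def returns_from_rewards(rewards, gamma):
    """Compute G_t for every time step of an episode via the
    recursion G_t = R_{t+1} + gamma * G_{t+1}, sweeping backward
    from the terminal step (G_T = 0)."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    return returns

# G_0 = 1 + 0.5 * (2 + 0.5 * 3) = 2.75
print(returns_from_rewards([1.0, 2.0, 3.0], gamma=0.5))  # [2.75, 3.5, 3.0]
```

This is linear in the episode length, versus quadratic if each $G_t$ were summed independently.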
Role in RL Methods
- Monte Carlo Methods: Estimate $v_\pi(s)$ by averaging the actual returns $G_t$ observed after visiting state $s$
- Temporal Difference Learning: Approximates $G_t$ with $R_{t+1} + \gamma V(S_{t+1})$ (one-step bootstrap)
- Value Function: Defined as the expected return: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
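The Monte Carlo and TD targets above can be contrasted side by side. A sketch (all rewards and the value estimate `V_next` are made-up numbers, not from the text):

```python
gamma = 0.9
# Hypothetical tail of an episode: rewards R_{t+1}, R_{t+2}, R_{t+3}.
rewards = [0.0, 0.0, 1.0]

# Monte Carlo target: the full actual return G_t, available only
# after the episode finishes.
mc_target = sum(gamma ** k * r for k, r in enumerate(rewards))

# TD(0) target: available after one step, bootstrapping from the
# current (assumed) estimate V(S_{t+1}).
V_next = 0.75
td_target = rewards[0] + gamma * V_next

print(mc_target)  # 0.9^2 * 1 = 0.81
print(td_target)  # 0 + 0.9 * 0.75 = 0.675
```

The MC target is unbiased but must wait for the episode to end; the TD target is available immediately but inherits any error in the current estimate $V(S_{t+1})$.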
Connections
- Used to define: Value Function
- Discounted by: Discount Factor
- Estimated by: Monte Carlo Methods (full), Temporal Difference Learning (bootstrapped)
- Recursive structure enables: Bellman Equation