Study Notes

Bootstrapping

Jun 06, 2026

foundations

In RL, bootstrapping means updating an estimate based partly on other estimates (rather than exclusively on actual observed values). The update target includes a current estimate of a value function.

Examples

Dynamic Programming (iterative policy evaluation): $V(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) [r + \gamma V(s')]$. The target uses $V(s')$, which is itself an estimate.
TD(0): $V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$. The target uses the estimate $V(S_{t+1})$.
Monte Carlo Methods: $V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)]$. The target is the actual return $G_t$, so this is NOT bootstrapping.
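The difference between the TD(0) and Monte Carlo updates above can be sketched in a few lines. This is a minimal illustration, not a full learning algorithm: the state names `"A"` and `"B"`, the starting values, and the observed return of 1.0 are all made up for the example.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """One TD(0) update. The target bootstraps from the estimate V[s_next]."""
    bootstrap = 0.0 if terminal else V[s_next]
    target = r + gamma * bootstrap
    V[s] += alpha * (target - V[s])
    return target


def mc_update(V, s, G, alpha=0.1):
    """One Monte Carlo update. The target is the observed return G (no bootstrapping)."""
    V[s] += alpha * (G - V[s])
    return G


# Hypothetical current estimates for two states.
V_td = {"A": 0.0, "B": 0.5}
V_mc = dict(V_td)

# Same experience for both: A -> B with reward 0, and the episode
# eventually yields a total return of 1.0 from A.
td_target = td0_update(V_td, "A", r=0.0, s_next="B")  # target built from V["B"]
mc_target = mc_update(V_mc, "A", G=1.0)               # target is the real return

print("TD target:", td_target, "-> V[A] =", V_td["A"])
print("MC target:", mc_target, "-> V[A] =", V_mc["A"])
```

The TD update can be applied right after the first transition, but its target is only as good as the current `V["B"]`. The MC update has to wait for the end of the episode, and its target does not depend on any estimate.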

Trade-Off

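Bootstrapping introduces bias, because the target depends on the current value estimate, which is wrong early in learning. In exchange it reduces variance: the target depends on a single reward and transition rather than on the full noisy return. The MC target $G_t$ is an unbiased sample of the true value but has high variance. The TD target is biased but typically has much lower variance.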
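The trade-off can be checked empirically on a small problem. The sketch below assumes an undiscounted 5-state random walk (terminal states at both ends, reward 1 only when exiting on the right), for which the true values are known to be $s/6$. The environment, the constant initial estimate of 0.5, and the sample count are assumptions chosen for illustration.

```python
import random
from statistics import mean, pvariance

N_STATES = 5  # non-terminal states 1..5; states 0 and 6 are terminal
TRUE_V = {s: s / 6 for s in range(1, N_STATES + 1)}  # known for this walk
V = {s: 0.5 for s in range(1, N_STATES + 1)}         # imperfect current estimate


def step(s, rng):
    """Move left or right with equal probability; reward 1 on right exit."""
    s_next = s + rng.choice((-1, 1))
    reward = 1.0 if s_next == N_STATES + 1 else 0.0
    done = s_next in (0, N_STATES + 1)
    return reward, s_next, done


def sample_targets(start, rng):
    """Return (TD target, MC return) computed from the same episode."""
    r, s, done = step(start, rng)
    td_target = r + (0.0 if done else V[s])  # bootstraps from V after one step
    G = r
    while not done:                          # MC needs the full episode
        r, s, done = step(s, rng)
        G += r
    return td_target, G


rng = random.Random(0)
td_targets, mc_targets = zip(*(sample_targets(5, rng) for _ in range(20_000)))

td_bias = mean(td_targets) - TRUE_V[5]
mc_bias = mean(mc_targets) - TRUE_V[5]
td_var = pvariance(td_targets)
mc_var = pvariance(mc_targets)

print(f"TD target: bias={td_bias:+.3f}, variance={td_var:.3f}")
print(f"MC target: bias={mc_bias:+.3f}, variance={mc_var:.3f}")
```

In this setup the TD target is systematically off because `V` is wrong, while the MC target averages out to the true value but is spread more widely. The bias of the TD target shrinks as `V` improves, which this single-snapshot sketch does not show.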

Role in the Deadly Triad

Bootstrapping is one of the three elements of the deadly triad. Combined with Function Approximation and off-policy learning, it can make value estimates unstable and cause them to diverge.
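A tiny sketch of how the three elements interact, modelled on the well-known two-state "$w \to 2w$" illustration. The assumptions: a single weight `w`, linear features 1 and 2 for the two states (function approximation), a target built from the next state's estimate (bootstrapping), and updates applied only to the one transition between them, as if the update distribution ignored where the policy actually goes (off-policy). Step size and discount are illustrative choices.

```python
def semi_gradient_td_step(w, alpha=0.1, gamma=0.99):
    """One semi-gradient TD(0) update on the single transition s -> s'.

    Linear function approximation with one shared weight:
    v(s) = w * 1 and v(s') = w * 2, with reward 0 on the transition.
    """
    x_s, x_next, reward = 1.0, 2.0, 0.0
    td_error = reward + gamma * (w * x_next) - w * x_s  # bootstrapped target
    return w + alpha * td_error * x_s


w = 1.0
history = [w]
for _ in range(100):
    # Only this transition is ever updated; s' never gets its own correction.
    w = semi_gradient_td_step(w)
    history.append(w)

print(f"w after 100 updates: {history[-1]:.1f}")
```

Each update multiplies `w` by $1 + \alpha(2\gamma - 1)$, so the weight grows without bound whenever $\gamma > 0.5$. Replacing the bootstrapped target with an actual return, or updating all transitions in proportion to how often the policy visits them, removes this particular failure.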

Appears In

RL-L04 - Temporal Difference Learning
RL-L05 - Tabular to Approximation
