Optimal Policy

Definition

Optimal Policy ($π_{*}$)

A policy $π$ is defined to be better than or equal to a policy $π'$ if its expected return is greater than or equal to that of $π'$ in every state, that is, $π \geq π'$ if and only if $v_{π}(s) \geq v_{π'}(s)$ for all $s \in S$. An optimal policy $π_{*}$ is any policy that is better than or equal to all other policies.
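The comparison in this definition is state by state, so it only gives a partial ordering: two policies can each be better in a different state. The minimal sketch below illustrates this. The function name `better_or_equal` and the value tables are hypothetical, chosen only for illustration.

```python
def better_or_equal(v_a, v_b, tol=1e-12):
    """Return True if v_a[s] >= v_b[s] for every state s (up to a small tolerance)."""
    return all(v_a[s] >= v_b[s] - tol for s in v_b)


# Hypothetical state-value functions for three policies over two states.
v_pi = {"s0": 1.0, "s1": 2.0}
v_pi_prime = {"s0": 0.5, "s1": 2.0}
v_other = {"s0": 2.0, "s1": 1.0}

print(better_or_equal(v_pi, v_pi_prime))  # True: at least as good in every state
print(better_or_equal(v_pi, v_other))     # False: worse in s0
print(better_or_equal(v_other, v_pi))     # False: worse in s1, so neither dominates
```

An optimal policy is one for which this check succeeds against every other policy.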

Key Properties

Existence: At least one optimal policy always exists for any finite Markov Decision Process (MDP).
Shared Value Function: All optimal policies share the same optimal state-value function $v_{*}$ and optimal action-value function $q_{*}$ .
Greedy Selection: Once $q_{*}$ is known, an optimal policy can be found by being greedy with respect to it: $π_{*}(s) = \arg\max_{a} q_{*}(s, a)$
Uniqueness: While the value functions $v_{*}$ and $q_{*}$ are unique, the optimal policy $π_{*}$ may not be (e.g., if multiple actions share the same maximum value).
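The greedy-selection and uniqueness properties can be seen together in a small sketch. It assumes $q_{*}$ is already available as a nested dictionary; the table values and the helper name `greedy_actions` are made up for illustration and do not come from any particular environment.

```python
def greedy_actions(q, state, tol=1e-9):
    """Return every action whose value is within tol of the maximum in this state."""
    best = max(q[state].values())
    return [a for a, value in q[state].items() if best - value <= tol]


# Hypothetical optimal action values q_*(s, a).
q_star = {
    "s0": {"left": 1.0, "right": 3.0},
    "s1": {"left": 2.0, "right": 2.0},  # tie: both actions are optimal
}

print(greedy_actions(q_star, "s0"))  # ['right']
print(greedy_actions(q_star, "s1"))  # ['left', 'right']

# One deterministic optimal policy: break ties by taking the first greedy action.
policy = {s: greedy_actions(q_star, s)[0] for s in q_star}
print(policy)  # {'s0': 'right', 's1': 'left'}
```

Choosing `right` in `s1` instead would give a different policy with the same value function, which is why the optimal policy need not be unique.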

Mathematical Relation

Relationship to Optimal Value Function

$v_{*}(s) = \max_{π} v_{π}(s) \quad \forall s \in S$

$q_{*}(s, a) = \max_{π} q_{π}(s, a) \quad \forall s \in S, a \in A$

The Bellman Optimality Equation for $v_{*}$ is: $v_{*}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) [r + γ v_{*}(s')]$
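One common way to solve this equation numerically is value iteration, which applies the right-hand side as an update until the values stop changing. The sketch below runs it on a hypothetical two-state MDP: the states, actions, rewards, and the discount $γ = 0.9$ are all assumptions made for illustration, and the transitions happen to be deterministic, although the backup handles general probabilities.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical MDP: P[s][a] is a list of (probability, next_state, reward) outcomes.
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}


def backup(v, s, a):
    """Expected one-step return: sum over (s', r) of p * (r + gamma * v[s'])."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[s][a])


def value_iteration(tol=1e-10):
    """Repeat the Bellman optimality backup until the largest change is below tol."""
    v = {s: 0.0 for s in P}
    while True:
        new_v = {s: max(backup(v, s, a) for a in P[s]) for s in P}
        if max(abs(new_v[s] - v[s]) for s in P) < tol:
            return new_v
        v = new_v


v_star = value_iteration()
# Greedy policy with respect to the converged values.
pi_star = {s: max(P[s], key=lambda a: backup(v_star, s, a)) for s in P}

print({s: round(x, 6) for s, x in v_star.items()})  # {'A': 19.0, 'B': 20.0}
print(pi_star)  # {'A': 'go', 'B': 'stay'}
```

The result can be checked by hand: staying in `B` forever earns $2 / (1 - 0.9) = 20$, and moving from `A` to `B` earns $1 + 0.9 \cdot 20 = 19$, which beats staying in `A`.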

Intuition

The Ceiling of Performance

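The optimal policy represents the best possible way to behave in an environment: no other policy achieves a higher expected return from any state. Reinforcement Learning algorithms search for it either through its value function (Q-Learning estimates $q_{*}$) or directly in policy space (REINFORCE adjusts policy parameters by gradient ascent on expected return).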

Connections

Attained when solving: Bellman Equation (optimality version)
Goal of: Q-Learning (approximates $q_{*}$ )
Foundation for: MDP theory
