Markov Decision Process (MDP)
Definition
Markov Decision Process
A Markov Decision Process is a mathematical framework for modeling sequential decision-making problems where outcomes are partly random and partly under the control of a decision-maker (agent). An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$.
Components:
- $\mathcal{S}$ — State space: the set of all possible states
- $\mathcal{A}$ (or $\mathcal{A}(s)$) — Action space: the set of all possible actions (may depend on state)
- $p(s', r \mid s, a)$ — Dynamics function: probability of transitioning to state $s'$ with reward $r$, given state $s$ and action $a$
- $R_t \in \mathcal{R} \subset \mathbb{R}$ or $r(s, a)$ — Reward signal: immediate numerical feedback
- $\gamma \in [0, 1]$ — Discount Factor: controls the importance of future rewards
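The components above can be sketched as a small tabular MDP. This is a minimal illustration, assuming a hypothetical two-state, two-action example; the state names, actions, and probabilities are invented for demonstration:

```python
# Hypothetical 2-state, 2-action tabular MDP.
# The dynamics p(s', r | s, a) are stored as a dict mapping (s, a)
# to a list of (next_state, reward, probability) triples.
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]
GAMMA = 0.9  # discount factor

dynamics = {
    ("s0", "left"):  [("s0", 0.0, 1.0)],
    ("s0", "right"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "left"):  [("s0", 0.0, 1.0)],
    ("s1", "right"): [("s1", 2.0, 1.0)],
}

# Sanity check: for each (s, a), p(s', r | s, a) must sum to 1.
for sa, outcomes in dynamics.items():
    assert abs(sum(p for _, _, p in outcomes) - 1.0) < 1e-9
```

Storing the full four-argument dynamics (rather than just transition probabilities) keeps all the information needed to derive the other quantities in this note.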
The Markov Property
Markov Property
The future is conditionally independent of the past given the present state:
$\Pr\{S_{t+1}, R_{t+1} \mid S_t, A_t\} = \Pr\{S_{t+1}, R_{t+1} \mid S_0, A_0, \ldots, S_t, A_t\}$
The state captures all relevant information from the history.
Why Markov Matters
“The state is a sufficient statistic of history.” If you know the current state, knowing how you got there doesn’t give you any additional information about what will happen next. This is what makes MDPs computationally tractable — you don’t need to store the entire history.
Dynamics Function
The dynamics function completely characterizes the environment:
$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$
From $p$, we can derive everything else:
Derived Quantities from Dynamics
State-transition probabilities: $p(s' \mid s, a) = \sum_{r} p(s', r \mid s, a)$
Expected reward for a state-action pair: $r(s, a) = \sum_{r} r \sum_{s'} p(s', r \mid s, a)$
Expected reward for a state-action-next-state triple: $r(s, a, s') = \sum_{r} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}$
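These three derived quantities are straightforward sums over the four-argument dynamics. A short sketch, using a hypothetical one-transition dynamics table stored as `(s, a) -> [(next_state, reward, probability), ...]`:

```python
# Deriving p(s'|s,a), r(s,a), and r(s,a,s') from the dynamics p(s',r|s,a).
# The dynamics table below is a made-up single (s, a) pair for illustration.
dynamics = {
    ("s0", "go"): [("s1", 1.0, 0.7), ("s0", 0.0, 0.3)],
}

def trans_prob(s, a, s_next):
    """State-transition probability: p(s'|s,a) = sum over r of p(s',r|s,a)."""
    return sum(p for s2, _, p in dynamics[(s, a)] if s2 == s_next)

def expected_reward(s, a):
    """Expected reward: r(s,a) = sum over s',r of r * p(s',r|s,a)."""
    return sum(r * p for _, r, p in dynamics[(s, a)])

def expected_reward_triple(s, a, s_next):
    """r(s,a,s') = sum over r of r * p(s',r|s,a) / p(s'|s,a)."""
    num = sum(r * p for s2, r, p in dynamics[(s, a)] if s2 == s_next)
    return num / trans_prob(s, a, s_next)

print(trans_prob("s0", "go", "s1"))        # 0.7
print(expected_reward("s0", "go"))          # 1.0*0.7 + 0.0*0.3 = 0.7
print(expected_reward_triple("s0", "go", "s1"))  # 0.7 / 0.7 = 1.0
```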
Agent–Environment Interface
        ┌─────────┐
   ┌──► │   Env   │ ───┐
   │    └─────────┘    │
   Aₜ              Sₜ₊₁, Rₜ₊₁
   │    ┌─────────┐    │
   └─── │  Agent  │ ◄──┘
        │ (Policy)│
        └─────────┘
At each time step $t$:
- Agent observes state $S_t$
- Agent selects action $A_t$ according to its Policy $\pi(a \mid s)$
- Environment transitions to $S_{t+1}$ and emits reward $R_{t+1}$ according to $p$
- Repeat
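The interaction loop above can be sketched in a few lines of Python. This is a toy illustration, not any particular library's API: `env_step` and `random_policy` are hypothetical stand-ins for the dynamics and the policy, and the episode ends when the (invented) action `"stop"` is chosen:

```python
import random

random.seed(0)  # make the sampled trajectory reproducible

def env_step(state, action):
    """Toy environment: return (next_state, reward, done) — stands in for p(s', r | s, a)."""
    if action == "stop":
        return state, 0.0, True
    return state + 1, 1.0, False

def random_policy(state):
    """Hypothetical policy: pick an action uniformly at random."""
    return random.choice(["go", "stop"])

state, done, trajectory = 0, False, []
while not done:
    action = random_policy(state)                        # agent selects A_t ~ pi(.|S_t)
    next_state, reward, done = env_step(state, action)   # env emits S_{t+1}, R_{t+1}
    trajectory.append((state, action, reward))
    state = next_state
```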
Episodic vs Continuing Tasks
- Episodic tasks: Interaction naturally breaks into episodes with a terminal state (e.g., games, maze navigation). The Return is a finite sum: $G_t = R_{t+1} + R_{t+2} + \cdots + R_T$.
- Continuing tasks: Interaction goes on forever without natural termination (e.g., process control). Requires $\gamma < 1$ for the return $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ to be finite.
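Both cases use the same recursive structure $G_t = R_{t+1} + \gamma G_{t+1}$, which makes the return easy to compute backwards over a finite reward sequence. A minimal sketch with made-up reward lists:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum over k of gamma^k * R_{t+k+1} for a finite episode."""
    g = 0.0
    for r in reversed(rewards):  # work backwards: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

# gamma = 1 is fine for an episodic (finite) sum:
print(discounted_return([1, 1, 1], 1.0))  # 3.0
# discounting shrinks the weight of later rewards:
print(discounted_return([1, 1, 1], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

For a continuing task with bounded rewards $|R| \le R_{\max}$, the same geometric series argument bounds the return by $R_{\max} / (1 - \gamma)$, which is why $\gamma < 1$ is required there.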
Key Properties
- MDPs provide the theoretical foundation for all of RL
- Dynamic Programming methods require knowing $p(s', r \mid s, a)$ explicitly (model-based)
- Monte Carlo Methods and Temporal Difference Learning learn without knowing $p$ (model-free)
- The optimal solution is found via the Bellman Optimality Equation
Connections
- Generalizes: Multi-Armed Bandit (bandit = 1-state MDP)
- Foundation for: Value Function, Bellman Equation, Policy, Dynamic Programming
- Extended by: POMDP (partial observability), Semi-MDPs, Factored MDPs