Markov Decision Process (MDP)
Definition
Markov Decision Process
A Markov Decision Process is a mathematical framework for modeling sequential decision-making problems where outcomes are partly random and partly under the control of a decision-maker (agent). An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$.
Components:
- $\mathcal{S}$ — State space: the set of all possible states
- $\mathcal{A}$ (or $\mathcal{A}(s)$) — Action space: the set of all possible actions (may depend on state)
- $p(s', r \mid s, a)$ — Dynamics function: probability of transitioning to state $s'$ with reward $r$, given state $s$ and action $a$
- $R_t \in \mathcal{R} \subset \mathbb{R}$ or $r(s, a)$ — Reward signal: immediate numerical feedback
- $\gamma \in [0, 1]$ — Discount Factor: controls the importance of future rewards
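The components above can be sketched as a small tabular MDP. This is a minimal illustration, assuming a hypothetical two-state, two-action example; the state names, actions, and probabilities are invented for demonstration:

```python
# Hypothetical 2-state, 2-action tabular MDP.
# The dynamics p(s', r | s, a) are stored as a dict mapping (s, a)
# to a list of (next_state, reward, probability) triples.
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]
GAMMA = 0.9  # discount factor

dynamics = {
    ("s0", "left"):  [("s0", 0.0, 1.0)],
    ("s0", "right"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "left"):  [("s0", 0.0, 1.0)],
    ("s1", "right"): [("s1", 2.0, 1.0)],
}

# Sanity check: for each (s, a), p(s', r | s, a) must sum to 1.
for sa, outcomes in dynamics.items():
    assert abs(sum(p for _, _, p in outcomes) - 1.0) < 1e-9
```

Storing the full four-argument dynamics (rather than just transition probabilities) keeps all the information needed to derive the other quantities in this note.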
The Markov Property
Markov Property
The future is conditionally independent of the past given the present state:
$\Pr\{S_{t+1}, R_{t+1} \mid S_t, A_t\} = \Pr\{S_{t+1}, R_{t+1} \mid S_0, A_0, \ldots, S_t, A_t\}$
The state captures all relevant information from the history.
Why Markov Matters
“The state is a sufficient statistic of history.” If you know the current state, knowing how you got there doesn’t give you any additional information about what will happen next. This is what makes MDPs computationally tractable — you don’t need to store the entire history.
Dynamics Function
The dynamics function completely characterizes the environment:
$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$
From $p$, we can derive everything else:
Derived Quantities from Dynamics
State-transition probabilities: $p(s' \mid s, a) = \sum_{r} p(s', r \mid s, a)$
Expected reward for a state-action pair: $r(s, a) = \sum_{r} r \sum_{s'} p(s', r \mid s, a)$
Expected reward for a state-action-next-state triple: $r(s, a, s') = \sum_{r} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}$
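These three derived quantities are straightforward sums over the four-argument dynamics. A short sketch, using a hypothetical one-transition dynamics table stored as `(s, a) -> [(next_state, reward, probability), ...]`:

```python
# Deriving p(s'|s,a), r(s,a), and r(s,a,s') from the dynamics p(s',r|s,a).
# The dynamics table below is a made-up single (s, a) pair for illustration.
dynamics = {
    ("s0", "go"): [("s1", 1.0, 0.7), ("s0", 0.0, 0.3)],
}

def trans_prob(s, a, s_next):
    """State-transition probability: p(s'|s,a) = sum over r of p(s',r|s,a)."""
    return sum(p for s2, _, p in dynamics[(s, a)] if s2 == s_next)

def expected_reward(s, a):
    """Expected reward: r(s,a) = sum over s',r of r * p(s',r|s,a)."""
    return sum(r * p for _, r, p in dynamics[(s, a)])

def expected_reward_triple(s, a, s_next):
    """r(s,a,s') = sum over r of r * p(s',r|s,a) / p(s'|s,a)."""
    num = sum(r * p for s2, r, p in dynamics[(s, a)] if s2 == s_next)
    return num / trans_prob(s, a, s_next)

print(trans_prob("s0", "go", "s1"))        # 0.7
print(expected_reward("s0", "go"))          # 1.0*0.7 + 0.0*0.3 = 0.7
print(expected_reward_triple("s0", "go", "s1"))  # 0.7 / 0.7 = 1.0
```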
Agent–Environment Interface
        ┌─────────┐
   ┌──► │   Env   │ ───┐
   │    └─────────┘    │
   Aₜ              Sₜ₊₁, Rₜ₊₁
   │    ┌─────────┐    │
   └─── │  Agent  │ ◄──┘
        │ (Policy)│
        └─────────┘
At each time step $t$:
- Agent observes state $S_t$
- Agent selects action $A_t$ according to its Policy $\pi(a \mid s)$
- Environment transitions to $S_{t+1}$ and emits reward $R_{t+1}$ according to $p$
- Repeat
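The interaction loop above can be sketched in a few lines of Python. This is a toy illustration, not any particular library's API: `env_step` and `random_policy` are hypothetical stand-ins for the dynamics and the policy, and the episode ends when the (invented) action `"stop"` is chosen:

```python
import random

random.seed(0)  # make the sampled trajectory reproducible

def env_step(state, action):
    """Toy environment: return (next_state, reward, done) — stands in for p(s', r | s, a)."""
    if action == "stop":
        return state, 0.0, True
    return state + 1, 1.0, False

def random_policy(state):
    """Hypothetical policy: pick an action uniformly at random."""
    return random.choice(["go", "stop"])

state, done, trajectory = 0, False, []
while not done:
    action = random_policy(state)                        # agent selects A_t ~ pi(.|S_t)
    next_state, reward, done = env_step(state, action)   # env emits S_{t+1}, R_{t+1}
    trajectory.append((state, action, reward))
    state = next_state
```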
Episodic vs Continuing Tasks
- Episodic tasks: Interaction naturally breaks into episodes with a terminal state (e.g., games, maze navigation). The Return is a finite sum: $G_t = R_{t+1} + R_{t+2} + \cdots + R_T$.
- Continuing tasks: Interaction goes on forever without natural termination (e.g., process control). Requires $\gamma < 1$ for the return $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ to be finite.
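Both cases use the same recursive structure $G_t = R_{t+1} + \gamma G_{t+1}$, which makes the return easy to compute backwards over a finite reward sequence. A minimal sketch with made-up reward lists:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum over k of gamma^k * R_{t+k+1} for a finite episode."""
    g = 0.0
    for r in reversed(rewards):  # work backwards: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

# gamma = 1 is fine for an episodic (finite) sum:
print(discounted_return([1, 1, 1], 1.0))  # 3.0
# discounting shrinks the weight of later rewards:
print(discounted_return([1, 1, 1], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

For a continuing task with bounded rewards $|R| \le R_{\max}$, the same geometric series argument bounds the return by $R_{\max} / (1 - \gamma)$, which is why $\gamma < 1$ is required there.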
Key Properties
- MDPs provide the theoretical foundation for all of RL
- Dynamic Programming methods require knowing $p(s', r \mid s, a)$ explicitly (model-based)
- Monte Carlo Methods and Temporal Difference Learning learn without knowing $p$ (model-free)
- The optimal solution is found via the Bellman Optimality Equation
Connections
- Generalizes: Multi-Armed Bandit (bandit = 1-state MDP)
- Foundation for: Value Function, Bellman Equation, Policy, Dynamic Programming
- Extended by: POMDP (partial observability), Semi-MDPs, Factored MDPs