POMDP (Partially Observable Markov Decision Process)
A Partially Observable Markov Decision Process extends the MDP framework to settings where the agent cannot directly observe the true state of the environment. Instead, the agent receives observations that provide partial or noisy information about the underlying state.
Formal Definition
A POMDP is defined by the tuple $(\mathcal{X}, \mathcal{A}, \mathcal{O}, T, R, Z, \gamma)$:

| Component | Description |
|---|---|
| $\mathcal{X}$ | Set of (hidden/latent) states |
| $\mathcal{A}$ | Set of actions |
| $\mathcal{O}$ | Set of observations |
| $T(x' \mid x, a)$ | Transition function: probability of next state $x'$ given state $x$ and action $a$ |
| $R(x, a)$ | Reward function |
| $Z(o \mid x', a)$ | Observation function: probability of observing $o$ after action $a$ leads to state $x'$ |
| $\gamma$ | Discount factor |
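To make the tuple concrete, here is a minimal encoding of the classic two-door "tiger" problem. The array layout (`T[a, x, x']`, `Z[a, x', o]`, `R[x, a]`) is one illustrative choice, not a fixed convention, and the reward numbers are those commonly used for this example.

```python
import numpy as np

# Tiny concrete POMDP: the classic "tiger" problem (illustrative encoding).
# States:       0 = tiger-left, 1 = tiger-right
# Actions:      0 = listen,     1 = open-left, 2 = open-right
# Observations: 0 = hear-left,  1 = hear-right

n_states, n_actions, n_obs = 2, 3, 2

# T[a, x, x']: transition probabilities.
T = np.zeros((n_actions, n_states, n_states))
T[0] = np.eye(n_states)   # listen: the tiger stays where it is
T[1] = 0.5                # open-left: problem resets, tiger placed uniformly
T[2] = 0.5                # open-right: same

# Z[a, x', o]: observation probabilities given action and *next* state.
# Listening is 85% accurate; opening a door gives an uninformative signal.
Z = np.full((n_actions, n_states, n_obs), 0.5)
Z[0, 0] = [0.85, 0.15]    # tiger left  -> usually hear it on the left
Z[0, 1] = [0.15, 0.85]    # tiger right -> usually hear it on the right

# R[x, a]: expected immediate reward.
R = np.array([
    # listen  open-left  open-right
    [-1.0,   -100.0,     +10.0],   # tiger behind the left door
    [-1.0,    +10.0,    -100.0],   # tiger behind the right door
])

gamma = 0.95                       # discount factor
```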
The Challenge
Why Partial Observability Is Hard
In an MDP, the current state tells you everything you need to make an optimal decision. In a POMDP, the observation doesn’t — two different underlying states might produce the same observation. The agent must reason about what state it might be in based on its history of observations and actions.
- History: $h_t = (o_0, a_0, o_1, a_1, \dots, a_{t-1}, o_t)$ contains all available information
- The history grows without bound, so we need a compact sufficient statistic
Approaches to Handle Partial Observability
1. Belief State
Maintain a probability distribution over hidden states: $b_t(x) = P(x_t = x \mid h_t)$. The belief is a compact sufficient statistic for the history, and the resulting belief state MDP is fully observable.
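The belief can be maintained recursively with a Bayes-filter update. A minimal sketch, reusing the `T` and `Z` arrays from the tiger-problem encoding above:

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """Bayes filter: b'(x') proportional to Z(o | x', a) * sum_x T(x' | x, a) * b(x)."""
    predicted = b @ T[a]                    # predict: sum_x b(x) * T(x' | x, a)
    unnormalized = Z[a, :, o] * predicted   # correct: weight by observation likelihood
    return unnormalized / unnormalized.sum()

# Example: start uniform, listen (a=0), hear the tiger on the left (o=0).
b = np.array([0.5, 0.5])
b = belief_update(b, a=0, o=0, T=T, Z=Z)    # -> array([0.85, 0.15])
```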
2. Predictive State Representation
Define internal state as predictions about future observations rather than beliefs about hidden states.
3. Approximate Methods
Use recent observations as state (frame stacking) or recurrent networks (Deep Recurrent Q-Learning).
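A minimal sketch of frame stacking; `FrameStack` is an illustrative helper, and a DRQN-style agent would instead feed single observations through a recurrent layer:

```python
from collections import deque
import numpy as np

class FrameStack:
    """Approximate the hidden state with the last k observations."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)   # the oldest frame is dropped automatically

    def reset(self, obs):
        # At episode start, fill the stack by repeating the first observation.
        for _ in range(self.k):
            self.frames.append(obs)
        return np.stack(self.frames)

    def step(self, obs):
        # Append the newest observation; the deque discards the oldest.
        self.frames.append(obs)
        return np.stack(self.frames)
```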
Markov Criterion for Internal State
Markov Criterion
An internal state representation $s_t = f(h_t)$ is Markov if, for all histories $h$ and $h'$:

$$f(h) = f(h') \implies P(o \mid h, a) = P(o \mid h', a) \quad \text{for every action } a \text{ and observation } o$$

That is, if two histories map to the same internal state, they must predict the same future observations.
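The belief state satisfies this criterion: the next-observation distribution depends on the history only through the belief. A small sketch, again assuming the tiger-problem arrays:

```python
def next_obs_probs(b, a, T, Z):
    """P(o | b, a) = sum_{x'} Z(o | x', a) * sum_x T(x' | x, a) * b(x)."""
    predicted = b @ T[a]    # distribution over next states x'
    return predicted @ Z[a] # distribution over observations o

# Any two histories that induce the same belief b yield the same
# next-observation distribution, so the history -> belief mapping is Markov.
```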
Key Properties
- POMDPs are strictly harder than MDPs: every finite MDP has a deterministic optimal policy, whereas the best memoryless POMDP policy may have to be stochastic
- The belief state MDP converts a POMDP into a (continuous-state) MDP
- In practice, many RL systems ignore partial observability and treat observations as states
Connections
- Generalizes Markov Decision Process — MDP is a POMDP where observations = states
- Solved via Belief State (exact) or approximate methods
- Deep Recurrent Q-Learning — deep RL approach to POMDPs
- Loosely related to Importance Sampling, in that both reason probabilistically about quantities that cannot be observed or sampled directly