RL Lecture 13 - Partial Observability
Overview & Motivation
So far in this course, we have assumed the agent interacts with a Markov Decision Process in which it has access to the true state at every time step. All value-based, policy-based, and model-based methods we studied relied on this assumption. This lecture asks: what if the agent cannot observe the full state?
In many real-world problems, the agent receives only an observation $O_t$ that provides incomplete or noisy information about the latent state $S_t$. This is called partial observability, and the resulting problem is a POMDP (Partially Observable Markov Decision Process).
Three common sources of partial observability:
- Aliasing: Different states look identical (“I am in front of a door but I don’t know which one”)
- Noise: Observations are corrupted (“My GPS tells me where I am, but it can be off a bit”)
- Missing information: Some state variables are simply unobserved (“Which direction is the ball going?” from a single frame)
The core tension: how to construct a compact internal state from the history of interactions that captures enough information for good decision-making.
Observations vs. States
Definition
In a POMDP, the agent does not observe the true latent state $S_t$ directly. Instead, it receives an observation $O_t$ generated by an observation function $p(O_t \mid S_t)$.
The world is now described by:
- Transitions: $p(S_{t+1} \mid S_t, A_t)$ (same as MDP)
- Rewards: $p(R_{t+1} \mid S_t, A_t)$ (same as MDP)
- Observation function: $p(O_t \mid S_t)$ (new!)
Terminology
| Symbol | Meaning |
|---|---|
| $A_t$, $R_t$ | Same as in MDP |
| $S_t$ | Latent "true" state |
| $O_t$ | Observation |
| $Z_t$ | Internal representation; represents knowledge about $S_t$. Used by policy and value function |
Tip
In a standard MDP, all of these coincide: $Z_t = O_t = S_t$. Partial observability is precisely the setting where they diverge.
Histories and Markov Functions
For an optimal policy, we should use all information available. The complete history is:
Formula
History: $H_t = O_0, A_0, O_1, A_1, \ldots, A_{t-1}, O_t$. (Unless stated otherwise, rewards are assumed predictable from $O_t$, so they need not be stored separately.)
There exists an optimal policy of the form $\pi(A_t \mid H_t)$, but full histories are impractical to work with. Instead, we seek a feature function $Z_t = f(H_t)$ that summarizes the history compactly.
Desired Properties of $f$
- Compactness: $Z_t = f(H_t)$ should be a low-dimensional summary of the history
- Markov property: $Z_t$ should capture all relevant information for prediction
Definition
Markov Criterion for Internal State: A function $f$ of the history is a Markov state if: $p(O_{t+1} = o \mid f(H_t) = f(h), A_t = a) = p(O_{t+1} = o \mid H_t = h, A_t = a)$ for all observations $o$, actions $a$, and histories $h$.
Intuition
If two histories produce the same internal state, they must predict the same future observations. This is the generalization of the Markov property to partially observable settings.
Agent Architecture
The general architecture under partial observability separates the world (which has latent state $S_t$ and produces observations $O_t$ and rewards $R_t$ given actions $A_t$) from the agent (which maintains an internal state $Z_t$ via an update function $Z_{t+1} = u(Z_t, A_t, O_{t+1})$, and uses $Z_t$ for its policy and value function).
The central question is: what to choose for the state update function $u$?
Approach 1: Belief States
Attempt 0 — Full History as State
Using $Z_t = H_t$ (the full history) as the internal state:
Advantages:
- Extremely simple
- Clearly a Markov function (trivially satisfies the criterion)
Disadvantages:
- Not compact; must remember all of history
- Tabular policy needs to represent all possible sequences
- Impossible in continuing problems
- Inefficient in episodic problems (e.g., in the Tiger problem, the history HL, HR gives the same information as HR, HL, yet is stored as a distinct state)
Belief State Definition
Definition
Belief State: The belief state is the posterior probability of being in latent state $s$ given the history: $b_t(s) = p(S_t = s \mid H_t)$. The internal state $Z_t = b_t$ is this vector of probabilities over all possible latent states.
Bayesian Belief Update
Formula
Belief State Update (using Bayes' Theorem): $b_{t+1}(s') = \dfrac{p(O_{t+1} \mid s')\, \sum_{s} p(s' \mid s, A_t)\, b_t(s)}{\sum_{s''} p(O_{t+1} \mid s'')\, \sum_{s} p(s'' \mid s, A_t)\, b_t(s)}$ where:
- $p(O_{t+1} \mid s')$ is the observation model
- $p(s' \mid s, A_t)$ is the transition model
- $b_t(s)$ is the old belief (current internal state)
- The denominator is a normalizer, equal to $p(O_{t+1} \mid b_t, A_t)$
Intuition
The belief update is recursive: given the current belief $b_t$, the action $A_t$, and the new observation $O_{t+1}$, we can compute $b_{t+1}$ without retaining the entire history. This is the key advantage of belief states.
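As a concrete sketch (not from the slides), the recursive update can be written as a short function over tabular models; the dictionary-based model format below is an illustrative choice:

```python
# A concrete sketch of the recursive belief update for a tabular POMDP.
# The dictionary-based model format is an illustrative choice, not from
# the slides; b[s] = p(S_t = s | H_t).

def belief_update(b, a, o, T, Z):
    """One step b_t -> b_{t+1} after taking action a and observing o.

    b: dict state -> probability (current belief)
    T: dict (s, a) -> dict s' -> probability (transition model)
    Z: dict s' -> dict o -> probability (observation model)
    """
    unnormalized = {}
    for s_next in Z:
        # Predict: sum_s p(s' | s, a) * b(s)
        predicted = sum(T[(s, a)].get(s_next, 0.0) * b[s] for s in b)
        # Correct: weight by the observation likelihood p(o | s')
        unnormalized[s_next] = Z[s_next].get(o, 0.0) * predicted
    norm = sum(unnormalized.values())  # the normalizer p(o | b, a)
    return {s: p / norm for s, p in unnormalized.items()}

# Tiger problem models: listening never moves the tiger.
T = {("left", "listen"): {"left": 1.0}, ("right", "listen"): {"right": 1.0}}
Z = {"left": {"HL": 0.85, "HR": 0.15}, "right": {"HL": 0.15, "HR": 0.85}}
b1 = belief_update({"left": 0.5, "right": 0.5}, "listen", "HL", T, Z)
print(b1)  # belief in "left" rises to 0.85 after hearing HL once
```

Note how the function only touches $b_t$, $A_t$, and $O_{t+1}$ — exactly the recursive structure described above.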
The Tiger Problem
Example
Tiger Problem (Kaelbling et al., 1998; thanks F. Oliehoek & S. Whiteson):
- Two doors: a tiger is behind one, treasure behind the other
- Actions: Open Left (OL), Open Right (OR), Listen (L)
- Rewards: Opening the door with treasure gives $+10$; opening the tiger door gives $-100$; listening costs $-1$
- Observations: Listening gives a noisy signal: the agent hears correctly 85% of the time
- If tiger is on the left: $p(\text{HL} \mid \text{left}) = 0.85$, $p(\text{HR} \mid \text{left}) = 0.15$
- If tiger is on the right: $p(\text{HR} \mid \text{right}) = 0.85$, $p(\text{HL} \mid \text{right}) = 0.15$
- In this problem, the transition model is deterministic: listening does not change the latent state ($p(s' \mid s, \text{Listen}) = 1$ if $s' = s$, $0$ otherwise)
Internal State for Tiger Problem: In the Tiger problem, hearing HL, HR gives the same information as hearing HR, HL: only the counts matter, not the order. We only need to track two numbers: $n_{\text{HL}}$ and $n_{\text{HR}}$. The vector $(n_{\text{HL}}, n_{\text{HR}})$ is a useful Markov state for this problem. Since all histories with this state have the same predictions, this is a Markov function. Adding an observation just means incrementing the appropriate total.
Tiger Problem: Belief Update Walkthrough
Initial belief ($t = 0$): $b_0 = (0.5, 0.5)$ (50% chance tiger is left).
After one Listen, hearing HL ($O_1 = \text{HL}$):
Using the belief update formula (with identity transition since listening does not move the tiger): $b_1(\text{left}) = \dfrac{p(\text{HL} \mid \text{left})\, b_0(\text{left})}{p(\text{HL} \mid \text{left})\, b_0(\text{left}) + p(\text{HL} \mid \text{right})\, b_0(\text{right})} = \dfrac{0.85 \times 0.5}{0.85 \times 0.5 + 0.15 \times 0.5} = 0.85$
So after hearing left once, $b_1 = (0.85, 0.15)$.
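The walkthrough can be checked numerically. Writing $b$ for $p(\text{tiger left})$, the update collapses to a scalar application of Bayes' rule (valid because listening does not move the tiger):

```python
# Numeric check of the belief walkthrough, writing b for p(tiger left).
# Because listening does not move the tiger, the update collapses to a
# scalar application of Bayes' rule.

def listen_update(b, heard):
    """Posterior p(tiger left) after one Listen observing 'HL' or 'HR'."""
    p_o_left = 0.85 if heard == "HL" else 0.15   # p(o | tiger left)
    p_o_right = 0.15 if heard == "HL" else 0.85  # p(o | tiger right)
    return b * p_o_left / (b * p_o_left + (1 - b) * p_o_right)

b1 = listen_update(0.5, "HL")     # 0.85 after one HL
b2 = listen_update(b1, "HL")      # ~0.9698 after a second HL
b_back = listen_update(b1, "HR")  # 0.5 again: HL and HR cancel exactly
print(b1, b2, b_back)
```

The last line illustrates why the counts $(n_{\text{HL}}, n_{\text{HR}})$ suffice as an internal state: one HL and one HR cancel, regardless of order.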
Tiger Problem: Value Analysis at Different Belief States
At initial belief $b = (0.5, 0.5)$:
| Action | Expected Reward |
|---|---|
| Open Left (OL) | $0.5 \times (-100) + 0.5 \times 10 = -45$ |
| Open Right (OR) | $0.5 \times 10 + 0.5 \times (-100) = -45$ |
| Listen (L) | $-1$ (plus future value) |
At belief $b = (0.85, 0.15)$ (after one listen hearing left):
| Action | Expected Reward |
|---|---|
| Open Left (OL) | $0.85 \times (-100) + 0.15 \times 10 = -83.5$ |
| Open Right (OR) | $0.85 \times 10 + 0.15 \times (-100) = -6.5$ |
| Listen (L) | $-1$ (plus future value) |
Probability of next observations at $b = (0.85, 0.15)$: $p(\text{HL}) = 0.85 \times 0.85 + 0.15 \times 0.15 = 0.745$ and $p(\text{HR}) = 0.85 \times 0.15 + 0.15 \times 0.85 = 0.255$.
Warning
Note that at $b = (0.85, 0.15)$, hearing HL again occurs with probability $0.745$ and strengthens the belief to $b \approx (0.97, 0.03)$, while hearing HR occurs with probability $0.255$ and exactly cancels the earlier observation, returning the belief to $(0.5, 0.5)$. The observation that confirms the current belief is always the more probable one.
After two listens in the same direction, the belief reaches $b \approx (0.97, 0.03)$:
| Action | Expected Reward |
|---|---|
| Open Left (OL) | $0.97 \times (-100) + 0.03 \times 10 \approx -96.7$ |
| Open Right (OR) | $0.97 \times 10 + 0.03 \times (-100) \approx 6.7$ |
| Listen (L) | $-1$ (plus future value) |
At an even stronger belief $b = (0.99, 0.01)$:
- $\mathbb{E}[R \mid \text{OR}] = 0.99 \times 10 + 0.01 \times (-100) = 8.9$, $\mathbb{E}[R \mid \text{OL}] = -98.9$
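The expected immediate rewards in the tables above follow from a one-line computation; this sketch assumes the classical Tiger rewards ($+10$ treasure, $-100$ tiger, $-1$ listen):

```python
# Expected immediate reward of each action at belief b = p(tiger left),
# assuming the classical Tiger rewards: +10 treasure, -100 tiger, -1 listen.

def expected_rewards(b):
    return {
        "open_left": b * (-100) + (1 - b) * 10,   # tiger is left with prob b
        "open_right": b * 10 + (1 - b) * (-100),  # treasure is left with prob b
        "listen": -1.0,
    }

r_half = expected_rewards(0.5)   # both doors give -45
r_sure = expected_rewards(0.85)  # open_right improves to -6.5
print(r_half, r_sure)
```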
Tiger Problem: Optimal Q-Values
Planning (Dynamic Programming) in the belief state MDP yields the optimal policy and value function (e.g., via value iteration):
| Belief State | $V^*(b)$ |
|---|---|
| $b = (0.5, 0.5)$ (initial) | |
| $b = (0.85, 0.15)$ (after 1 listen) | |
| $b \approx (0.97, 0.03)$ (after 2 listens same dir.) | |
| Maximum possible | $+10$ |
Intuition
The optimal policy listens until the belief is strong enough, then opens the door away from the tiger. The value increases as the belief becomes more certain, approaching $+10$ (the treasure reward). The belief state MDP is fully observable because we always know what belief state we are in.
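Planning in the belief MDP can be sketched with value iteration on a discretized belief grid. This is an illustrative simplification, not the lecture's exact algorithm: opening a door is treated as terminal, and the discount $\gamma = 0.95$ and grid resolution are arbitrary choices:

```python
# Sketch: value iteration on a discretized Tiger belief MDP. Illustrative
# simplifications, not the lecture's exact setup: opening a door ends the
# episode, and GAMMA = 0.95 plus the 201-point belief grid are arbitrary
# choices; b = p(tiger left).

GAMMA = 0.95
GRID = [i / 200 for i in range(201)]

def listen_update(b, heard):
    pl = 0.85 if heard == "HL" else 0.15  # p(o | tiger left)
    pr = 0.15 if heard == "HL" else 0.85  # p(o | tiger right)
    return b * pl / (b * pl + (1 - b) * pr)

def interp(V, b):
    """Linear interpolation of V between belief grid points."""
    x = b * 200
    i = min(int(x), 199)
    w = x - i
    return (1 - w) * V[i] + w * V[i + 1]

def q_values(V, b):
    p_hl = 0.85 * b + 0.15 * (1 - b)  # p(hear HL | b, Listen)
    q_listen = -1 + GAMMA * (p_hl * interp(V, listen_update(b, "HL"))
                             + (1 - p_hl) * interp(V, listen_update(b, "HR")))
    return {
        "listen": q_listen,
        "open_left": b * (-100) + (1 - b) * 10,   # terminal (simplification)
        "open_right": b * 10 + (1 - b) * (-100),  # terminal (simplification)
    }

V = [0.0] * 201
for _ in range(200):  # value iteration sweeps until numerical convergence
    V = [max(q_values(V, b).values()) for b in GRID]

def greedy(b):
    q = q_values(V, b)
    return max(q, key=q.get)

print(greedy(0.5), greedy(0.99), greedy(0.01))
```

Even under these simplifications the qualitative optimal policy matches the lecture: listen near $b = 0.5$, and open the door away from the tiger once the belief is strong.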
Belief State Approach: Summary
Summary
Belief State Approach (classical POMDP approach):
Advantages:
- Concrete meaning of internal state: probability distribution over latent state
- Relatively compact: $b_t$ has as many dimensions as the latent state space has states (does not grow with history length)
- State can be updated recursively without memorizing history
Disadvantages:
- Underlying observation and transition models are needed
- Underlying models are difficult to learn from data
- Only applicable to discrete state spaces
Approach 2: Predictive State Representations
Motivation
Recall the Markov criterion: $p(O_{t+1} \mid f(H_t), A_t) = p(O_{t+1} \mid H_t, A_t)$.
What if we define the internal state directly as the probability of next observations?
Definition
Predictive State Representation (PSR): Define the internal state as a vector of predictions about future observations: $Z_t(a, o) = p(O_{t+1} = o \mid H_t, A_t = a)$. The full state vector is: $Z_t = \big[\, p(O_{t+1} = o \mid H_t, A_t = a) \,\big]_{a \in \mathcal{A},\, o \in \mathcal{O}}$
By definition, this fulfils the Markov criterion (since equal predictions imply equal future observation distributions).
Longer Tests
We can also consider longer tests, e.g., $\tau = a_1 o_1 a_2 o_2 \ldots a_k o_k$, and define:
Formula
Test Probability: $p(\tau \mid H_t) = p(O_{t+1} = o_1, \ldots, O_{t+k} = o_k \mid H_t, A_t = a_1, \ldots, A_{t+k-1} = a_k)$
It can be proven that for special sets of core tests $\{\tau_1, \ldots, \tau_n\}$, the vector of test probabilities $Z_t = (p(\tau_1 \mid H_t), \ldots, p(\tau_n \mid H_t))$ is a Markov state.
PSR: Tiger Problem Example
Example
In the Tiger problem:
- If we do not know the transition and observation models, we cannot calculate the belief state
- However, all information can be captured by just two tests (or even one, since the two probabilities sum to 1):
- $p(\text{HL} \mid H_t, A_t = \text{Listen})$ — probability of hearing left if we listen
- $p(\text{HR} \mid H_t, A_t = \text{Listen})$ — probability of hearing right if we listen
- These probabilities can be learned from data (e.g., naively with an LSTM classifier, but smarter methods exist)
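A minimal illustration of estimating such a test probability purely from experience: the simulator below (with the 85% observation model) is only used to generate data, and the estimator never touches the latent state:

```python
# Estimating a PSR test probability purely from experience: the
# probability of hearing HL on the next Listen, given we already heard
# HL once. Monte Carlo sketch; the simulator (85% observation model) is
# only used to generate data, never queried by the estimator.
import random

random.seed(0)          # arbitrary seed, for reproducibility
N = 200_000
hits = total = 0
for _ in range(N):
    tiger_left = random.random() < 0.5
    p_hl = 0.85 if tiger_left else 0.15
    first_hl = random.random() < p_hl    # first Listen
    second_hl = random.random() < p_hl   # second Listen
    if first_hl:                         # condition on history (Listen, HL)
        total += 1
        hits += second_hl
estimate = hits / total
print(estimate)  # close to 0.85**2 + 0.15**2 = 0.745
```

The estimate matches the value computed earlier from the belief state, even though no belief or model was used at estimation time.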
PSR: Summary
Summary
Predictive State Representations:
Advantages:
- Test probabilities are learnable from data (no need for known models)
- As compact or more so than belief states
- Can still be updated recursively
Disadvantages:
- Still limited to the tabular setting (though extensions exist)
Approach 3: Approximate Methods
Using Observations Directly
The simplest approximation: use the last observation as internal state: $Z_t = O_t$.
This is equivalent to treating the observation as if it were the full state, using standard RL techniques.
Frame Stacking
A better approximation: use the $k$ most recent observations (and possibly actions) as internal state:
Formula
Frame Stacking: $Z_t = (O_{t-k+1}, O_{t-k+2}, \ldots, O_t)$
Example
The Atari DQN paper (Mnih et al., 2013) used 4 stacked frames as input to the network. This allows the agent to infer velocity (e.g., direction of ball movement) from a sequence of frames.
This can be seen as extracting features from the history.
Advantages:
- Very simple to define and use
Disadvantages:
- Could be very suboptimal if memory of more than $k$ steps is needed
- Potentially not very compact
- Potentially non-Markov
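Frame stacking itself is only a few lines. The padding-by-repetition convention at reset below is a common choice, assumed here rather than taken from the slides:

```python
# Frame stacking in a few lines: the internal state is the k most recent
# observations. Padding by repeating the first observation at reset is a
# common convention, assumed here rather than taken from the slides.
from collections import deque

class FrameStack:
    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)  # oldest frame falls out automatically

    def reset(self, obs):
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(obs)
        return tuple(self.frames)

    def step(self, obs):
        self.frames.append(obs)
        return tuple(self.frames)

stack = FrameStack(4)
z0 = stack.reset("o0")  # ('o0', 'o0', 'o0', 'o0')
z1 = stack.step("o1")   # ('o0', 'o0', 'o0', 'o1')
print(z0, z1)
```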
Relation to State Aggregation
Tip
Treating observations or stacks as internal state is equivalent to state aggregation:
| State Aggregation (MDP) | Partial Observability |
|---|---|
| High-dimensional Markov state $S_t$ | True latent state $S_t$ |
| Aggregation features $\phi(S_t)$ | Observation $O_t$ |
| Value prediction $\hat{v}(\phi(S_t))$ | Value prediction $\hat{v}(O_t)$ |

where the observation $O_t$ plays the same role as the aggregation features $\phi(S_t)$.
A Practical Note
Warning
While using $Z_t = O_t$ seems naive, consider:
- With function approximation, there is typically no guarantee that given state features define a Markov state anyway
- In practice, many RL systems treat the state as Markov even if it really is not, effectively using $Z_t = O_t$
- As long as the system is “close enough” to Markov, this can work well enough, even if not optimal
Deep Recurrent Q-Learning (DRQN)
One insight of DQN was that features can be learned. Can we learn end-to-end in partially observable settings?
Definition
Deep Recurrent Q-Learning (DRQN) (Hausknecht & Stone, 2015): Replace the first fully-connected layer in DQN with an LSTM recurrent layer.
- The LSTM processes a sequence of observations, maintaining internal memory across time steps
- Convolutional layers extract visual features from each observation
- The LSTM layer aggregates information over time, producing an internal state
- Trained using a temporal-difference loss with a target Q-network (as in DQN)
- The network is unrolled over time during training
Two training strategies:
- Bootstrapped: Sample random starting points within an episode and unroll from there
- Sequential: Process episodes sequentially, maintaining hidden state
The internal state $Z_t$ at each time step is the LSTM hidden state, which can in principle depend on the entire history (not just the last $k$ frames).
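The essence of a recurrent internal state can be shown with a tiny fixed-weight RNN cell in pure Python. DRQN uses a learned LSTM for this role; the weights here are illustrative only:

```python
# A tiny fixed-weight RNN cell illustrating the recurrent state update
# Z_t = tanh(Wz Z_{t-1} + Wo O_t). DRQN uses a learned LSTM for this
# role; the weights here are illustrative only.
import math

def rnn_step(z, o, Wz, Wo):
    """One recurrent update of internal state z given observation o."""
    return [math.tanh(sum(Wz[i][j] * z[j] for j in range(len(z)))
                      + sum(Wo[i][j] * o[j] for j in range(len(o))))
            for i in range(len(z))]

Wz = [[0.5, 0.0], [0.0, 0.5]]  # carries a decaying memory of the past state
Wo = [[1.0], [-1.0]]           # injects the current observation
z = [0.0, 0.0]
for o in [[1.0], [0.0], [0.0]]:
    z = rnn_step(z, o, Wz, Wo)
print(z)  # still nonzero: the input from three steps ago persists in Z_t
```

Unlike a $k$-frame stack, nothing here hard-limits how far back an observation can influence the state; the memory horizon is set by the (learned) weights.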
End-to-End Learned States: Summary
Advantages:
- Conceptually simple and ties in to deep learning methodology
- Compared to frame stacking, no fixed $k$; state can depend on the full history
- Can adjust compactness (up to a point)
Disadvantages:
- RNN learning can be tricky in practice (local optima, long training time)
- Potentially non-Markov
Comparison of Approaches
Exact Methods
| Method | Compact? | Markov? | Requires Model? | Learnable? |
|---|---|---|---|---|
| Full History ($Z_t = H_t$) | No | Yes | No | N/A |
| Belief State | Yes (dim = $|\mathcal{S}|$) | Yes | Yes | Hard |
| PSR | Most compact | Yes | No | Yes (tabular) |
Approximate Methods
| Method | Compact? | Markov? | Ease of Use | Weakness |
|---|---|---|---|---|
| $Z_t = O_t$ or Frame Stacking | Medium | No (generally) | Very easy | Loses long-term dependencies |
| End-to-End (DRQN) | Adjustable | No (generally) | Moderate | RNN training tricky; needs much data |
Intuition
There is a fundamental trade-off between:
- Compactness of the internal state
- Markov property (does the state predict the future?)
- Interpretability (what does the state mean?)
- Computational complexity of updates and learning
- Ease of implementation
A Note on Timeouts
Warning
In practice, implementations often have a timeout: if a terminal state is not reached by timestep $T$, force termination. If we set $\gamma = 0$ on the last step (treating the timeout as a true termination), this introduces non-Markovianity (since termination is not predictable from the state alone). The time step can be included as part of the state and observation to restore Markovianity, if a time-dependent policy or value function is desired.
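One way to include the time step, as the note suggests, is a small wrapper that appends the normalized time to each observation. The `reset()`/`step()` interface and the `DummyEnv` below are hypothetical conventions for illustration:

```python
# Including the time step in the observation to restore Markovianity
# under timeouts, as the note suggests. The reset()/step() interface and
# DummyEnv are hypothetical conventions for illustration.

class TimeAwareWrapper:
    def __init__(self, env, max_steps):
        self.env = env
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return (self.env.reset(), self.t / self.max_steps)

    def step(self, action):
        obs, reward, done = self.env.step(action)
        self.t += 1
        if self.t >= self.max_steps:  # forced timeout termination
            done = True
        return (obs, self.t / self.max_steps), reward, done

class DummyEnv:
    def reset(self):
        return 0
    def step(self, action):
        return 0, 0.0, False

env = TimeAwareWrapper(DummyEnv(), max_steps=3)
obs = env.reset()
for _ in range(3):
    obs, reward, done = env.step(0)
print(obs, done)  # ((0, 1.0), True): termination is now visible in the state
```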
Summary & Key Takeaways
Summary
Core Contributions of This Lecture:
Partial Observability: In a POMDP, the agent receives observations $O_t$ rather than true states $S_t$. An internal state $Z_t$ must be extracted from the history of interactions.
Belief States: Define $b_t(s) = p(S_t = s \mid H_t)$ and update via Bayes' rule. This is the classical POMDP approach: Markov, compact, recursively updatable, but requires known models and discrete state spaces.
Tiger Problem: Illustrates belief state dynamics. Starting at $b_0 = (0.5, 0.5)$, listening updates the belief (e.g., to $(0.85, 0.15)$ after one listen). The optimal policy listens until confident enough, then acts. Planning in belief space via Dynamic Programming yields the optimal value at each belief state.
Predictive State Representations: Define internal state as predictions of future observations. Learnable from data, at least as compact as belief states, but limited to tabular settings.
Approximate Methods: Frame stacking (as in Atari DQN) and Deep Recurrent Q-Learning (DRQN, which replaces FC layers with LSTM) offer practical alternatives that trade exactness for scalability.
What you should know:
- What is a state update function and why do we need it?
- What are the advantages and disadvantages of each discussed state update function?
- Belief states: definition, Bayesian update (understand conceptually; you need not reproduce the computation), Tiger problem structure
- PSR: basic idea of defining state via predictions
- Approximate methods: frame stacking, DRQN basics
- What is the relation of partial observability to state aggregation?
New Concepts to Explore
The following concepts are introduced but require deeper study:
- POMDP - Partially Observable Markov Decision Process framework
- Belief State - Posterior distribution over latent states as internal state
- Predictive State Representation - Internal state defined via observation predictions
- Deep Recurrent Q-Learning - LSTM-augmented DQN for partial observability
- Partial Observability - Settings where the agent cannot observe the full state
- LSTM - Long Short-Term Memory networks for sequential processing
- Bayes’ Theorem - Foundation for belief state updates
- State Aggregation - Connection between partial observability and function approximation
References
- Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence.
- Littman, M. L., Sutton, R. S., & Singh, S. (2002). Predictive Representations of State. NIPS.
- Hausknecht, M., & Stone, P. (2015). Deep Recurrent Q-Learning for Partially Observable MDPs. AAAI Fall Symposium.
- Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2013). Playing Atari with Deep Reinforcement Learning. NIPS Deep Learning Workshop.