RL Lecture 13 - Partial Observability

Overview & Motivation

So far in this course, we have assumed the agent interacts with a Markov Decision Process in which it has access to the true state at every time step. All value-based, policy-based, and model-based methods we studied relied on this assumption. This lecture asks: what if the agent cannot observe the full state?

In many real-world problems, the agent receives only an observation O_t that provides incomplete or noisy information about the latent state S_t. This is called partial observability, and the resulting problem is a POMDP (Partially Observable Markov Decision Process).

Three common sources of partial observability:

  • Aliasing: Different states look identical (“I am in front of a door but I don’t know which one”)
  • Noise: Observations are corrupted (“My GPS tells me where I am, but it can be off a bit”)
  • Missing information: Some state variables are simply unobserved (“Which direction is the ball going?” from a single frame)

The core tension: how to construct a compact internal state from the history of interactions that captures enough information for good decision-making.


Observations vs. States

Definition

In a POMDP, the agent does not observe the true latent state S_t directly. Instead, it receives an observation O_t generated by an observation function p(O_t = o | S_t = s, A_{t-1} = a).

The world is now described by:

  • Transitions: p(s' | s, a) (same as MDP)
  • Rewards: E[R_{t+1} | s, a] (same as MDP)
  • Observation function: p(o | s, a) (new!)

Terminology

  • A_t, R_t: action and reward, same as in an MDP
  • S_t: latent “true” state
  • O_t: observation
  • s_t: internal representation; represents the agent’s knowledge about S_t. Used by policy and value function

Tip

In a standard MDP, all of these coincide: S_t = O_t = s_t. Partial observability is precisely the setting where they diverge.


Histories and Markov Functions

For an optimal policy, we should use all information available. The complete history is:

Formula

History: H_t = (O_0, A_0, R_1, O_1, A_1, R_2, ..., A_{t-1}, R_t, O_t). (Unless stated otherwise, rewards are assumed predictable from H_t.)

There exists an optimal policy of the form π(A_t | H_t), but full histories are impractical to work with. Instead, we seek a feature function s_t = f(H_t) that summarizes the history compactly.

Desired Properties of f

  1. Compactness: s_t = f(H_t) should be a low-dimensional summary of the history
  2. Markov property: s_t should capture all relevant information for prediction

Definition

Markov Criterion for Internal State: A function f of the history is a Markov state if: p(O_{t+1} = o | H_t = h, A_t = a) = p(O_{t+1} = o | f(h), A_t = a) for all o, a, and histories h.

Intuition

If two histories produce the same internal state, they must predict the same future observations. This is the generalization of the Markov property to partially observable settings.

Agent Architecture

The general architecture under partial observability separates the world (which has latent state S_t and produces observations O_t and rewards R_t given actions A_t) from the agent (which maintains an internal state s_t via an update function s_t = u(s_{t-1}, A_{t-1}, O_t), and uses s_t for its policy and value function).

The central question is: what to choose for the state update function u?


Approach 1: Belief States

Attempt 0 — Full History as State

Using s_t = H_t (the full history) as the internal state:

Advantages:

  • Extremely simple
  • Clearly a Markov function (trivially satisfies the criterion)

Disadvantages:

  • Not compact; must remember all of history
  • Tabular policy needs to represent all possible sequences
  • Impossible in continuing problems
  • Inefficient in episodic problems (e.g., in the Tiger problem, the history (Listen, HL, Listen, HR) conveys the same information as (Listen, HR, Listen, HL), yet is stored as a distinct state)

Belief State Definition

Definition

Belief State: The belief state is the posterior probability of being in latent state s given the history: b_t(s) = p(S_t = s | H_t). The internal state s_t = b_t is this vector of probabilities over all possible latent states.

Bayesian Belief Update

Formula

Belief State Update (using Bayes’ Theorem):

b_{t+1}(s') = p(O_{t+1} | s', A_t) · Σ_s p(s' | s, A_t) b_t(s) / p(O_{t+1} | b_t, A_t)

where:

  • p(O_{t+1} | s', A_t) is the observation model
  • p(s' | s, A_t) is the transition model
  • b_t is the old belief (current internal state)
  • The denominator p(O_{t+1} | b_t, A_t) = Σ_{s'} p(O_{t+1} | s', A_t) Σ_s p(s' | s, A_t) b_t(s) is a normalizer

Intuition

The belief update is recursive: given the current belief b_t, the action A_t, and the new observation O_{t+1}, we can compute b_{t+1} without retaining the entire history. This is the key advantage of belief states.
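The recursive update can be sketched in a few lines. This is a minimal illustration; the function name `belief_update`, the nested-list model layout, and the Tiger encoding below are assumptions for the example, not the lecture's notation:

```python
def belief_update(b, a, o, T, O):
    """One recursive Bayesian belief update.

    b: current belief over latent states (list of probabilities)
    T[a][s][s2]: transition model p(s2 | s, a)
    O[a][s2][o]: observation model p(o | s2, a)
    """
    n = len(b)
    # Predict: push the current belief through the transition model.
    predicted = [sum(b[s] * T[a][s][s2] for s in range(n)) for s2 in range(n)]
    # Correct: weight each successor state by the observation likelihood.
    joint = [predicted[s2] * O[a][s2][o] for s2 in range(n)]
    z = sum(joint)  # normalizer: probability of the observation given (b, a)
    return [x / z for x in joint]

# Tiger problem: states (left, right), observations HL = 0, HR = 1.
T = {"listen": [[1.0, 0.0], [0.0, 1.0]]}      # listening does not move the tiger
O = {"listen": [[0.85, 0.15], [0.15, 0.85]]}  # hearing is 85% accurate
b1 = belief_update([0.5, 0.5], "listen", 0, T, O)  # hear HL: belief becomes [0.85, 0.15]
```

Note that the update touches only the current belief, the action, and the observation; the history itself never appears.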


The Tiger Problem

Example

Tiger Problem (Kaelbling et al., 1998; thanks F. Oliehoek & S. Whiteson):

  • Two doors: a tiger is behind one, treasure behind the other
  • Actions: Open Left (OL), Open Right (OR), Listen (L)
  • Rewards: Opening the door with treasure gives +10; opening the tiger door gives -100; listening costs -1
  • Observations: Listening gives a noisy signal: the agent hears correctly 85% of the time
    • If tiger is on the left: p(HL | Listen, left) = 0.85, p(HR | Listen, left) = 0.15
    • If tiger is on the right: p(HR | Listen, right) = 0.85, p(HL | Listen, right) = 0.15
  • In this problem, the transition model is deterministic: listening does not change the latent state (p(s' | s, Listen) = 1 if s' = s, 0 otherwise)

Internal State for Tiger Problem: In the Tiger problem, only the counts of the two observations matter, not their order. We only need to track two numbers: n_HL (how often we heard the tiger on the left) and n_HR (how often we heard it on the right). The vector (n_HL, n_HR) is a useful Markov state for this problem: all histories with the same counts make the same predictions, so this is a Markov function. Adding an observation just means incrementing the appropriate count.

Tiger Problem: Belief Update Walkthrough

Initial belief (t = 0): b_0(left) = b_0(right) = 0.5 (50% chance tiger is left).

After one Listen, hearing HL (O_1 = HL):

Using the belief update formula (with identity transition, since listening does not move the tiger):

b_1(left) = p(HL | left) b_0(left) / (p(HL | left) b_0(left) + p(HL | right) b_0(right)) = (0.85 · 0.5) / (0.85 · 0.5 + 0.15 · 0.5) = 0.85

So after hearing left once, b_1(left) = 0.85.
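Because the posterior odds depend only on the observation counts, the belief after any sequence of listens can be computed in closed form. The helper below is an illustrative sketch (uniform prior, 85% hearing accuracy assumed), not lecture code:

```python
def tiger_belief(n_hl, n_hr, p=0.85):
    # Posterior that the tiger is behind the LEFT door after hearing "left"
    # n_hl times and "right" n_hr times, starting from a uniform prior.
    # The posterior depends only on the counts, which is exactly why
    # (n_hl, n_hr) is a Markov state for this problem.
    left = p ** n_hl * (1 - p) ** n_hr
    right = (1 - p) ** n_hl * p ** n_hr
    return left / (left + right)

b_one = tiger_belief(1, 0)  # 0.85 after one listen
b_two = tiger_belief(2, 0)  # roughly 0.97 after two consistent listens
```

Observe also that tiger_belief(1, 1) returns 0.5: hearing HL and then HR cancels back to the prior.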

Tiger Problem: Value Analysis at Different Belief States

At initial belief b(left) = 0.5:

  • Open Left (OL): 0.5 · (-100) + 0.5 · (+10) = -45
  • Open Right (OR): 0.5 · (+10) + 0.5 · (-100) = -45
  • Listen (L): -1 (plus future value)

At belief b(left) = 0.85 (after one listen hearing left):

  • Open Left (OL): 0.85 · (-100) + 0.15 · (+10) = -83.5
  • Open Right (OR): 0.85 · (+10) + 0.15 · (-100) = -6.5
  • Listen (L): -1 (plus future value)

Probability of the next observation at b(left) = 0.85: p(HL) = 0.745, p(HR) = 0.255.

Warning

Note that at b(left) = 0.85, hearing HL again occurs with probability 0.85 · 0.85 + 0.15 · 0.15 = 0.745, while hearing HR occurs with probability 0.255: since the tiger is probably on the left, the confirming observation is more likely. Hearing HL again raises the belief to about 0.97; hearing HR instead pulls it back to 0.5 (the two observations cancel).

After two listens in the same direction, the belief reaches b(left) = 0.85² / (0.85² + 0.15²) ≈ 0.97:

  • Open Left (OL): 0.97 · (-100) + 0.03 · (+10) ≈ -96.7
  • Open Right (OR): 0.97 · (+10) + 0.03 · (-100) ≈ +6.7
  • Listen (L): -1 (plus future value)

At an even stronger belief (after three consistent listens, b(left) ≈ 0.99):

  • Open Right (OR) ≈ +9.4, Open Left (OL) ≈ -99.4
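The expected immediate rewards in these tables follow directly from the belief. This short sketch assumes the standard Tiger rewards (+10 treasure, -100 tiger, -1 listen); the function name is illustrative:

```python
# Expected immediate reward of each Tiger action as a function of the belief
# that the tiger is behind the left door (assumed rewards: +10 treasure,
# -100 tiger, -1 per listen).
def expected_rewards(b_left):
    return {
        "open_left": b_left * (-100) + (1 - b_left) * 10,
        "open_right": b_left * 10 + (1 - b_left) * (-100),
        "listen": -1,
    }

r_initial = expected_rewards(0.5)     # both doors give -45, listening gives -1
r_confident = expected_rewards(0.85)  # open_right improves to -6.5, still risky
```

At b(left) = 0.5 both doors are equally bad, which is why listening at a cost of -1 dominates; only once the belief is strong enough does opening a door pay off.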

Tiger Problem: Optimal Q-Values

Planning (Dynamic Programming) in the belief-state MDP yields the optimal policy and Q-function (e.g., via value iteration):

  • b(left) = 0.5 (initial): keep listening
  • b(left) = 0.85 (after 1 listen): keep listening
  • b(left) ≈ 0.97 (after 2 listens in the same direction): open the door away from the tiger
  • The achievable value is bounded above by the maximum possible reward of +10 (the treasure)

Intuition

The optimal policy listens until the belief is strong enough, then opens the door away from the tiger. The value increases as the belief becomes more certain, approaching +10 (the treasure reward). The belief-state MDP is fully observable because we always know which belief state we are in.

Belief State Approach: Summary

Summary

Belief State Approach (classical POMDP approach):

Advantages:

  • Concrete meaning of internal state: probability distribution over latent state
  • Relatively compact: has as many dimensions as there are latent states (does not grow with history length)
  • State can be updated recursively without memorizing history

Disadvantages:

  • Underlying observation and transition models are needed
  • Underlying models are difficult to learn from data
  • Only applicable to discrete state spaces

Approach 2: Predictive State Representations

Motivation

Recall the Markov criterion: p(O_{t+1} = o | H_t = h, A_t = a) = p(O_{t+1} = o | f(h), A_t = a).

What if we define the internal state directly as the probability of next observations?

Definition

Predictive State Representation (PSR): Define the internal state as a vector of predictions about future observations: s_t(a, o) = p(O_{t+1} = o | H_t, A_t = a). The full state vector stacks these predictions over all action/observation pairs.

By definition, this fulfils the Markov criterion (since equal predictions imply equal future observation distributions).

Longer Tests

We can also consider longer tests, e.g., τ = (a_1, o_1, a_2, o_2), and define:

Formula

Test Probability: p(τ | h) = p(O_{t+1} = o_1, O_{t+2} = o_2 | H_t = h, A_t = a_1, A_{t+1} = a_2)

It can be proven that for special sets of core tests Q = {τ_1, ..., τ_n}, the vector (p(τ_1 | h), ..., p(τ_n | h)) is a Markov state (Littman, Sutton & Singh, 2002).

PSR: Tiger Problem Example

Example

In the Tiger problem:

  • If we do not know the transition and observation models, we cannot calculate the belief state
  • However, all information can be captured by just two tests (or even one, since the two probabilities sum to 1):
    • p(HL | H_t, Listen): probability of hearing left if we listen
    • p(HR | H_t, Listen): probability of hearing right if we listen
  • These probabilities can be learned from data (e.g., naively with an LSTM classifier, but smarter methods exist)
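As a toy illustration of learning a test probability purely from data, one can estimate p(HL | listen) by counting observations. Here `sample_listen` stands in for an environment whose internals (the latent tiger position, the 85% accuracy) the learner never inspects; both names are assumptions for this sketch:

```python
import random

# Toy estimate of the PSR test probability p(HL | listen) from samples alone.
# sample_listen plays the role of the environment: the learner sees only the
# observations it returns, never the latent state or the observation model.
def sample_listen(tiger_left, accuracy=0.85):
    heard_left = (random.random() < accuracy) == tiger_left
    return "HL" if heard_left else "HR"

random.seed(0)
n = 100_000
n_hl = sum(sample_listen(tiger_left=True) == "HL" for _ in range(n))
p_hl_hat = n_hl / n  # empirical estimate, close to the true value 0.85
```

A real PSR learner would estimate such test probabilities conditioned on histories rather than on a fixed latent state, but the counting principle is the same.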

PSR: Summary

Summary

Predictive State Representations:

Advantages:

  • Test probabilities are learnable from data (no need for known models)
  • At least as compact as belief states, often more so
  • Can still be updated recursively

Disadvantages:

  • Still limited to the tabular setting (though extensions exist)

Approach 3: Approximate Methods

Using Observations Directly

The simplest approximation: use the last observation as internal state: s_t = O_t.

This is equivalent to treating the observation as if it were the full state, using standard RL techniques.

Frame Stacking

A better approximation: use the most recent k observations (and possibly actions) as internal state:

Formula

Frame Stacking: s_t = (O_{t-k+1}, ..., O_{t-1}, O_t)

Example

The Atari DQN paper (Mnih et al., 2013) used 4 stacked frames as input to the network. This allows the agent to infer velocity (e.g., direction of ball movement) from a sequence of frames.

This can be seen as extracting features from the history.
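A frame stack is a few lines of bookkeeping. The class below is a minimal sketch (the name `FrameStack` and the pad-on-reset convention are assumptions of this example, not the DQN implementation):

```python
from collections import deque

# Minimal frame-stacking wrapper: the internal state is simply the tuple of
# the last k observations.
class FrameStack:
    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, obs):
        # Pad with the first observation so the stack is full from step 0.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(obs)
        return tuple(self.frames)

    def step(self, obs):
        self.frames.append(obs)  # the oldest frame falls out automatically
        return tuple(self.frames)

stack = FrameStack(4)
state = stack.reset(0)  # (0, 0, 0, 0)
state = stack.step(1)   # (0, 0, 0, 1)
```

The deque with maxlen=k implements the fixed memory window: anything older than k steps is irrecoverably dropped, which is exactly the limitation listed below.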

Advantages:

  • Very simple to define and use

Disadvantages:

  • Could be very suboptimal if memory of more than k steps is needed
  • Potentially not very compact
  • Potentially non-Markov

Relation to State Aggregation

Tip

Treating observations or stacks as internal state is equivalent to state aggregation:

  • High-dimensional Markov state s ↔ true latent state S_t
  • Aggregation features φ(s) ↔ observation O_t
  • Value prediction v(φ(s)) ↔ value prediction v(s_t), where s_t = O_t

A Practical Note

Warning

While using s_t = O_t seems naive, consider:

  • With function approximation, there is typically no guarantee that given state features define a Markov state anyway
  • In practice, many RL systems treat state as Markov even if it really is not, effectively using s_t = O_t
  • As long as the system is “close enough” to Markov, this can work well enough, even if not optimal

Deep Recurrent Q-Learning (DRQN)

One insight of DQN was that features can be learned. Can we likewise learn the internal state s_t end-to-end in partially observable settings?

Definition

Deep Recurrent Q-Learning (DRQN) (Hausknecht & Stone, 2015): Replace the first fully-connected layer in DQN with an LSTM recurrent layer.

  • The LSTM processes a sequence of observations, maintaining internal memory across time steps
  • Convolutional layers extract visual features from each observation
  • The LSTM layer aggregates information over time, producing an internal state
  • Trained with the usual Q-learning prediction loss against a target network (as in DQN)
  • The network is unrolled over time during training

Two training strategies:

  1. Random updates: sample random starting points within an episode, reset the hidden state, and unroll from there
  2. Sequential updates: process whole episodes in order, carrying the LSTM hidden state across steps

The internal state at each time step is the LSTM hidden state, which can in principle depend on the entire history (not just the last k frames).

End-to-End Learned States: Summary

Advantages:

  • Conceptually simple and ties in to deep learning methodology
  • Compared to frame stacking, no fixed k; the state can depend on the full history
  • Can adjust compactness (up to a point)

Disadvantages:

  • RNN learning can be tricky in practice (local optima, long training time)
  • Potentially non-Markov

Comparison of Approaches

Exact Methods

  • Full history (H_t): not compact; Markov; no model required; nothing to learn
  • Belief state: compact (one dimension per latent state); Markov; requires known models; hard to learn from data
  • PSR: most compact; Markov; no model required; learnable in the tabular setting

Approximate Methods

  • O_t or frame stacking: medium compactness; generally not Markov; very easy to use; loses dependencies longer than k steps
  • End-to-end (DRQN): adjustable compactness; generally not Markov; moderate effort; RNN training is tricky and needs much data

Intuition

There is a fundamental trade-off between:

  • Compactness of the internal state
  • Markov property (does the state predict the future?)
  • Interpretability (what does the state mean?)
  • Computational complexity of updates and learning
  • Ease of implementation

A Note on Timeouts

Warning

In practice, implementations often have a timeout: if a terminal state is not reached by timestep T, force termination. If we mark the last step as a true termination (setting the done flag), this introduces non-Markovianity, since termination is not predictable from the state alone. The time step t can be included as part of the state and observation to restore Markovianity, if a time-dependent policy or value function is desired.
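One common way to handle this in code is to distinguish true termination from a timeout when building the TD target. The sketch below uses the Gym-style terminated/truncated convention, which is an assumption of this example rather than notation from the lecture:

```python
# Sketch of a TD(0) target that distinguishes true termination from a timeout.
# On a timeout (truncated episode) we keep bootstrapping from the successor
# value, because the episode was cut off rather than genuinely finished.
def td_target(r, v_next, gamma, terminated, truncated):
    if terminated:
        return r                  # true terminal state: no future value
    return r + gamma * v_next     # ordinary step or timeout: keep bootstrapping

target_timeout = td_target(1.0, 5.0, 0.9, terminated=False, truncated=True)   # 5.5
target_terminal = td_target(1.0, 5.0, 0.9, terminated=True, truncated=False)  # 1.0
```

Treating a timeout as a real termination would instead zero out the bootstrap term, silently teaching the agent that states near the time limit have no future, which is the non-Markov artifact described above.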


Summary & Key Takeaways

Summary

Core Contributions of This Lecture:

  1. Partial Observability: In a POMDP, the agent receives observations O_t rather than true states S_t. An internal state s_t must be extracted from the history of interactions.

  2. Belief States: Define b_t(s) = p(S_t = s | H_t) and update via Bayes’ rule. This is the classical POMDP approach: Markov, compact, recursively updatable, but requires known models and discrete state spaces.

  3. Tiger Problem: Illustrates belief state dynamics. Starting at b(left) = 0.5, listening updates the belief (e.g., to 0.85 after one listen). The optimal policy listens until confident enough, then acts. Planning in belief space via Dynamic Programming yields the optimal values at each belief state.

  4. Predictive State Representations: Define internal state as predictions of future observations. Learnable from data, at least as compact as belief states, but limited to tabular settings.

  5. Approximate Methods: Frame stacking (as in Atari DQN) and Deep Recurrent Q-Learning (DRQN, which replaces FC layers with LSTM) offer practical alternatives that trade exactness for scalability.

What you should know:

  • What is a state update function and why do we need it?
  • What are the advantages and disadvantages of each discussed state update function?
  • Belief states: definition, Bayesian update (conceptually; you need not compute it by hand), Tiger problem structure
  • PSR: basic idea of defining state via predictions
  • Approximate methods: frame stacking, DRQN basics
  • What is the relation of partial observability to state aggregation?


References

  • Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence.
  • Littman, M. L., Sutton, R. S., & Singh, S. (2002). Predictive Representations of State. NIPS.
  • Hausknecht, M., & Stone, P. (2015). Deep Recurrent Q-Learning for Partially Observable MDPs. AAAI Fall Symposium.
  • Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2013). Playing Atari with Deep Reinforcement Learning. NIPS Deep Learning Workshop.