RL Lecture 13 - Partial Observability

Overview & Motivation

So far in this course, we have assumed the agent interacts with a Markov Decision Process in which it has access to the true state at every time step. All value-based, policy-based, and model-based methods we studied relied on this assumption. This lecture asks: what if the agent cannot observe the full state?

In many real-world problems, the agent receives only an observation O_t that provides incomplete or noisy information about the latent state S_t. This is called partial observability, and the resulting problem is a POMDP (Partially Observable Markov Decision Process).

Three common sources of partial observability:

  • Aliasing: Different states look identical (“I am in front of a door but I don’t know which one”)
  • Noise: Observations are corrupted (“My GPS tells me where I am, but it can be off a bit”)
  • Missing information: Some state variables are simply unobserved (“Which direction is the ball going?” from a single frame)

The core tension: how to construct a compact internal state from the history of interactions that captures enough information for good decision-making.


Observations vs. States

Definition

In a POMDP, the agent does not observe the true latent state S_t directly. Instead, it receives an observation O_t generated by an observation function p(O_t = o | S_t = s, A_{t-1} = a).

The world is now described by:

  • Transitions: p(s' | s, a) (same as MDP)
  • Rewards: E[R_{t+1} | s, a] (same as MDP)
  • Observation function: p(o | s, a) (new!)

Terminology

  • A_t, R_t: action and reward, same as in an MDP
  • S_t: latent “true” state
  • O_t: observation
  • s_t: internal representation; represents the agent’s knowledge about S_t. Used by policy and value function

Tip

In a standard MDP, all of these coincide: S_t = O_t = s_t. Partial observability is precisely the setting where they diverge.


Histories and Markov Functions

For an optimal policy, we should use all information available. The complete history is:

Formula

History: H_t = (O_0, A_0, R_1, O_1, A_1, R_2, ..., A_{t-1}, R_t, O_t). (Unless stated otherwise, rewards are assumed predictable from H_t.)

There exists an optimal policy of the form π(A_t | H_t), but full histories are impractical to work with. Instead, we seek a feature function s_t = f(H_t) that summarizes the history compactly.

Desired Properties of f

  1. Compactness: s_t = f(H_t) should be a low-dimensional summary of the history
  2. Markov property: s_t should capture all relevant information for prediction

Definition

Markov Criterion for Internal State: A function f of the history is a Markov state if: p(O_{t+1} = o | H_t = h, A_t = a) = p(O_{t+1} = o | f(h), A_t = a) for all o, a, and histories h.

Intuition

If two histories produce the same internal state, they must predict the same future observations. This is the generalization of the Markov property to partially observable settings.

Agent Architecture

The general architecture under partial observability separates the world (which has latent state S_t and produces observations O_t and rewards R_t given actions A_t) from the agent (which maintains an internal state s_t via an update function s_t = u(s_{t-1}, A_{t-1}, O_t), and uses s_t for its policy and value function).

The central question is: what to choose for the state update function u?


Approach 1: Belief States

Attempt 0 — Full History as State

Using s_t = H_t (the full history) as the internal state:

Advantages:

  • Extremely simple
  • Clearly a Markov function (trivially satisfies the criterion)

Disadvantages:

  • Not compact; must remember all of history
  • Tabular policy needs to represent all possible sequences
  • Impossible in continuing problems
  • Inefficient in episodic problems (e.g., in the Tiger problem, the history (Listen, HL, Listen, HR) conveys the same information as (Listen, HR, Listen, HL), yet is stored as a distinct state)

Belief State Definition

Definition

Belief State: The belief state is the posterior probability of being in latent state s given the history: b_t(s) = p(S_t = s | H_t). The internal state s_t = b_t is this vector of probabilities over all possible latent states.

Bayesian Belief Update

Formula

Belief State Update (using Bayes’ Theorem):

b_{t+1}(s') = p(O_{t+1} | s', A_t) · Σ_s p(s' | s, A_t) b_t(s) / p(O_{t+1} | b_t, A_t)

where:

  • p(O_{t+1} | s', A_t) is the observation model
  • p(s' | s, A_t) is the transition model
  • b_t is the old belief (current internal state)
  • The denominator p(O_{t+1} | b_t, A_t) = Σ_{s'} p(O_{t+1} | s', A_t) Σ_s p(s' | s, A_t) b_t(s) is a normalizer

Intuition

The belief update is recursive: given the current belief b_t, the action A_t, and the new observation O_{t+1}, we can compute b_{t+1} without retaining the entire history. This is the key advantage of belief states.
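The recursive update can be sketched in a few lines. This is a minimal illustration; the function name `belief_update`, the nested-list model layout, and the Tiger encoding below are assumptions for the example, not the lecture's notation:

```python
def belief_update(b, a, o, T, O):
    """One recursive Bayesian belief update.

    b: current belief over latent states (list of probabilities)
    T[a][s][s2]: transition model p(s2 | s, a)
    O[a][s2][o]: observation model p(o | s2, a)
    """
    n = len(b)
    # Predict: push the current belief through the transition model.
    predicted = [sum(b[s] * T[a][s][s2] for s in range(n)) for s2 in range(n)]
    # Correct: weight each successor state by the observation likelihood.
    joint = [predicted[s2] * O[a][s2][o] for s2 in range(n)]
    z = sum(joint)  # normalizer: probability of the observation given (b, a)
    return [x / z for x in joint]

# Tiger problem: states (left, right), observations HL = 0, HR = 1.
T = {"listen": [[1.0, 0.0], [0.0, 1.0]]}      # listening does not move the tiger
O = {"listen": [[0.85, 0.15], [0.15, 0.85]]}  # hearing is 85% accurate
b1 = belief_update([0.5, 0.5], "listen", 0, T, O)  # hear HL: belief becomes [0.85, 0.15]
```

Note that the update touches only the current belief, the action, and the observation; the history itself never appears.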


The Tiger Problem

Example

Tiger Problem (Kaelbling et al., 1998; thanks F. Oliehoek & S. Whiteson):

  • Two doors: a tiger is behind one, treasure behind the other
  • Actions: Open Left (OL), Open Right (OR), Listen (L)
  • Rewards: Opening the door with treasure gives +10; opening the tiger door gives -100; listening costs -1
  • Observations: Listening gives a noisy signal: the agent hears correctly 85% of the time
    • If tiger is on the left: p(HL | Listen, left) = 0.85, p(HR | Listen, left) = 0.15
    • If tiger is on the right: p(HR | Listen, right) = 0.85, p(HL | Listen, right) = 0.15
  • In this problem, the transition model is deterministic: listening does not change the latent state (p(s' | s, Listen) = 1 if s' = s, 0 otherwise)

Internal State for Tiger Problem: In the Tiger problem, only the counts of the two observations matter, not their order. We only need to track two numbers: n_HL (how often we heard the tiger on the left) and n_HR (how often we heard it on the right). The vector (n_HL, n_HR) is a useful Markov state for this problem: all histories with the same counts make the same predictions, so this is a Markov function. Adding an observation just means incrementing the appropriate count.

Tiger Problem: Belief Update Walkthrough

Initial belief (t = 0): b_0(left) = b_0(right) = 0.5 (50% chance tiger is left).

After one Listen, hearing HL (O_1 = HL):

Using the belief update formula (with identity transition, since listening does not move the tiger):

b_1(left) = p(HL | left) b_0(left) / (p(HL | left) b_0(left) + p(HL | right) b_0(right)) = (0.85 · 0.5) / (0.85 · 0.5 + 0.15 · 0.5) = 0.85

So after hearing left once, b_1(left) = 0.85.
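Because the posterior odds depend only on the observation counts, the belief after any sequence of listens can be computed in closed form. The helper below is an illustrative sketch (uniform prior, 85% hearing accuracy assumed), not lecture code:

```python
def tiger_belief(n_hl, n_hr, p=0.85):
    # Posterior that the tiger is behind the LEFT door after hearing "left"
    # n_hl times and "right" n_hr times, starting from a uniform prior.
    # The posterior depends only on the counts, which is exactly why
    # (n_hl, n_hr) is a Markov state for this problem.
    left = p ** n_hl * (1 - p) ** n_hr
    right = (1 - p) ** n_hl * p ** n_hr
    return left / (left + right)

b_one = tiger_belief(1, 0)  # 0.85 after one listen
b_two = tiger_belief(2, 0)  # roughly 0.97 after two consistent listens
```

Observe also that tiger_belief(1, 1) returns 0.5: hearing HL and then HR cancels back to the prior.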

Tiger Problem: Value Analysis at Different Belief States

At initial belief b(left) = 0.5:

  • Open Left (OL): 0.5 · (-100) + 0.5 · (+10) = -45
  • Open Right (OR): 0.5 · (+10) + 0.5 · (-100) = -45
  • Listen (L): -1 (plus future value)

At belief b(left) = 0.85 (after one listen hearing left):

  • Open Left (OL): 0.85 · (-100) + 0.15 · (+10) = -83.5
  • Open Right (OR): 0.85 · (+10) + 0.15 · (-100) = -6.5
  • Listen (L): -1 (plus future value)

Probability of the next observation at b(left) = 0.85: p(HL) = 0.745, p(HR) = 0.255.

Warning

Note that at b(left) = 0.85, hearing HL again occurs with probability 0.85 · 0.85 + 0.15 · 0.15 = 0.745, while hearing HR occurs with probability 0.255: since the tiger is probably on the left, the confirming observation is more likely. Hearing HL again raises the belief to about 0.97; hearing HR instead pulls it back to 0.5 (the two observations cancel).

After two listens in the same direction, the belief reaches b(left) = 0.85² / (0.85² + 0.15²) ≈ 0.97:

  • Open Left (OL): 0.97 · (-100) + 0.03 · (+10) ≈ -96.7
  • Open Right (OR): 0.97 · (+10) + 0.03 · (-100) ≈ +6.7
  • Listen (L): -1 (plus future value)

At an even stronger belief (after three consistent listens, b(left) ≈ 0.99):

  • Open Right (OR) ≈ +9.4, Open Left (OL) ≈ -99.4
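The expected immediate rewards in these tables follow directly from the belief. This short sketch assumes the standard Tiger rewards (+10 treasure, -100 tiger, -1 listen); the function name is illustrative:

```python
# Expected immediate reward of each Tiger action as a function of the belief
# that the tiger is behind the left door (assumed rewards: +10 treasure,
# -100 tiger, -1 per listen).
def expected_rewards(b_left):
    return {
        "open_left": b_left * (-100) + (1 - b_left) * 10,
        "open_right": b_left * 10 + (1 - b_left) * (-100),
        "listen": -1,
    }

r_initial = expected_rewards(0.5)     # both doors give -45, listening gives -1
r_confident = expected_rewards(0.85)  # open_right improves to -6.5, still risky
```

At b(left) = 0.5 both doors are equally bad, which is why listening at a cost of -1 dominates; only once the belief is strong enough does opening a door pay off.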

Tiger Problem: Optimal Q-Values

Planning (Dynamic Programming) in the belief-state MDP yields the optimal policy and Q-function (e.g., via value iteration):

  • b(left) = 0.5 (initial): keep listening
  • b(left) = 0.85 (after 1 listen): keep listening
  • b(left) ≈ 0.97 (after 2 listens in the same direction): open the door away from the tiger
  • The achievable value is bounded above by the maximum possible reward of +10 (the treasure)

Intuition

The optimal policy listens until the belief is strong enough, then opens the door away from the tiger. The value increases as the belief becomes more certain, approaching +10 (the treasure reward). The belief-state MDP is fully observable because we always know which belief state we are in.

Belief State Approach: Summary

Summary

Belief State Approach (classical POMDP approach):

Advantages:

  • Concrete meaning of internal state: probability distribution over latent state
  • Relatively compact: has as many dimensions as there are latent states (does not grow with history length)
  • State can be updated recursively without memorizing history

Disadvantages:

  • Underlying observation and transition models are needed
  • Underlying models are difficult to learn from data
  • Only applicable to discrete state spaces

Approach 2: Predictive State Representations

Motivation

Recall the Markov criterion: p(O_{t+1} = o | H_t = h, A_t = a) = p(O_{t+1} = o | f(h), A_t = a).

What if we define the internal state directly as the probability of next observations?

Definition

Predictive State Representation (PSR): Define the internal state as a vector of predictions about future observations: s_t(a, o) = p(O_{t+1} = o | H_t, A_t = a). The full state vector stacks these predictions over all action/observation pairs.

By definition, this fulfils the Markov criterion (since equal predictions imply equal future observation distributions).

Longer Tests

We can also consider longer tests, e.g., τ = (a_1, o_1, a_2, o_2), and define:

Formula

Test Probability: p(τ | h) = p(O_{t+1} = o_1, O_{t+2} = o_2 | H_t = h, A_t = a_1, A_{t+1} = a_2)

It can be proven that for special sets of core tests Q = {τ_1, ..., τ_n}, the vector (p(τ_1 | h), ..., p(τ_n | h)) is a Markov state (Littman, Sutton & Singh, 2002).

PSR: Tiger Problem Example

Example

In the Tiger problem:

  • If we do not know the transition and observation models, we cannot calculate the belief state
  • However, all information can be captured by just two tests (or even one, since the two probabilities sum to 1):
    • p(HL | H_t, Listen): probability of hearing left if we listen
    • p(HR | H_t, Listen): probability of hearing right if we listen
  • These probabilities can be learned from data (e.g., naively with an LSTM classifier, but smarter methods exist)
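As a toy illustration of learning a test probability purely from data, one can estimate p(HL | listen) by counting observations. Here `sample_listen` stands in for an environment whose internals (the latent tiger position, the 85% accuracy) the learner never inspects; both names are assumptions for this sketch:

```python
import random

# Toy estimate of the PSR test probability p(HL | listen) from samples alone.
# sample_listen plays the role of the environment: the learner sees only the
# observations it returns, never the latent state or the observation model.
def sample_listen(tiger_left, accuracy=0.85):
    heard_left = (random.random() < accuracy) == tiger_left
    return "HL" if heard_left else "HR"

random.seed(0)
n = 100_000
n_hl = sum(sample_listen(tiger_left=True) == "HL" for _ in range(n))
p_hl_hat = n_hl / n  # empirical estimate, close to the true value 0.85
```

A real PSR learner would estimate such test probabilities conditioned on histories rather than on a fixed latent state, but the counting principle is the same.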

PSR: Summary

Summary

Predictive State Representations:

Advantages:

  • Test probabilities are learnable from data (no need for known models)
  • At least as compact as belief states, often more so
  • Can still be updated recursively

Disadvantages:

  • Still limited to the tabular setting (though extensions exist)

Approach 3: Approximate Methods

Using Observations Directly

The simplest approximation: use the last observation as internal state: s_t = O_t.

This is equivalent to treating the observation as if it were the full state, using standard RL techniques.

Frame Stacking

A better approximation: use the most recent k observations (and possibly actions) as internal state:

Formula

Frame Stacking: s_t = (O_{t-k+1}, ..., O_{t-1}, O_t)

Example

The Atari DQN paper (Mnih et al., 2013) used 4 stacked frames as input to the network. This allows the agent to infer velocity (e.g., direction of ball movement) from a sequence of frames.

This can be seen as extracting features from the history.
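A frame stack is a few lines of bookkeeping. The class below is a minimal sketch (the name `FrameStack` and the pad-on-reset convention are assumptions of this example, not the DQN implementation):

```python
from collections import deque

# Minimal frame-stacking wrapper: the internal state is simply the tuple of
# the last k observations.
class FrameStack:
    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, obs):
        # Pad with the first observation so the stack is full from step 0.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(obs)
        return tuple(self.frames)

    def step(self, obs):
        self.frames.append(obs)  # the oldest frame falls out automatically
        return tuple(self.frames)

stack = FrameStack(4)
state = stack.reset(0)  # (0, 0, 0, 0)
state = stack.step(1)   # (0, 0, 0, 1)
```

The deque with maxlen=k implements the fixed memory window: anything older than k steps is irrecoverably dropped, which is exactly the limitation listed below.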

Advantages:

  • Very simple to define and use

Disadvantages:

  • Could be very suboptimal if memory of more than k steps is needed
  • Potentially not very compact
  • Potentially non-Markov

Relation to State Aggregation

Tip

Treating observations or stacks as internal state is equivalent to state aggregation:

  • High-dimensional Markov state s ↔ true latent state S_t
  • Aggregation features φ(s) ↔ observation O_t
  • Value prediction v(φ(s)) ↔ value prediction v(s_t), where s_t = O_t

A Practical Note

Warning

While using s_t = O_t seems naive, consider:

  • With function approximation, there is typically no guarantee that given state features define a Markov state anyway
  • In practice, many RL systems treat state as Markov even if it really is not, effectively using s_t = O_t
  • As long as the system is “close enough” to Markov, this can work well enough, even if not optimal

Deep Recurrent Q-Learning (DRQN)

One insight of DQN was that features can be learned. Can we likewise learn the internal state s_t end-to-end in partially observable settings?

Definition

Deep Recurrent Q-Learning (DRQN) (Hausknecht & Stone, 2015): Replace the first fully-connected layer in DQN with an LSTM recurrent layer.

  • The LSTM processes a sequence of observations, maintaining internal memory across time steps
  • Convolutional layers extract visual features from each observation
  • The LSTM layer aggregates information over time, producing an internal state
  • Trained with the usual Q-learning prediction loss against a target network (as in DQN)
  • The network is unrolled over time during training

Two training strategies:

  1. Random updates: sample random starting points within an episode, reset the hidden state, and unroll from there
  2. Sequential updates: process whole episodes in order, carrying the LSTM hidden state across steps

The internal state at each time step is the LSTM hidden state, which can in principle depend on the entire history (not just the last k frames).

End-to-End Learned States: Summary

Advantages:

  • Conceptually simple and ties in to deep learning methodology
  • Compared to frame stacking, no fixed k; the state can depend on the full history
  • Can adjust compactness (up to a point)

Disadvantages:

  • RNN learning can be tricky in practice (local optima, long training time)
  • Potentially non-Markov

Comparison of Approaches

Exact Methods

  • Full history (H_t): not compact; Markov; no model required; nothing to learn
  • Belief state: compact (one dimension per latent state); Markov; requires known models; hard to learn from data
  • PSR: most compact; Markov; no model required; learnable in the tabular setting

Approximate Methods

  • O_t or frame stacking: medium compactness; generally not Markov; very easy to use; loses dependencies longer than k steps
  • End-to-end (DRQN): adjustable compactness; generally not Markov; moderate effort; RNN training is tricky and needs much data

Intuition

There is a fundamental trade-off between:

  • Compactness of the internal state
  • Markov property (does the state predict the future?)
  • Interpretability (what does the state mean?)
  • Computational complexity of updates and learning
  • Ease of implementation

A Note on Timeouts

Warning

In practice, implementations often have a timeout: if a terminal state is not reached by timestep T, force termination. If we mark the last step as a true termination (setting the done flag), this introduces non-Markovianity, since termination is not predictable from the state alone. The time step t can be included as part of the state and observation to restore Markovianity, if a time-dependent policy or value function is desired.
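One common way to handle this in code is to distinguish true termination from a timeout when building the TD target. The sketch below uses the Gym-style terminated/truncated convention, which is an assumption of this example rather than notation from the lecture:

```python
# Sketch of a TD(0) target that distinguishes true termination from a timeout.
# On a timeout (truncated episode) we keep bootstrapping from the successor
# value, because the episode was cut off rather than genuinely finished.
def td_target(r, v_next, gamma, terminated, truncated):
    if terminated:
        return r                  # true terminal state: no future value
    return r + gamma * v_next     # ordinary step or timeout: keep bootstrapping

target_timeout = td_target(1.0, 5.0, 0.9, terminated=False, truncated=True)   # 5.5
target_terminal = td_target(1.0, 5.0, 0.9, terminated=True, truncated=False)  # 1.0
```

Treating a timeout as a real termination would instead zero out the bootstrap term, silently teaching the agent that states near the time limit have no future, which is the non-Markov artifact described above.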


Summary & Key Takeaways

Summary

Core Contributions of This Lecture:

  1. Partial Observability: In a POMDP, the agent receives observations O_t rather than true states S_t. An internal state s_t must be extracted from the history of interactions.

  2. Belief States: Define b_t(s) = p(S_t = s | H_t) and update via Bayes’ rule. This is the classical POMDP approach: Markov, compact, recursively updatable, but requires known models and discrete state spaces.

  3. Tiger Problem: Illustrates belief state dynamics. Starting at b(left) = 0.5, listening updates the belief (e.g., to 0.85 after one listen). The optimal policy listens until confident enough, then acts. Planning in belief space via Dynamic Programming yields the optimal values at each belief state.

  4. Predictive State Representations: Define internal state as predictions of future observations. Learnable from data, at least as compact as belief states, but limited to tabular settings.

  5. Approximate Methods: Frame stacking (as in Atari DQN) and Deep Recurrent Q-Learning (DRQN, which replaces FC layers with LSTM) offer practical alternatives that trade exactness for scalability.

What you should know:

  • What is a state update function and why do we need it?
  • What are the advantages and disadvantages of each discussed state update function?
  • Belief states: definition, Bayesian update (conceptually; you need not compute it by hand), Tiger problem structure
  • PSR: basic idea of defining state via predictions
  • Approximate methods: frame stacking, DRQN basics
  • What is the relation of partial observability to state aggregation?


References

  • Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence.
  • Littman, M. L., Sutton, R. S., & Singh, S. (2002). Predictive Representations of State. NIPS.
  • Hausknecht, M., & Stone, P. (2015). Deep Recurrent Q-Learning for Partially Observable MDPs. AAAI Fall Symposium.
  • Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2013). Playing Atari with Deep Reinforcement Learning. NIPS Deep Learning Workshop.