RL Lecture 11 - SAC, Decision Transformer & Diffuser
Overview & Motivation
This lecture has two major parts. First, we wrap up policy gradient methods by introducing Soft Actor-Critic (SAC), a high-performing deep RL method that augments the standard RL objective with an entropy bonus to encourage exploration and robustness. Second, we take a fundamentally different perspective on policy learning: instead of estimating value functions or computing policy gradients, we ask why not just imitate the good bits of logged trajectories? This leads to the Decision Transformer and Decision Diffuser, which recast RL as a supervised sequence modeling or generative modeling problem.
The core tension addressed: value-based methods (DQN, DDPG) are sample-efficient but brittle; stochastic policy gradient methods are robust but need on-policy samples. SAC gets the best of both worlds. The Decision Transformer and Decision Diffuser sidestep value estimation entirely.
Part 1: Soft Actor-Critic (SAC)
Motivation
Intuition
Three key observations motivate SAC:
- Off-policy learning is important for sample efficiency, but maximizing a Q-function directly (DQN, DDPG) is brittle — small errors in Q can lead to catastrophic policy changes
- Stochastic policy gradients (REINFORCE and similar) need on-policy samples, making them sample-inefficient
- We want an off-policy stochastic actor-critic method that combines the benefits of both approaches
The goal is to develop a method with:
- Off-policy learning (sample-efficient, can reuse data from a replay buffer)
- Stochastic policies (robust, good exploration)
- Stable Q-function optimization (less brittle than pure Q-maximization)
The Maximum Entropy RL Objective
SAC augments the standard RL objective with an entropy term that encourages the policy to remain stochastic:
Formula
SAC Augmented Objective:
$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha \, \mathcal{H}(\pi(\cdot \mid s_t)) \right]$$
where $\mathcal{H}(\pi(\cdot \mid s_t))$ is the entropy of the policy at state $s_t$:
$$\mathcal{H}(\pi(\cdot \mid s_t)) = -\mathbb{E}_{a \sim \pi(\cdot \mid s_t)}\left[ \log \pi(a \mid s_t) \right]$$
Expanding the entropy term explicitly:
$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) - \alpha \, \mathbb{E}_{a \sim \pi(\cdot \mid s_t)}\left[ \log \pi(a \mid s_t) \right] \right]$$
Which simplifies to:
$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \right]$$
Definition
Maximum Entropy RL: A framework where the agent maximizes both expected cumulative reward and the entropy of its policy. The entropy bonus acts as a built-in regularizer that prevents premature convergence to deterministic policies.
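As a toy illustration (not from the lecture; the function name is ours), the entropy bonus for a discrete policy is just the Shannon entropy of its action distribution:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy H = -sum_a pi(a|s) * log pi(a|s) of an action distribution."""
    probs = np.asarray(probs)
    return float(-np.sum(probs * np.log(probs + 1e-12)))

# A uniform policy attains the maximum entropy log|A|; a peaked one scores lower
print(entropy([0.25, 0.25, 0.25, 0.25]))  # log 4 ≈ 1.386
print(entropy([0.97, 0.01, 0.01, 0.01]))  # much lower
```

A policy that always picks one action has entropy near zero, so the bonus penalizes premature determinism.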
The Role of Temperature
The parameter $\alpha$ (temperature) controls the tradeoff between reward maximization and entropy:
| High $\alpha$ | Low $\alpha$ |
|---|---|
| More exploration | More exploitation |
| More random/stochastic policy | More greedy/deterministic policy |
| Prioritizes entropy | Prioritizes reward |
Tip
The lecture notes assume $\alpha = 1$ for simplicity, since different values of $\alpha$ are equivalent to rescaling the reward $r \to r / \alpha$. In practice, SAC can automatically tune $\alpha$ during training.
Why Entropy Matters
Intuition
The entropy objective does more than just encourage exploration at the current state. Because entropy appears inside the sum over all timesteps, the policy is also incentivized to reach future states where it can maintain high entropy. This means the agent seeks out states where it has many viable options, leading to more robust behavior.
Soft Policy Iteration (Tabular Case)
Analogous to regular policy iteration, SAC iterates between two steps:
- Soft value iteration (policy evaluation)
- Soft policy improvement
Soft Bellman Equation
The soft Bellman operator modifies the standard Bellman equation to include the entropy bonus:
Formula
Soft Bellman Operator:
$$\mathcal{T}^\pi Q(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\left[ V(s') \right]$$
This defines the soft value function:
$$V(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ Q(s, a) - \log \pi(a \mid s) \right]$$
So the soft Q-function satisfies:
$$Q(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\left[ V(s') \right]$$
Intuition
The soft value function incorporates the entropy bonus: a state is valuable not just because of high expected reward, but also because the agent has many good action choices there (high entropy).
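A minimal tabular sketch of the soft backup, on a tiny made-up MDP with random transitions (all names and numbers here are ours, not from the lecture):

```python
import numpy as np

def soft_backup(Q, pi, R, P, gamma=0.9):
    """One application of the soft Bellman operator for a fixed policy pi."""
    # Soft value: V(s) = E_{a~pi}[ Q(s,a) - log pi(a|s) ]
    V = np.sum(pi * (Q - np.log(pi + 1e-12)), axis=1)
    # Soft Q: Q(s,a) = r(s,a) + gamma * E_{s'}[ V(s') ]
    return R + gamma * P @ V

# Tiny arbitrary MDP: 2 states, 2 actions
rng = np.random.default_rng(0)
P = rng.random((2, 2, 2)); P /= P.sum(axis=-1, keepdims=True)  # transition probs
R = rng.random((2, 2))                                         # rewards
pi = np.full((2, 2), 0.5)                                      # uniform policy
Q = np.zeros((2, 2))
for _ in range(300):   # repeated backups converge to the soft Q of pi
    Q = soft_backup(Q, pi, R, P)
print(np.round(Q, 3))
```

Because the entropy term $-\log \pi(a \mid s)$ is positive for a stochastic policy, the converged soft Q-values exceed the standard (entropy-free) Q-values.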
Soft Policy Improvement
The soft policy improvement step finds the policy that minimizes the KL divergence to an energy-based policy derived from the current Q-function:
Formula
Soft Policy Improvement:
$$\pi_{\text{new}} = \arg\min_{\pi' \in \Pi} \, D_{\mathrm{KL}}\!\left( \pi'(\cdot \mid s) \,\Big\|\, \frac{\exp\left( Q^{\pi_{\text{old}}}(s, \cdot) \right)}{Z^{\pi_{\text{old}}}(s)} \right)$$
where $Z^{\pi_{\text{old}}}(s)$ is a partition function (normalizing constant).
Intuition
This update says: make the new policy as close as possible to the Boltzmann distribution induced by the current Q-function. Actions with high Q-values get high probability, but the KL divergence constraint keeps the policy spread out (stochastic).
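A small sketch of the energy-based target policy for discrete actions (the helper name is ours):

```python
import numpy as np

def boltzmann_policy(Q, alpha=1.0):
    """Target policy pi(a|s) proportional to exp(Q(s,a)/alpha); the row-wise
    normalizer plays the role of the partition function Z(s)."""
    logits = Q / alpha
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

Q = np.array([[1.0, 2.0, 0.0]])          # Q-values for one state, 3 actions
print(boltzmann_policy(Q))               # favors action 1 but stays stochastic
print(boltzmann_policy(Q, alpha=0.1))    # low temperature -> nearly greedy
```

Note how the same Q-values yield a near-deterministic policy as $\alpha \to 0$, recovering greedy Q-maximization as a limit.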
Deep SAC: Scaling Beyond Tabular
In the deep learning setting, we parameterize three networks:
- Q-networks: $Q_{w_1}(s, a)$ and $Q_{w_2}(s, a)$ (two Q-networks)
- Value network: $V_\psi(s)$ (not strictly necessary, but reduces variance by avoiding a sampling step)
- Policy network: $\pi_\phi(a \mid s)$
All networks are updated with alternating gradient-based updates.
Loss Functions
Formula
Value Network Loss (the action is sampled from the current policy, not the replay buffer):
$$J_V(\psi) = \mathbb{E}_{s \sim \mathcal{D}}\left[ \tfrac{1}{2} \left( V_\psi(s) - \mathbb{E}_{a \sim \pi_\phi}\left[ Q_w(s, a) - \log \pi_\phi(a \mid s) \right] \right)^2 \right]$$
Q-Network Loss (using target network parameters $\bar{\psi}$):
$$J_Q(w) = \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[ \tfrac{1}{2} \left( Q_w(s, a) - \left( r(s, a) + \gamma \, \mathbb{E}_{s' \sim p}\left[ V_{\bar{\psi}}(s') \right] \right) \right)^2 \right]$$
Policy Loss
The policy loss is derived from the KL divergence objective:
$$J_\pi(\phi) = \mathbb{E}_{s \sim \mathcal{D}}\left[ D_{\mathrm{KL}}\!\left( \pi_\phi(\cdot \mid s) \,\Big\|\, \frac{\exp\left( Q_w(s, \cdot) \right)}{Z_w(s)} \right) \right]$$
Expanding:
$$J_\pi(\phi) = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\phi}\left[ \log \pi_\phi(a \mid s) - Q_w(s, a) + \log Z_w(s) \right]$$
Tip
The partition function $Z_w(s)$ is a constant with respect to $\phi$, so it disappears in the gradient. This is a key simplification.
One could compute this gradient with REINFORCE-style estimators, but we can do better using the reparameterization trick.
The Reparameterization Trick
Definition
Reparameterization Trick: Instead of sampling $a \sim \pi_\phi(\cdot \mid s)$ directly (which blocks gradient flow), we write the action as a deterministic function of the state and an independent noise variable: $a = f_\phi(\epsilon; s)$ with $\epsilon \sim \mathcal{N}(0, I)$. This allows gradients to flow through the sampling operation.
Using reparameterization, the policy gradient becomes:
Formula
Reparameterized Policy Gradient:
$$\hat{\nabla}_\phi J_\pi(\phi) = \nabla_\phi \log \pi_\phi(a \mid s) + \left( \nabla_a \log \pi_\phi(a \mid s) - \nabla_a Q_w(s, a) \right) \nabla_\phi f_\phi(\epsilon; s), \quad a = f_\phi(\epsilon; s)$$
This expression has two interpretable components:
- Entropy maximization term: the $\nabla_\phi \log \pi_\phi(a \mid s)$ and $\nabla_a \log \pi_\phi(a \mid s) \, \nabla_\phi f_\phi(\epsilon; s)$ terms — push the policy to increase entropy
- DDPG-like term: $-\nabla_a Q_w(s, a) \, \nabla_\phi f_\phi(\epsilon; s)$ — pushes the policy toward actions with high Q-values, analogous to DDPG
Intuition
The reparameterization trick makes SAC’s policy update similar to DDPG (using known Q-function gradients) while additionally maximizing entropy. This is much lower variance than REINFORCE-style gradient estimation.
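A minimal numpy sketch of the sampling side of the trick, assuming a tanh-squashed Gaussian policy as in SAC (the concrete numbers are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterized_action(mean, log_std, eps):
    """a = f(eps; s) = tanh(mean + std * eps), with eps ~ N(0, I).

    The sample is a deterministic, differentiable function of the policy
    outputs (mean, log_std) and the external noise eps, so gradients can
    flow through it; tanh squashes the action into (-1, 1) as in SAC.
    """
    return np.tanh(mean + np.exp(log_std) * eps)

mean, log_std = 0.3, -1.0           # placeholder outputs of pi_phi(s)
eps = rng.standard_normal(1000)     # noise sampled independently of phi
actions = reparameterized_action(mean, log_std, eps)
print(actions.min() > -1.0 and actions.max() < 1.0)  # True: squashed
```

In an autograd framework, differentiating through `reparameterized_action` with respect to the policy parameters is exactly what yields the low-variance gradient above.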
Full SAC Algorithm
Formula
SAC Algorithm (Deep Version):
    Initialize Q_w1, Q_w2, V_psi, pi_phi, target V_psi_bar
    Initialize replay buffer D
    for each iteration do:
        # Environment interaction
        for each environment step do:
            a ~ pi_phi(a|s)               # sample action from policy
            s' ~ p(s'|s, a)               # step environment
            D <- D ∪ {(s, a, r, s')}      # store transition in replay buffer
        end for
        # Gradient updates (one or multiple steps)
        for each gradient step do:
            Sample minibatch from D
            psi <- psi - lambda_V * nabla_psi J_V(psi)             # update V network
            w_i <- w_i - lambda_Q * nabla_w_i J_Q(w_i), i = 1, 2   # update both Q networks
            phi <- phi - lambda_pi * nabla_phi J_pi(phi)           # update policy
            psi_bar <- tau * psi + (1 - tau) * psi_bar             # soft target update
        end for
    end for
Key implementation details:
- Two Q-functions: $Q_{w_1}$ and $Q_{w_2}$. The minimum of the two is used to counter optimistic Q-value overestimation (optimization bias), similar to the twin critics in TD3
- Target Network: Used for stable Q-learning targets
- Experience Replay: Off-policy data is sampled from a replay buffer
- In practice, 1 environment step is taken per iteration, with one or multiple gradient steps using data from the replay buffer
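The soft target update in the loop above can be sketched as follows (the helper name and toy parameters are ours):

```python
import numpy as np

def polyak_update(target, source, tau=0.005):
    """Soft target update: psi_bar <- tau * psi + (1 - tau) * psi_bar."""
    return [tau * s + (1 - tau) * t for t, s in zip(target, source)]

psi = [np.ones(3)]        # stand-in for the current V-network parameters
psi_bar = [np.zeros(3)]   # stand-in for the target-network parameters
for _ in range(5):
    psi_bar = polyak_update(psi_bar, psi, tau=0.5)
print(psi_bar[0])  # creeps toward psi: [0.96875 0.96875 0.96875]
```

With the small `tau` typical in practice (e.g., 0.005), the target network trails the learned network slowly, which stabilizes the Q-learning targets.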
SAC Results
Summary
Empirical Performance:
- More consistent than DDPG: SAC avoids the instability and brittleness of pure Q-maximization (DDPG results shown in bottom row of comparison plots)
- Faster learning than PPO: The off-policy nature gives SAC better sample efficiency
- Automatic temperature scheduling: The automatically tuned temperature (shown in blue in experiments) works about as well as manually tuned temperature per task (shown in orange)
- Real-world robotics: SAC also works on real-world robotics tasks including walking and manipulation from pixels
SAC Conclusions
Summary
SAC Key Properties:
- Off-policy stochastic Actor-Critic method
- Entropy-maximizing loss keeps the policy stochastic, which:
- Increases robustness to perturbations
- Improves exploration
- Combines benefits of stochastic policy gradients and Q-function maximization:
- Less-greedy update (more robust) with stochastic exploration
- Limited maximization of a Q-function (sample efficient)
Part 2: RL as Supervised Learning
Motivation: Why Not Just Imitate?
Intuition
RL approaches based on learning value functions and/or policies tend to be somewhat “fiddly” and don’t always work. This is especially true in the offline RL case, where we learn from a fixed dataset without further environment interaction. (Recall the difficulties with Conservative Q-Learning (CQL)!)
Meanwhile, transformers and diffusion models trained with supervised learning in language and vision tend to perform more robustly. Can we leverage the strength of supervised learning for RL?
The naive approach — behavioral cloning (just imitate all demonstrations) — has clear limitations:
- Needs high-quality demonstrations
- Cannot in principle do better than the demonstrations
But what if we only have mediocre demonstrations (e.g., from exploration)?
Key Idea: Imitate the Good Bits
We could attempt to only imitate the good trajectories and/or only the “good bits” of trajectories. This is not a new idea:
- Reward-weighted regression and PoWER up-weight trajectories with positive returns (but assume linear policies)
- Upside-down RL and reward-conditioned policies explored conditioning policies on specific desired returns
The breakthrough in the deep learning era: combining conditioning on (good) rewards with modern deep learning architectures (transformers, diffusion models):
Decision Transformer
Core Idea
Definition
Decision Transformer: Treats RL as a sequence modeling problem rather than a value estimation problem. It predicts actions autoregressively as a function of the trajectory so far and the desired return-to-go, using a GPT-style transformer architecture.
The key insight: instead of estimating $V(s)$ or $Q(s, a)$ and deriving a policy, directly predict what action to take given the trajectory history and a target return level.
Trajectory Preprocessing
The trajectory is represented as an interleaved sequence of returns-to-go, states, and actions:
Formula
Trajectory Representation:
$$\tau = \left( \hat{R}_1, s_1, a_1, \hat{R}_2, s_2, a_2, \ldots, \hat{R}_T, s_T, a_T \right)$$
where $\hat{R}_t = \sum_{t'=t}^{T} r_{t'}$ is the return-to-go (sum of future rewards from timestep $t$ onwards).
Intuition
The return-to-go tells the model “this is the total reward achieved from this point onwards.” By conditioning on a desired return-to-go at test time, we can control the performance level of the generated behavior.
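Computing the returns-to-go from a reward sequence is a single backward pass (the function name is ours):

```python
def returns_to_go(rewards):
    """R_hat_t = sum_{t'=t}^{T} r_{t'}: cumulative reward from t onwards."""
    rtg, acc = [], 0.0
    for r in reversed(rewards):
        acc += r
        rtg.append(acc)
    return rtg[::-1]

print(returns_to_go([1.0, 0.0, 2.0, 1.0]))  # [4.0, 3.0, 3.0, 1.0]
```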
Architecture
- GPT architecture: A causally masked transformer (autoregressive)
- Input: Return-to-go, state, and action tokens from the last $K$ timesteps (the context window)
- Output: Predicted next action
- Training loss: Cross-entropy (for discrete actions) or mean squared error (for continuous actions)
    Input tokens: [G_0] [s_0] [a_0] [G_1] [s_1] [a_1] ... [G_t] [s_t] [?]
                     |     |     |     |     |     |        |     |
               [ Causal transformer with positional embeddings ]
                                      |
                         Output: predict action a_t
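The interleaving of returns-to-go, states, and actions can be sketched as follows (the token representation here is purely illustrative, not the model's actual embedding):

```python
def interleave(rtgs, states, actions):
    """Flatten a trajectory into (R_hat_1, s_1, a_1, R_hat_2, s_2, a_2, ...)."""
    tokens = []
    for g, s, a in zip(rtgs, states, actions):
        tokens += [("rtg", g), ("state", s), ("action", a)]
    return tokens

toks = interleave([4.0, 3.0], ["s1", "s2"], ["a1", "a2"])
print([kind for kind, _ in toks])
# ['rtg', 'state', 'action', 'rtg', 'state', 'action']
```

Each of the three token types gets its own embedding before entering the transformer.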
Inference (Test Time)
At test time:
- Start with the initial state $s_1$ and a desired return-to-go $\hat{R}_1$ (e.g., the maximum return seen during training)
- The model predicts action $a_1$
- Execute $a_1$, observe $r_1$ and $s_2$
- Update the return-to-go: $\hat{R}_2 = \hat{R}_1 - r_1$
- Feed $(\hat{R}_2, s_2)$ to predict $a_2$
- Continue autoregressively
Tip
By setting the desired return-to-go to a high value, the model generates behavior that achieves high rewards. This is analogous to prompting a language model — you “prompt” the policy with the performance level you want.
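A sketch of this inference loop, with trivial stand-ins for the trained model and the environment (all names here are assumptions, not real APIs):

```python
def rollout(policy, env_step, s, target_return, horizon):
    """Autoregressive control conditioned on the remaining return-to-go."""
    g, history, total = target_return, [], 0.0
    for _ in range(horizon):
        a = policy(history, g, s)   # predict a_t from context and (g, s_t)
        s, r = env_step(s, a)       # execute a_t, observe r_t and s_{t+1}
        g -= r                      # decrement the return-to-go by r_t
        history.append((g, s, a))
        total += r
    return total

# Trivial stand-ins so the loop runs end to end
total = rollout(policy=lambda h, g, s: 0,
                env_step=lambda s, a: (s, 1.0),
                s=0, target_return=5.0, horizon=5)
print(total)  # 5.0
```

In the real Decision Transformer, `policy` would be the causally masked transformer applied to the last $K$ interleaved tokens.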
Results & Discussion
- Good performance on several offline RL benchmarks
- Better than training on only the best trajectories: Other (mediocre) trajectories help generalization
- Context helps: a context length $K > 1$ outperforms $K = 1$, suggesting the environment appears non-Markov or that policy history provides useful hints for future actions
- Robust to sparse reward settings: Because it directly models returns rather than bootstrapping value estimates
- No value optimization: No need for pessimism or regularization (unlike CQL)
Warning
The Decision Transformer is primarily designed for offline RL — it learns from a fixed dataset of logged trajectories without online interaction with the environment.
Decision Diffuser
Core Idea
Definition
Decision Diffuser: Uses a diffusion model to generate future state trajectories with desired properties, then derives actions using an inverse dynamics model. Unlike the Decision Transformer, it can condition on multiple types of guidance beyond just return: constraints, skills, and their combinations.
Architecture & Approach
The Decision Diffuser operates in two stages:
- Trajectory generation: A diffusion model generates a future state trajectory $(s_t, s_{t+1}, \ldots, s_{t+H})$
- Action extraction: A separate inverse dynamics model maps consecutive state pairs to actions: $a_t = f_\phi(s_t, s_{t+1})$
Intuition
Why generate only states and not actions?
- In robotics domains, states tend to be continuous and smooth (positions, velocities)
- Actions can be jerky or discrete (torques, motor commands)
- Diffusion models work better on smooth, continuous data
- The inverse dynamics model handles the state-to-action mapping separately
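As a toy illustration of an inverse dynamics model (not the learned MLP from the paper), consider unit point-mass dynamics $s' = s + a$, where the action is recovered exactly as a state difference:

```python
import numpy as np

def inverse_dynamics(s, s_next):
    """For dynamics s' = s + a, the action is the state difference; in the
    Decision Diffuser a learned MLP f_phi(s_t, s_{t+1}) plays this role."""
    return s_next - s

# States as produced by a (hypothetical) diffusion model
states = np.array([[0.0, 0.0], [0.5, 0.2], [1.0, 0.1]])
actions = np.array([inverse_dynamics(states[t], states[t + 1])
                    for t in range(len(states) - 1)])
print(actions)  # [[ 0.5  0.2] [ 0.5 -0.1]]
```

For real robot dynamics the mapping is not this trivial, which is exactly why a learned model is trained on consecutive state pairs from the dataset.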
Conditioning with Classifier-Free Guidance
The key advantage of the Decision Diffuser is its flexible conditioning mechanism. During the reverse (denoising) process of the diffusion model, classifier-free guidance adds a bias toward trajectories with desired properties:
Formula
Conditioning Types:
- Maximize return: Condition the noise prediction on $y(\tau) = 1$, i.e., the maximum normalized return
- Satisfy constraints: Condition on the constraint identity (one-hot encoding)
- Compose skills: Condition on skill identity (one-hot encoding)
With classifier-free guidance, during training the conditioning information is provided with probability $1 - p$ and dropped with probability $p$. At inference, the model can interpolate between the conditioned and unconditioned predictions to steer generation.
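The guidance combination at inference can be sketched as follows, where `w` is the guidance weight (a minimal sketch; all names are ours):

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, w):
    """Classifier-free guidance: eps = eps_uncond + w * (eps_cond - eps_uncond).

    w = 0 ignores the condition, w = 1 uses it as-is, and w > 1 extrapolates
    further toward noise consistent with the desired property.
    """
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 0.0])   # unconditional noise prediction
eps_c = np.array([1.0, -1.0])  # prediction conditioned on, e.g., high return
print(cfg_noise(eps_u, eps_c, 0.0))  # [0. 0.]
print(cfg_noise(eps_u, eps_c, 1.2))  # [ 1.2 -1.2]
```

Dropping the condition during training is what gives the model both predictions, so this combination is available at every denoising step.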
Training Details
Formula
Training Components:
- Diffusion model $\epsilon_\theta$: Predicts the noise applied in the forward diffusion process, conditioned on desired properties (with the conditioning randomly dropped during training to enable classifier-free guidance)
- Inverse dynamics model : Predicts actions from consecutive state pairs
Architecture:
- $\epsilon_\theta$: Temporal U-Net, with conditioning information projected to a latent vector via an MLP
- $f_\phi$: MLP
Low-temperature sampling: During denoising at inference time, reduce the variance of predicted noise to get more deterministic, high-quality trajectories
Intuition
Low-temperature sampling is analogous to reducing the temperature in language model generation — it makes the output more focused and less random, favoring the most likely trajectories.
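A minimal sketch of low-temperature sampling: scale down the noise injected at each reverse step (names and numbers are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def denoising_noise(sigma, temperature=0.5):
    """Noise injected in one reverse-diffusion step, scaled by a temperature;
    temperature < 1 shrinks the variance, making sampling more deterministic."""
    return temperature * sigma * rng.standard_normal()

samples = [denoising_noise(sigma=1.0) for _ in range(10000)]
print(float(np.std(samples)))  # empirical std is near temperature * sigma = 0.5
```

At temperature 0 the reverse process becomes fully deterministic, always producing the mode of the denoising distribution.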
Results
- Offline RL: Competitive with or outperforms baselines such as behavioral cloning, Conservative Q-Learning (CQL), and Decision Transformer
- Constraints: Better at satisfying single constraints than baselines, and the only model that can combine multiple constraints (e.g., satisfying two constraints joined by a logical AND)
- Skills: Manages to generate behavior somewhat “in between” separately learned skills
Related Work: Decision Transformer / Diffuser Family
Several works are closely related:
| Method | Key Difference |
|---|---|
| Upside-down RL | Single state input (no sequence model) |
| Trajectory Transformer | Similar to (concurrent with) Decision Transformer, but also uses state and return predictions |
| Diffuser (Janner et al.) | Uses classifier guidance (gradient of estimated returns) instead of classifier-free guidance; predicts state-action pairs instead of states only |
Conclusions on Decision Transformer / Diffuser
Summary
- Promising alternative to methods that estimate values (critic-only or actor-critic)
- Open question: Do they share weaknesses of methods based on Monte Carlo returns? (They rely on observed returns rather than bootstrapped estimates)
- Mostly aimed at offline RL — whether they are equally promising for online RL remains an open question
The Big Picture: RL Methods Landscape
                             RL Methods
                                 |
                _________________|_________________
                |                                 |
        Action-value based                  Policy-based
                |                                 |
      SARSA, Q-learning, MC         Policy Gradient Methods
      Gradient MC /               /       |        |        \
      Semi-gradient TD,    REINFORCE     PGT  Actor-Critic  DPG/DDPG
      GTD2, etc.           G(PO)MDP                |
                                                  SAC

      Critic only <--------> Actor-Critic <--------> Actor only
                                 |
                       Decision Transformer
                        Decision Diffuser
How to Learn Policies — Three Paradigms
Given data $\mathcal{D} = \{(s, a, r, s')\}$:
- Value-based: Learn $Q(s, a)$ or $V(s)$, derive the policy from values
- Policy-based: Directly optimize the policy $\pi(a \mid s)$ using policy gradients or supervised learning
- Model-based: Learn dynamics $p(s' \mid s, a)$ and reward $r(s, a)$, then plan or learn value/policy using the model
What You Should Know
Summary
Exam-relevant takeaways from this lecture:
- Soft Actor-Critic:
- Off-policy stochastic actor-critic
- How does the chosen loss (entropy-augmented objective) result in a robust and efficient method?
- The entropy term keeps the policy stochastic (robustness, exploration) while Q-function gradients drive efficient learning
- Main properties of policy improvement using supervised learning:
- Can imitate only the good parts of trajectories by conditioning on desired returns
- No need for value function bootstrapping, pessimism, or regularization
- Trajectory preprocessing:
- Representing trajectories as sequences of $(\hat{R}_t, s_t, a_t)$ triples, where $\hat{R}_t$ is the return-to-go
- Main idea behind Decision Transformer, Decision Diffuser, and their differences:
- Decision Transformer: autoregressive action prediction conditioned on return-to-go
- Decision Diffuser: diffusion-based trajectory generation with flexible conditioning (returns, constraints, skills) + inverse dynamics for actions
New Concepts to Explore
The following concepts are introduced or referenced and warrant deeper study:
- Soft Actor-Critic (SAC) - Off-policy maximum entropy actor-critic method
- Maximum Entropy RL - Framework augmenting RL with entropy bonuses
- Decision Transformer - RL via autoregressive sequence modeling conditioned on returns
- Decision Diffuser - RL via conditional diffusion trajectory generation
- Offline Reinforcement Learning - Learning from fixed datasets without online interaction
- Reparameterization Trick - Gradient-friendly sampling via deterministic transformations of noise
- Classifier-Free Guidance - Conditioning mechanism for diffusion models without a separate classifier
- Inverse Dynamics Model - Predicting actions from consecutive state pairs
- Reward-Weighted Regression - Weighting policy updates by trajectory returns
- Upside-Down RL - Conditioning policies on desired returns (precursor to Decision Transformer)
References
- Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML.
- Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., … & Levine, S. (2018). Soft Actor-Critic Algorithms and Applications. arXiv preprint arXiv:1812.05905.
- Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., & Mordatch, I. (2021). Decision Transformer: Reinforcement Learning via Sequence Modeling. NeurIPS.
- Ajay, A., Du, Y., Gupta, A., Tenenbaum, J. B., Jaakkola, T. S., & Agrawal, P. (2023). Is Conditional Generative Modeling all you need for Decision Making? ICLR.
- Janner, M., Li, Q., & Levine, S. (2021). Offline Reinforcement Learning as One Big Sequence Modeling Problem. NeurIPS.
- Janner, M., Du, Y., Tenenbaum, J. B., & Levine, S. (2022). Planning with Diffusion for Flexible Behavior Synthesis. ICML.
- Peters, J., & Schaal, S. (2007). Reinforcement Learning by Reward-Weighted Regression for Operational Space Control. ICML.
- Srivastava, R. K., Shyam, P., Mutz, F., Jaśkowski, W., & Schmidhuber, J. (2019). Training Agents using Upside-Down Reinforcement Learning. CoRR.
- Kumar, A., Peng, X. B., & Levine, S. (2019). Reward-Conditioned Policies. arXiv preprint arXiv:1912.13465.