SARSA
Definition
SARSA
SARSA is an on-policy TD control algorithm. It learns action-value function for the policy currently being followed. The name comes from the quintuple used in each update: .
Update Rule
SARSA Update
where:
- is the actual next action chosen by the current policy (e.g., ε-greedy)
- Contrast with Q-Learning which uses instead
Algorithm
Algorithm: SARSA (On-Policy TD Control)
───────────────────────────────────────
Initialize Q(s,a) arbitrarily for all s,a
(Q(terminal, ·) = 0)
Loop for each episode:
Initialize S
Choose A from S using policy derived from Q (e.g., ε-greedy)
Loop for each step of episode:
Take action A, observe R, S'
Choose A' from S' using policy derived from Q (e.g., ε-greedy)
Q(S,A) ← Q(S,A) + α[R + γ Q(S',A') - Q(S,A)]
S ← S'; A ← A'
until S is terminalOn-Policy = What You Do Is What You Learn About
SARSA updates Q toward the value of the action it actually takes next (, chosen by the ε-greedy policy). This means it learns the value of the ε-greedy policy, including the cost of occasional random actions. This can lead to “safer” behavior in risky environments.
Cliff Walking Example
Cliff Walking: SARSA vs Q-Learning
In the Cliff Walking gridworld (a cliff along the bottom edge with -100 reward for falling off):
- Q-Learning learns the optimal path: right along the cliff edge (shortest, but risky with ε-greedy exploration → occasional falls)
- SARSA learns a safer path: goes around the top (longer but avoids cliff, because it accounts for the ε-probability of falling off during exploration)
Q-learning’s online performance is worse (more cliff falls during training) even though its learned policy is optimal. SARSA’s learned policy is suboptimal but safer under ε-greedy execution.
Connections
- Instance of: Temporal Difference Learning (on-policy control)
- Compared with: Q-Learning (off-policy), Expected SARSA
- Extended by: Semi-gradient SARSA (with Function Approximation)
- Framework: Generalized Policy Iteration