SARSA

Definition

SARSA

SARSA is an on-policy TD control algorithm. It learns the action-value function for the policy currently being followed. The name comes from the quintuple used in each update: (S, A, R, S', A').

Update Rule

SARSA Update

Q(S,A) ← Q(S,A) + α[R + γ Q(S',A') - Q(S,A)]

where:

  • A' is the actual next action chosen by the current policy (e.g., ε-greedy)
  • Contrast with Q-Learning, which uses max_a Q(S',a) instead (see the sketch below)
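
A minimal sketch of that difference in Python. The Q-table, the transition, and the hyperparameter values are all hypothetical, chosen only to show how the two targets are formed:

One Update Step: SARSA vs Q-Learning Target (Python sketch)
───────────────────────────────────────────────────────────
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 2))          # hypothetical Q-table: 5 states, 2 actions
alpha, gamma, eps = 0.1, 0.99, 0.1   # illustrative hyperparameters

s, a, r, s_next = 3, 1, -1.0, 4      # one observed transition (S, A, R, S')

# SARSA bootstraps on A', the action actually sampled from ε-greedy
if rng.random() < eps:
    a_next = int(rng.integers(2))            # explore
else:
    a_next = int(np.argmax(Q[s_next]))       # exploit
sarsa_target = r + gamma * Q[s_next, a_next]

# Q-Learning bootstraps on the greedy action, whatever is taken next
q_target = r + gamma * np.max(Q[s_next])

Q[s, a] += alpha * (sarsa_target - Q[s, a])  # the SARSA update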

Algorithm

Algorithm: SARSA (On-Policy TD Control)
───────────────────────────────────────
Initialize Q(s,a) arbitrarily for all s,a
  (Q(terminal, ·) = 0)
 
Loop for each episode:
  Initialize S
  Choose A from S using policy derived from Q (e.g., ε-greedy)
  Loop for each step of episode:
    Take action A, observe R, S'
    Choose A' from S' using policy derived from Q (e.g., ε-greedy)
    Q(S,A) ← Q(S,A) + α[R + γ Q(S',A') - Q(S,A)]
    S ← S';  A ← A'
  until S is terminal
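
A direct transcription of this pseudocode into Python, as a sketch assuming a Gymnasium-style discrete environment (reset() returning (obs, info), step() returning (obs, reward, terminated, truncated, info)); the function names and hyperparameter values are illustrative:

SARSA in Python (sketch, Gymnasium-style API)
─────────────────────────────────────────────
import numpy as np

def epsilon_greedy(Q, s, n_actions, eps, rng):
    """Choose an action from state s using the policy derived from Q (ε-greedy)."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def sarsa(env, episodes=500, alpha=0.5, gamma=1.0, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n_actions = env.action_space.n
    Q = np.zeros((env.observation_space.n, n_actions))  # Q(terminal, ·) stays 0

    for _ in range(episodes):
        s, _ = env.reset()
        a = epsilon_greedy(Q, s, n_actions, eps, rng)
        done = False
        while not done:
            s_next, r, terminated, truncated, _ = env.step(a)
            # On-policy: A' comes from the same ε-greedy policy being learned
            a_next = epsilon_greedy(Q, s_next, n_actions, eps, rng)
            target = r + gamma * (0.0 if terminated else Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
            done = terminated or truncated
    return Q

With Gymnasium installed, sarsa(gym.make("CliffWalking-v0")) would learn the safer path discussed below.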

On-Policy = What You Do Is What You Learn About

SARSA updates Q toward the value of the action it actually takes next (A', chosen by the ε-greedy policy). This means it learns the value of the ε-greedy policy, including the cost of occasional random actions. This can lead to “safer” behavior in risky environments.
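
A tiny numeric illustration with hypothetical values: near a cliff edge, SARSA's sampled targets average over the ε-greedy action distribution, so the cliff's penalty leaks into Q even when the greedy action is safe:

Why SARSA Avoids the Cliff Edge (numeric sketch)
────────────────────────────────────────────────
# Hypothetical values for a state one step from the cliff
eps     = 0.1     # exploration rate of the ε-greedy policy
q_safe  = -15.0   # value of continuing along the safe greedy action
q_cliff = -100.0  # value of the exploratory action that falls off

# With 4 actions, ε-greedy picks the fatal action with probability ε/4.
# SARSA's sampled targets average to this ε-greedy expectation:
p_fall   = eps / 4
expected = (1 - p_fall) * q_safe + p_fall * q_cliff
print(expected)   # ≈ -17.1: worse than -15, so a detour looks better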

Cliff Walking Example

Cliff Walking: SARSA vs Q-Learning

In the Cliff Walking gridworld (a cliff along the bottom edge with -100 reward for falling off):

  • Q-Learning learns the optimal path: right along the cliff edge (shortest, but risky with ε-greedy exploration → occasional falls)
  • SARSA learns a safer path: goes around the top (longer but avoids cliff, because it accounts for the ε-probability of falling off during exploration)

Q-Learning’s online performance is worse (more cliff falls during training) even though its learned greedy policy is optimal. SARSA’s learned policy is suboptimal but safer under ε-greedy execution.
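
To make the comparison concrete, here is a hedged sketch that runs both update rules on Gymnasium's CliffWalking-v0 environment (assuming gymnasium is installed; reward is -1 per step and -100 for stepping into the cliff) and compares average online return over the last 100 training episodes. SARSA's should typically be higher (less negative):

Cliff Walking: Online Return Comparison (Python sketch)
───────────────────────────────────────────────────────
import gymnasium as gym
import numpy as np

def train(update, episodes=500, alpha=0.5, gamma=1.0, eps=0.1, seed=0):
    env = gym.make("CliffWalking-v0")
    rng = np.random.default_rng(seed)
    n_actions = env.action_space.n
    Q = np.zeros((env.observation_space.n, n_actions))

    def policy(s):  # ε-greedy over the current Q
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    returns = []
    for _ in range(episodes):
        s, _ = env.reset()
        a, total, done = policy(s), 0.0, False
        while not done:
            s2, r, term, trunc, _ = env.step(a)
            total += r
            a2 = policy(s2)              # next behavior action (both methods)
            if update == "sarsa":        # bootstrap on the sampled A'
                boot = Q[s2, a2]
            else:                        # Q-Learning: greedy bootstrap
                boot = np.max(Q[s2])
            Q[s, a] += alpha * (r + gamma * (0.0 if term else boot) - Q[s, a])
            s, a, done = s2, a2, term or trunc
        returns.append(total)
    return np.mean(returns[-100:])

print("SARSA      avg online return:", train("sarsa"))
print("Q-Learning avg online return:", train("qlearning"))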
