Action-Value Methods

Definition

Action-Value Methods

Action-value methods estimate the value of each action — the expected reward (or return) of selecting it — and then use those estimates to drive action selection. The policy is implicit: it is derived from the value estimates (e.g. greedy or ε-greedy w.r.t. ), rather than being parameterised and learned directly. They are the canonical approach to the Multi-Armed Bandit problem and the conceptual ancestor of SARSA / Q-Learning in the full MDP setting.

Intuition

Estimate first, act second

The core loop is: keep a running estimate of how good each action is, then prefer the action that looks best (while still exploring). You never store a policy explicitly — you store numbers , and the policy is just “look at the numbers and pick”.

This is the natural contrast to Policy Gradient Methods: action-value methods learn and read off a policy; policy-gradient methods skip the values and adjust the policy parameters directly. In a bandit, “value of action ” is simply the expected reward ; in a full MDP it becomes the action-value function .

Mathematical Formulation

The true value of an action is its expected reward:

The sample-average estimate averages the rewards actually received for :

To avoid storing all past rewards, this is computed with the incremental update rule:

which has the general “error-correction” form

where:

  • — action selected at step ; — reward received at step
  • — true (unknown) expected reward of action
  • — current estimate of at step
  • — indicator, if action was taken at step , else
  • — number of times has been selected
  • — step size; the -th selection of the action uses step
  • — the error between the latest reward (target) and the current estimate

By the Law of Large Numbers, as each action is sampled infinitely often. For nonstationary problems, replace with a constant step size :

giving an exponential recency-weighted average (recent rewards weighted more heavily):

Key Properties / Variants

  • Selection rule is separate from estimation. Estimation gives ; a selection rule turns it into behaviour:
    • Greedy: — pure exploitation, can lock onto a suboptimal arm.
    • ε-greedy: greedy with prob. , uniform-random with prob. ; guarantees every action is sampled infinitely often so .
    • UCB: — directs exploration toward uncertain actions instead of exploring blindly.
    • Optimistic Initial Values: set high so early rewards “disappoint” and force trial of all actions; only aids initial exploration.
  • Step-size convergence (stochastic approximation): estimates converge w.p. 1 iff and . Sample-average () satisfies both; constant violates the second on purpose, so it keeps tracking a moving target.
  • Contrast with preference-based methods: Gradient bandit algorithms learn preferences via a softmax and stochastic gradient ascent — they do not estimate action values, so they are not action-value methods (they are the bandit-level analogue of policy gradient).
  • Scaling up: in a full MDP the same “estimate values, derive policy” principle gives TD control methods SARSA (on-policy) and Q-Learning (off-policy), where the target becomes a bootstrapped return rather than a single reward.
Algorithm: Simple Bandit (ε-greedy Action-Value Method)
─────────────────────────────────────────────────────────
Initialize, for a = 1..k:
    Q(a) ← 0          # value estimate
    N(a) ← 0          # selection count
 
Loop forever:
    # --- Action selection (policy derived from Q) ---
    With probability ε:   A ← random action
    Otherwise:            A ← argmax_a Q(a)   (ties broken randomly)
 
    # --- Take action, observe reward ---
    R ← bandit(A)
 
    # --- Incremental value update ---
    N(A) ← N(A) + 1
    Q(A) ← Q(A) + (1 / N(A)) · [R − Q(A)]

Connections

Appears In