Multi-Armed Bandit
Definition
k-Armed Bandit Problem
A simplified RL problem with only one state. At each time step, you choose one of $k$ actions (“arms”) and receive a reward drawn from a stationary probability distribution that depends on the action selected. The goal: maximize total reward over time.
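The setup can be sketched as a tiny simulator. This is an illustrative sketch, not from the source: the class name, Gaussian rewards, and arm means are all assumptions.

```python
import random

class GaussianBandit:
    """A k-armed bandit: each arm pays a reward drawn from a fixed
    (stationary) Normal(mean_a, 1) distribution. The means are the true
    action values, hidden from the agent; it only sees sampled rewards."""
    def __init__(self, means, seed=0):
        self.means = list(means)        # true action values q*(a)
        self.rng = random.Random(seed)  # seeded for reproducibility
    def pull(self, a):
        """Choose arm a, receive a stochastic reward."""
        return self.rng.gauss(self.means[a], 1.0)
```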
The Name
Named after slot machines (“one-armed bandits”) in casinos. Imagine facing $k$ slot machines, each with an unknown payoff distribution. Which do you play, and how do you decide?
Action Values
True Action Value
$q_*(a) = \mathbb{E}[R_t \mid A_t = a]$: the true expected reward for action $a$. Unknown to the agent.
Sample-Average Estimate
$Q_t(a)$: the average of rewards received for action $a$ so far. Converges to $q_*(a)$ by the Law of Large Numbers.
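A minimal sketch of the sample average converging; the arm's true mean of 1.0 is an assumed illustration, unknown to the agent:

```python
import random

random.seed(0)
true_mean = 1.0            # hypothetical q*(a); the agent never sees this
total, n = 0.0, 0
for _ in range(10_000):
    reward = random.gauss(true_mean, 1.0)  # stationary reward draw
    total += reward
    n += 1
Q = total / n              # sample-average estimate Q_t(a)
# By the Law of Large Numbers, Q approaches true_mean as n grows.
```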
Action Selection Methods
Greedy
Pure exploitation: always select $A_t = \arg\max_a Q_t(a)$. Can get stuck on a suboptimal action.
ε-Greedy
Pick the greedy action with probability $1 - \varepsilon$, a uniformly random action with probability $\varepsilon$.
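A minimal sketch of the selection rule (function name and default $\varepsilon$ are illustrative):

```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon pick a uniformly random arm (explore);
    otherwise pick the arm with the highest current estimate (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(Q))
    return max(range(len(Q)), key=lambda a: Q[a])
```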
UCB
Select $A_t = \arg\max_a \left[ Q_t(a) + c\sqrt{\ln t / N_t(a)} \right]$. Adds an exploration bonus that shrinks as an action is tried more often; $c$ controls the degree of exploration. More principled than ε-greedy: it preferentially explores uncertain actions.
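A sketch of UCB selection, assuming (as is conventional) that an untried arm counts as maximally uncertain and is picked first:

```python
import math

def ucb_select(Q, N, t, c=2.0):
    """UCB action selection: argmax_a  Q[a] + c * sqrt(ln t / N[a]).
    N[a] is how often arm a has been tried; arms with N[a] == 0
    are treated as maximally uncertain and chosen first."""
    for a, n in enumerate(N):
        if n == 0:
            return a
    return max(range(len(Q)),
               key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))
```

Note how a rarely tried arm can win even with a lower estimate: its bonus term dominates until it has been sampled enough.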
Optimistic Initial Values
Initialize $Q_1(a)$ high (e.g., +5 when rewards are around 0). Encourages initial exploration because early actual rewards will be “disappointing,” causing the agent to try other actions.
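A sketch of the effect; the 3 arms, the +5 initialization, and the Normal(0, 1) rewards are illustrative assumptions:

```python
import random

random.seed(0)
k = 3
Q = [5.0] * k        # optimistic initial estimates; true rewards sit near 0
N = [0] * k
tried = set()
for _ in range(20):
    a = max(range(k), key=lambda i: Q[i])   # pure greedy selection
    reward = random.gauss(0.0, 1.0)         # "disappointing" vs. +5
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]          # incremental sample average
    tried.add(a)
# Every arm gets pulled early: each pull drags that arm's inflated
# estimate down, so greedy moves on to the next still-optimistic arm.
```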
Gradient Bandit
Learn a preference $H_t(a)$ for each action, select via softmax:
$\pi_t(a) = \dfrac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$
Update: $H_{t+1}(a) = H_t(a) + \alpha (R_t - \bar{R}_t)\left(\mathbb{1}\{a = A_t\} - \pi_t(a)\right)$, where $\bar{R}_t$ is the average-reward baseline.
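A sketch of one preference update (function names and the step size $\alpha$ are illustrative):

```python
import math

def softmax(H):
    """Action probabilities pi(a) from preferences H (numerically stable)."""
    m = max(H)
    exps = [math.exp(h - m) for h in H]
    s = sum(exps)
    return [e / s for e in exps]

def gradient_bandit_update(H, chosen, reward, baseline, alpha=0.1):
    """H[a] += alpha * (R - baseline) * (1{a == chosen} - pi(a)), for all a."""
    pi = softmax(H)
    return [h + alpha * (reward - baseline) * ((a == chosen) - p)
            for a, (h, p) in enumerate(zip(H, pi))]
```

A reward above the baseline raises the chosen action's preference and lowers all the others; below the baseline, the opposite.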
Incremental Update
Incremental Action-Value Update
General form: $Q_{n+1} = Q_n + \frac{1}{n}(R_n - Q_n)$, i.e. NewEstimate ← OldEstimate + StepSize · (Target − OldEstimate).
For nonstationary problems, use a constant step-size $\alpha \in (0, 1]$ instead of $\frac{1}{n}$: $Q_{n+1} = Q_n + \alpha (R_n - Q_n)$. This gives exponentially decaying weights to old rewards (more weight on recent ones).
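The general form can be sketched as a one-line update; the constant-reward demo below is illustrative:

```python
def incremental_update(estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return estimate + step_size * (target - estimate)

# Constant step-size forgets old rewards exponentially: starting from
# Q = 0 and feeding a constant reward of 1 with alpha = 0.1, the
# estimate approaches 1, with a residual of (1 - alpha)**n after n steps.
```

Passing `step_size = 1/n` recovers the sample-average estimate; a fixed `step_size` tracks a drifting target instead of averaging over all history.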
Relation to Full RL
Bandits as Special Case
A bandit is a 1-state MDP. There’s no state transition, no delayed reward, no sequential planning. It isolates the Exploration vs Exploitation problem.
Connections
- Special case of: Markov Decision Process (single state)
- Core problem: Exploration vs Exploitation
- Selection methods: Epsilon-Greedy Policy, Upper Confidence Bound
- Extended to: Contextual bandits (state-dependent), full MDPs