Multi-Armed Bandit

Definition

$k$-Armed Bandit Problem

A simplified RL problem with only one state. At each time step, you choose one of $k$ actions (“arms”), and receive a reward drawn from a stationary probability distribution that depends on the action selected. The goal: maximize total reward over time.
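A minimal sketch of this setup as a testbed, assuming Gaussian reward distributions (the arm count, means, and noise scale are all illustrative choices, not part of the definition):

```python
import random

# Toy k-armed bandit testbed: each arm's reward is drawn from a Gaussian
# with a fixed but unknown mean (the true value q*(a)).
K = 10
random.seed(0)
true_means = [random.gauss(0, 1) for _ in range(K)]  # stationary, hidden from the agent

def pull(arm):
    """Reward for pulling `arm`: noise around its true mean."""
    return random.gauss(true_means[arm], 1)

# Sanity check: pulling the (secretly) best arm 1000 times
# yields an average reward close to its true mean.
best = max(range(K), key=lambda a: true_means[a])
total = sum(pull(best) for _ in range(1000))
print(round(total / 1000, 2))
```

The agent only sees the rewards from `pull`, never `true_means`; everything below is about estimating those means from samples.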

The Name

Named after slot machines (“one-armed bandits”) in casinos. Imagine facing $k$ slot machines, each with an unknown payoff distribution. Which do you play, and how do you decide?

Action Values

True Action Value

$q_*(a) = \mathbb{E}[R_t \mid A_t = a]$, the true expected reward for action $a$. Unknown to the agent.

Sample-Average Estimate

$Q_t(a)$ = average of rewards received for action $a$ so far. Converges to $q_*(a)$ by the Law of Large Numbers.
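The estimate itself is just an arithmetic mean over the rewards observed for that one action; a tiny sketch with made-up reward values:

```python
# Sample-average estimate Q(a) for a single action (reward values illustrative).
rewards_for_a = [1.0, 0.0, 2.0, 1.0]           # rewards received when action a was taken
Q_a = sum(rewards_for_a) / len(rewards_for_a)  # estimate of q*(a) from samples
print(Q_a)  # 1.0
```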

Action Selection Methods

Greedy

Always select $A_t = \arg\max_a Q_t(a)$. Pure exploitation; can get stuck on a suboptimal action.

ε-Greedy

Pick the greedy action with probability $1-\varepsilon$, a random action with probability $\varepsilon$.
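A sketch of the selection rule (the `Q` values and ε here are illustrative):

```python
import random

def epsilon_greedy(Q, epsilon, rng=random):
    """Pick argmax Q with prob 1-ε, a uniformly random arm with prob ε."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q))               # explore
    return max(range(len(Q)), key=lambda a: Q[a])  # exploit

random.seed(1)
Q = [0.2, 1.5, 0.7]
picks = [epsilon_greedy(Q, 0.1) for _ in range(1000)]
# Arm 1 dominates but every arm keeps a nonzero chance:
print(picks.count(1) / 1000)  # expected ≈ 0.9 + 0.1/3
```

Note the random branch can still land on the greedy arm, which is why the expected greedy fraction is $1-\varepsilon + \varepsilon/k$, not just $1-\varepsilon$.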

UCB

Adds an exploration bonus that shrinks as an action is tried more often: $A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$. $c$ controls the degree of exploration. More principled than ε-greedy, since it preferentially explores uncertain actions.
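A sketch of that selection rule, using the standard convention that untried arms get priority (the `Q`, `N`, and `c` values are illustrative):

```python
import math

def ucb_select(Q, N, t, c=2.0):
    """Pick argmax of Q(a) + c*sqrt(ln t / N(a)); untried arms go first."""
    scores = [
        Q[a] + (c * math.sqrt(math.log(t) / N[a]) if N[a] > 0 else float("inf"))
        for a in range(len(Q))
    ]
    return max(range(len(Q)), key=lambda a: scores[a])

Q = [0.5, 0.6]   # similar value estimates...
N = [100, 2]     # ...but arm 1 is barely tried, so its bonus is large
print(ucb_select(Q, N, t=102))  # 1
```

The bonus term makes under-sampled arms look temporarily attractive, so exploration is directed at exactly the actions whose estimates are least trustworthy.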

Optimistic Initial Values

Initialize $Q_1(a)$ high (e.g., $+5$ when rewards are around 0). Encourages initial exploration because early actual rewards will be “disappointing,” causing the agent to try other actions.
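A small sketch of this effect with deterministic toy rewards (the rewards, step size, and arm count are illustrative). Even a purely greedy agent cycles through every arm, because each pull drags that arm's optimistic estimate down below the others:

```python
# Optimistic initial values drive exploration under pure greedy selection.
K = 3
Q = [5.0] * K                        # optimistic: true rewards are around 0
alpha = 0.1
rewards = {0: 0.1, 1: 0.5, 2: 0.2}   # deterministic toy rewards near 0
tried = set()

for _ in range(10):
    a = max(range(K), key=lambda i: Q[i])  # greedy, no ε at all
    tried.add(a)
    Q[a] += alpha * (rewards[a] - Q[a])    # "disappointing" reward lowers Q[a]

print(sorted(tried))  # [0, 1, 2]
```

The trick only helps with *initial* exploration; once the optimism wears off, a stationary greedy agent stops exploring.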

Gradient Bandit

Learn a preference $H_t(a)$ for each action, select via softmax:

$$\pi_t(a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$$

Update:

$$H_{t+1}(a) = H_t(a) + \alpha \, (R_t - \bar{R}_t)\big(\mathbb{1}[a = A_t] - \pi_t(a)\big)$$

where $\bar{R}_t$ is the average-reward baseline.
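A sketch of the softmax selection probabilities and one preference update (α, the reward, and the baseline value are illustrative):

```python
import math

def softmax(H):
    """π(a) ∝ exp(H(a))."""
    exps = [math.exp(h) for h in H]
    z = sum(exps)
    return [e / z for e in exps]

def gradient_bandit_update(H, a_taken, reward, baseline, alpha=0.1):
    """H(a) += α·(R − baseline)·(1[a == A_t] − π(a)) for every arm a."""
    pi = softmax(H)
    return [
        h + alpha * (reward - baseline) * ((1.0 if a == a_taken else 0.0) - pi[a])
        for a, h in enumerate(H)
    ]

H = [0.0, 0.0, 0.0]
H = gradient_bandit_update(H, a_taken=1, reward=1.0, baseline=0.0)
print(H)  # arm 1's preference rises, the others fall
```

Since the update terms sum to zero across arms, only *relative* preferences matter, which is exactly what softmax consumes.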

Incremental Update

Incremental Action-Value Update

General form:

$$Q_{n+1} = Q_n + \frac{1}{n}\big(R_n - Q_n\big)$$

i.e., NewEstimate ← OldEstimate + StepSize × (Target − OldEstimate). For nonstationary problems, use a constant step size $\alpha$ instead of $\frac{1}{n}$:

$$Q_{n+1} = Q_n + \alpha\big(R_n - Q_n\big)$$

This gives exponentially decaying weights to old rewards (more weight on recent ones).
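A sketch contrasting the two step-size choices on a reward stream whose distribution shifts halfway through (the reward values and α are illustrative):

```python
def incremental(rewards, alpha=None):
    """Q += step * (R - Q); step is 1/n by default, or a constant alpha."""
    Q, n = 0.0, 0
    for r in rewards:
        n += 1
        step = alpha if alpha is not None else 1.0 / n
        Q += step * (r - Q)
    return Q

rewards = [0.0] * 50 + [1.0] * 50   # the arm's payoff jumps partway through

print(round(incremental(rewards), 3))            # 1/n: plain sample average, 0.5
print(round(incremental(rewards, alpha=0.1), 3)) # constant α: ≈ 1.0, tracks the shift
```

The $1/n$ schedule weighs all 100 rewards equally and lands on the stale average, while constant $\alpha$ forgets the old regime exponentially fast.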

Relation to Full RL

Bandits as Special Case

A bandit is a 1-state MDP. There’s no state transition, no delayed reward, no sequential planning. It isolates the Exploration vs Exploitation problem.
