Multi-Armed Bandit
Definition
k-Armed Bandit Problem
A simplified RL problem with only one state. At each time step, you choose one of $k$ actions (“arms”) and receive a reward drawn from a stationary probability distribution that depends on the action selected. The goal: maximize total reward over time.
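The setup can be sketched as a tiny simulator. This is an illustrative sketch, not from the source: the class name, Gaussian rewards, and arm means are all assumptions.

```python
import random

class GaussianBandit:
    """A k-armed bandit: each arm pays a reward drawn from a fixed
    (stationary) Normal(mean_a, 1) distribution. The means are the true
    action values, hidden from the agent; it only sees sampled rewards."""
    def __init__(self, means, seed=0):
        self.means = list(means)        # true action values q*(a)
        self.rng = random.Random(seed)  # seeded for reproducibility
    def pull(self, a):
        """Choose arm a, receive a stochastic reward."""
        return self.rng.gauss(self.means[a], 1.0)
```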
The Name
Named after slot machines (“one-armed bandits”) in casinos. Imagine facing $k$ slot machines, each with an unknown payoff distribution. Which do you play, and how do you decide?
Action Values
True Action Value
$q_*(a) = \mathbb{E}[R_t \mid A_t = a]$: the true expected reward for action $a$. Unknown to the agent.
Sample-Average Estimate
$Q_t(a)$: the average of rewards received for action $a$ so far. Converges to $q_*(a)$ by the Law of Large Numbers.
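A minimal sketch of the sample average converging; the arm's true mean of 1.0 is an assumed illustration, unknown to the agent:

```python
import random

random.seed(0)
true_mean = 1.0            # hypothetical q*(a); the agent never sees this
total, n = 0.0, 0
for _ in range(10_000):
    reward = random.gauss(true_mean, 1.0)  # stationary reward draw
    total += reward
    n += 1
Q = total / n              # sample-average estimate Q_t(a)
# By the Law of Large Numbers, Q approaches true_mean as n grows.
```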
Action Selection Methods
Greedy
Pure exploitation: always select $A_t = \arg\max_a Q_t(a)$. Can get stuck on a suboptimal action.
ε-Greedy
Pick the greedy action with probability $1 - \varepsilon$, a uniformly random action with probability $\varepsilon$.
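A minimal sketch of the selection rule (function name and default $\varepsilon$ are illustrative):

```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon pick a uniformly random arm (explore);
    otherwise pick the arm with the highest current estimate (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(Q))
    return max(range(len(Q)), key=lambda a: Q[a])
```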
UCB
Select $A_t = \arg\max_a \left[ Q_t(a) + c\sqrt{\ln t / N_t(a)} \right]$. Adds an exploration bonus that shrinks as an action is tried more often; $c$ controls the degree of exploration. More principled than ε-greedy: it preferentially explores uncertain actions.
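A sketch of UCB selection, assuming (as is conventional) that an untried arm counts as maximally uncertain and is picked first:

```python
import math

def ucb_select(Q, N, t, c=2.0):
    """UCB action selection: argmax_a  Q[a] + c * sqrt(ln t / N[a]).
    N[a] is how often arm a has been tried; arms with N[a] == 0
    are treated as maximally uncertain and chosen first."""
    for a, n in enumerate(N):
        if n == 0:
            return a
    return max(range(len(Q)),
               key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))
```

Note how a rarely tried arm can win even with a lower estimate: its bonus term dominates until it has been sampled enough.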
Optimistic Initial Values
Initialize $Q_1(a)$ high (e.g., +5 when rewards are around 0). Encourages initial exploration because early actual rewards will be “disappointing,” causing the agent to try other actions.
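A sketch of the effect; the 3 arms, the +5 initialization, and the Normal(0, 1) rewards are illustrative assumptions:

```python
import random

random.seed(0)
k = 3
Q = [5.0] * k        # optimistic initial estimates; true rewards sit near 0
N = [0] * k
tried = set()
for _ in range(20):
    a = max(range(k), key=lambda i: Q[i])   # pure greedy selection
    reward = random.gauss(0.0, 1.0)         # "disappointing" vs. +5
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]          # incremental sample average
    tried.add(a)
# Every arm gets pulled early: each pull drags that arm's inflated
# estimate down, so greedy moves on to the next still-optimistic arm.
```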
Gradient Bandit
Learn a preference $H_t(a)$ for each action, select via softmax:
$\pi_t(a) = \dfrac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$
Update: $H_{t+1}(a) = H_t(a) + \alpha (R_t - \bar{R}_t)\left(\mathbb{1}\{a = A_t\} - \pi_t(a)\right)$, where $\bar{R}_t$ is the average-reward baseline.
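A sketch of one preference update (function names and the step size $\alpha$ are illustrative):

```python
import math

def softmax(H):
    """Action probabilities pi(a) from preferences H (numerically stable)."""
    m = max(H)
    exps = [math.exp(h - m) for h in H]
    s = sum(exps)
    return [e / s for e in exps]

def gradient_bandit_update(H, chosen, reward, baseline, alpha=0.1):
    """H[a] += alpha * (R - baseline) * (1{a == chosen} - pi(a)), for all a."""
    pi = softmax(H)
    return [h + alpha * (reward - baseline) * ((a == chosen) - p)
            for a, (h, p) in enumerate(zip(H, pi))]
```

A reward above the baseline raises the chosen action's preference and lowers all the others; below the baseline, the opposite.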
Incremental Update
Incremental Action-Value Update
General form: $Q_{n+1} = Q_n + \frac{1}{n}(R_n - Q_n)$, i.e. NewEstimate ← OldEstimate + StepSize · (Target − OldEstimate).
For nonstationary problems, use a constant step-size $\alpha \in (0, 1]$ instead of $\frac{1}{n}$: $Q_{n+1} = Q_n + \alpha (R_n - Q_n)$. This gives exponentially decaying weights to old rewards (more weight on recent ones).
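The general form can be sketched as a one-line update; the constant-reward demo below is illustrative:

```python
def incremental_update(estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return estimate + step_size * (target - estimate)

# Constant step-size forgets old rewards exponentially: starting from
# Q = 0 and feeding a constant reward of 1 with alpha = 0.1, the
# estimate approaches 1, with a residual of (1 - alpha)**n after n steps.
```

Passing `step_size = 1/n` recovers the sample-average estimate; a fixed `step_size` tracks a drifting target instead of averaging over all history.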
Relation to Full RL
Bandits as Special Case
A bandit is a 1-state MDP. There’s no state transition, no delayed reward, no sequential planning. It isolates the Exploration vs Exploitation problem.
Connections
- Special case of: Markov Decision Process (single state)
- Core problem: Exploration vs Exploitation
- Selection methods: Epsilon-Greedy Policy, Upper Confidence Bound
- Extended to: Contextual bandits (state-dependent), full MDPs