Upper Confidence Bound (UCB)

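UCB Action Selection

$$A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

where:

  • $Q_t(a)$: estimated value of action $a$ (exploitation term)
  • $c \sqrt{\ln t / N_t(a)}$: exploration bonus (decreases as action $a$ is tried more)
  • $N_t(a)$: number of times action $a$ has been selected before time step $t$
  • $c > 0$: controls degree of exploration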
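The selection rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `ucb_select`, the list-based inputs, and the default `c=2.0` are assumptions made here. Picking any untried action first is a common convention for avoiding division by zero, also assumed here.

```python
import math

def ucb_select(q_values, counts, t, c=2.0):
    """Return the index of the action with the highest UCB score.

    q_values[a] is the current value estimate for action a,
    counts[a] is how many times action a has been selected,
    t is the current time step (t >= 1), c scales the exploration bonus.
    """
    # Assumption: an untried action is treated as maximally uncertain
    # and is selected before any score is computed.
    for a, n in enumerate(counts):
        if n == 0:
            return a

    # Score = exploitation term + exploration bonus.
    scores = [
        q + c * math.sqrt(math.log(t) / n)
        for q, n in zip(q_values, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

With `c=0` the bonus vanishes and the rule reduces to greedy selection on the value estimates.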

Optimism in the Face of Uncertainty

UCB adds a bonus to actions that haven’t been tried much. The less you know about an action, the higher its bonus. As you try it more, the bonus shrinks. As a result, UCB systematically explores uncertain options before settling on the action with the highest estimated value.
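To see the shrinking bonus concretely, the short sketch below evaluates the standard UCB bonus term c·sqrt(ln t / n) for an action tried n times at a fixed time step. The helper name `exploration_bonus` and the values `t = 1000` and `c = 2.0` are illustrative assumptions.

```python
import math

def exploration_bonus(t, n, c=2.0):
    """Bonus added to the value estimate of an action tried n times by step t."""
    return c * math.sqrt(math.log(t) / n)

t = 1000  # assumed time step, held fixed for the comparison
bonuses = {n: exploration_bonus(t, n) for n in (1, 10, 100, 1000)}

# The bonus falls as the selection count n grows.
for n, bonus in bonuses.items():
    print(f"n={n:5d}  bonus={bonus:.3f}")
```

For a fixed count the bonus also grows slowly with `t` through the logarithm, so an action that has been ignored for a long time eventually gets reconsidered.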

UCB is more principled than the Epsilon-Greedy Policy: it preferentially explores uncertain actions rather than exploring uniformly at random.
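The contrast can be illustrated with a toy comparison. In this sketch, arms 1 and 2 have the same value estimate, but arm 2 has been tried far less often. All names and numbers (`epsilon_greedy_select`, `ucb_scores`, the estimates, the counts, `c=2.0`, `epsilon=0.1`) are hypothetical and chosen only for illustration.

```python
import math
import random

def epsilon_greedy_select(q_values, epsilon, rng):
    # With probability epsilon, explore: every action is equally likely,
    # regardless of how often it has already been tried.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Otherwise exploit the current best estimate.
    return max(range(len(q_values)), key=q_values.__getitem__)

def ucb_scores(q_values, counts, t, c=2.0):
    # Value estimate plus a bonus that depends on each action's own count.
    return [q + c * math.sqrt(math.log(t) / n) for q, n in zip(q_values, counts)]

q_values = [0.6, 0.5, 0.5]   # hypothetical value estimates
counts = [50, 45, 5]         # arm 2 is the least explored
t = sum(counts)

scores = ucb_scores(q_values, counts, t)
ucb_choice = max(range(len(scores)), key=scores.__getitem__)

rng = random.Random(0)       # seeded only to make the run repeatable
eg_choice = epsilon_greedy_select(q_values, 0.1, rng)

print("UCB scores:", [round(s, 3) for s in scores])
print("UCB picks arm", ucb_choice)
print("epsilon-greedy picks arm", eg_choice)
```

Here UCB ranks the rarely tried arm 2 highest because its bonus dominates, while epsilon-greedy's exploratory step never looks at the counts and would pick arm 1 and arm 2 with equal probability.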

Appears In