AlphaGo Zero

A system that combines Monte Carlo Tree Search (MCTS) with deep neural networks to play Go (and later Chess and Shogi) at superhuman level. The neural network provides a policy prior to guide tree search and a value estimate to evaluate leaf nodes, replacing random rollouts.

Key Innovation: Neural Network Guided MCTS

Standard MCTS uses random rollouts to estimate values and UCB1 for selection. AlphaGo Zero replaces both with neural networks:

Modified Tree Policy (Selection)

$$a = \arg\max_a \left[ Q(s, a) + c_{\text{puct}} \, P(s, a) \, \frac{\sqrt{N(s)}}{1 + N(s, a)} \right]$$

where:

  • $Q(s, a)$ — average value from simulations through this state-action
  • $P(s, a) = \pi_\theta(a \mid s)$ — neural network policy prior (replaces the uniform prior in UCB1)
  • $N(s)$ — visit count of state $s$
  • $N(s, a)$ — visit count of action $a$ in state $s$
  • $c_{\text{puct}}$ — exploration constant
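The selection rule above can be sketched as follows; the `Edge` field names and the `puct_select` function are illustrative choices, not taken from the paper's pseudocode.

```python
import math
from dataclasses import dataclass

@dataclass
class Edge:
    N: int = 0      # visit count N(s, a)
    W: float = 0.0  # total value accumulated through this edge
    P: float = 0.0  # prior from the policy network, P(s, a)

def puct_select(edges, c_puct=1.5):
    """Return the action maximizing Q(s, a) + U(s, a).

    `edges` maps each legal action to an Edge; N(s) is the sum of
    child visit counts, as in the selection formula above.
    """
    total_n = sum(e.N for e in edges.values())

    def score(item):
        _, e = item
        q = e.W / e.N if e.N else 0.0                      # Q(s, a)
        u = c_puct * e.P * math.sqrt(total_n) / (1 + e.N)  # exploration bonus U(s, a)
        return q + u

    return max(edges.items(), key=score)[0]
```

Note how a rarely visited action with a high prior can outscore a well-explored one: the $1/(1+N(s,a))$ factor decays the bonus as visits accumulate.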

What Neural Networks Provide

| MCTS Component | Standard MCTS | AlphaGo Zero |
| --- | --- | --- |
| Selection | UCB1 (uniform prior) | UCB with prior $\pi_\theta(a \mid s)$ |
| Simulation | Random rollout to terminal state | Value network evaluation at leaf |
| Effect on width | Explores all branches | Focuses on high-prior actions (limits width) |
| Effect on depth | Must roll out to the end | Value function limits depth |
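The "no rollout" simulation row can be illustrated with a leaf-expansion sketch; `LeafNode`, `evaluate_leaf`, and the `net(state)` callable are hypothetical names, assuming a network that returns a prior over moves and a scalar value.

```python
class LeafNode:
    """Minimal tree node: a state, its legal moves, and outgoing edges."""
    def __init__(self, state, moves):
        self.state = state
        self._moves = moves
        self.edges = {}

    def legal_moves(self):
        return self._moves

def evaluate_leaf(node, net):
    """Expand a leaf without a rollout: one network forward pass
    supplies both the children's priors and the leaf value."""
    priors, value = net(node.state)
    for move in node.legal_moves():
        # Children start unvisited; only the prior is filled in.
        node.edges[move] = {"N": 0, "W": 0.0, "P": priors[move]}
    return value  # backed up the search path in place of a rollout outcome
```

The returned value is then propagated up the search path exactly where a random-rollout result would have gone, which is what bounds the search depth.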

Training

The neural network is trained through self-play:

  1. Play games using MCTS (with current neural network) to select moves
  2. Collect training data: for each position, record:
    • State $s$
    • MCTS search probabilities $\pi$ (based on visit counts)
    • Game outcome $z$
  3. Train neural network:
    • Policy head: Cross-entropy loss — train to match MCTS output
    • Value head: MSE loss — train to predict game outcome (Monte Carlo target)
  4. Evaluate: check if new network beats previous version
  5. Repeat
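The two targets in step 3 can be sketched as a single per-position loss (the paper also adds an L2 weight-decay term, omitted here); `alphazero_loss` is an illustrative name.

```python
import numpy as np

def alphazero_loss(p, v, pi, z):
    """Combined loss for one self-play position.

    p  : predicted move probabilities from the policy head
    v  : predicted value from the value head (scalar in [-1, 1])
    pi : MCTS visit-count distribution (the improved policy target)
    z  : final game outcome from this player's perspective (+1 / -1)
    """
    value_loss = (z - v) ** 2                      # MSE toward the game outcome
    policy_loss = -np.sum(pi * np.log(p + 1e-12))  # cross-entropy toward pi
    return value_loss + policy_loss
```

Minimizing the cross-entropy term is what "distills" the MCTS-improved policy back into the network, while the MSE term regresses the value head onto the Monte Carlo outcome target.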

The Virtuous Cycle

Better neural networks → better MCTS search → better training targets → even better neural networks. The policy network learns to “distill” the improved policy that MCTS computes, and the value network learns from game outcomes to evaluate positions more accurately.

Architecture

A single neural network with two heads:

  • Input: Board state
  • Shared body: Deep residual network
  • Policy head: Outputs $p$ — a probability distribution over all legal moves
  • Value head: Outputs $v \in [-1, 1]$ — estimated expected game outcome from the current player's perspective
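A minimal sketch of the two-headed architecture, with a tiny MLP standing in for the deep residual body; all dimensions, parameter names, and the `two_head_forward` function are toy assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_head_forward(board, params):
    """One forward pass through a shared body and two heads."""
    h = np.tanh(board @ params["body_w"])   # shared body features
    logits = h @ params["policy_w"]         # policy head: one logit per move
    p = np.exp(logits - logits.max())
    p /= p.sum()                            # softmax over candidate moves
    v = np.tanh(h @ params["value_w"])      # value head: scalar in (-1, 1)
    return p, v

# Toy dimensions: 9 board features, 16 hidden units, 9 candidate moves.
params = {
    "body_w": rng.normal(size=(9, 16)),
    "policy_w": rng.normal(size=(16, 9)),
    "value_w": rng.normal(size=(16,)),
}
```

Sharing one body between the heads means the same learned features serve both move selection and position evaluation, and one forward pass yields both outputs during search.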

Key Properties

  • No human knowledge: learns entirely from self-play (no expert games)
  • No random rollouts: neural network evaluation replaces simulation phase
  • Search still matters: the neural network guides but doesn’t replace MCTS — MCTS provides the actual decision and generates improved training targets
  • Cost: computationally very expensive (thousands of TPUs for training)

Connections

Appears In