AlphaGo Zero

A system that combines Monte Carlo Tree Search (MCTS) with a deep neural network to play Go at superhuman level; its successor, AlphaZero, extended the same approach to chess and shogi. The neural network provides a policy prior $π_{θ} (a ∣ s)$ to guide tree search and a value estimate $V_{θ} (s)$ to evaluate leaf nodes, replacing random rollouts.

Key Innovation: Neural Network Guided MCTS

Standard MCTS uses random rollouts to estimate values and UCB1 for selection. AlphaGo Zero replaces both with neural networks:

Modified Tree Policy (Selection)

$π_{tree} (s) = \arg\max_{a} \left[ Q (s, a) + c \cdot π_{θ} (a ∣ s) \cdot \frac{\sqrt{N (s)}}{1 + N (s, a)} \right]$

where:

$Q (s, a)$ — average value from simulations through this state-action

$π_{θ} (a ∣ s)$ — neural network policy prior (replaces uniform prior in UCB1)

$N (s)$ — visit count of state $s$

$N (s, a)$ — visit count of action $a$ in state $s$

$c$ — exploration constant
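A minimal sketch of this selection rule, using the square-root form of the exploration term from Silver et al. (2017). The function name `puct_select`, the dict-based layout, and the default `c=1.5` are assumptions made for illustration, and $N(s)$ is taken here as the sum of the $N(s, a)$.

```python
import math

def puct_select(q, prior, n_sa, c=1.5):
    """Pick the action maximizing Q(s,a) + c * prior(a) * sqrt(N(s)) / (1 + N(s,a)).

    q, prior and n_sa are dicts keyed by action (hypothetical layout).
    """
    n_s = sum(n_sa.values())  # N(s): total visits through this state

    def score(a):
        exploration = c * prior[a] * math.sqrt(n_s) / (1 + n_sa[a])
        return q[a] + exploration

    return max(prior, key=score)

# Action "a" has the best average value so far, but "b" has a much
# higher prior and few visits, so the exploration bonus favours it.
q = {"a": 0.5, "b": 0.1, "c": 0.0}
prior = {"a": 0.2, "b": 0.7, "c": 0.1}
n_sa = {"a": 10, "b": 2, "c": 0}
print(puct_select(q, prior, n_sa))  # b
```

As $N(s, a)$ grows, the bonus for that action shrinks and the choice is increasingly driven by $Q(s, a)$.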

What Neural Networks Provide

MCTS Component	Standard MCTS	AlphaGo Zero
Selection	UCB1 (uniform prior)	UCB-style rule with $π_{θ} (a ∣ s)$ as prior
Simulation	Random rollout to terminal state	$V_{θ} (s)$ evaluation at leaf
Effect on width	Explores all branches	Focuses on high-prior actions (limits width)
Effect on depth	Must rollout to end	Value function limits depth
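The right-hand column can be sketched as a single search simulation: select with the prior-weighted rule, evaluate the leaf with the network instead of rolling out, then back the value up. Everything below is an illustrative assumption rather than the published implementation: the `Node` class, the toy game (take 1 or 2 stones, taking the last stone wins), and the `evaluate` stub that stands in for the network by returning uniform priors and a neutral value.

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior       # pi_theta(a|s) of the move leading to this node
        self.visits = 0          # visit count
        self.value_sum = 0.0     # backed-up values, seen by the player who moved here
        self.children = {}       # action -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def legal_moves(stones):
    # Toy game (assumption): take 1 or 2 stones; taking the last stone wins.
    return [m for m in (1, 2) if m <= stones]

def evaluate(stones):
    # Stand-in for the network (assumption): uniform priors, neutral value.
    moves = legal_moves(stones)
    return {m: 1.0 / len(moves) for m in moves}, 0.0

def select_child(node, c=1.5):
    # Prior-weighted UCB: high-prior, rarely visited children get a bonus.
    def score(item):
        child = item[1]
        return child.q() + c * child.prior * math.sqrt(node.visits) / (1 + child.visits)
    return max(node.children.items(), key=score)

def simulate(root, stones):
    node, path = root, [root]
    while node.children:                      # selection
        move, node = select_child(node)
        stones -= move
        path.append(node)
    if stones == 0:
        value = -1.0                          # player to move has lost
    else:
        priors, value = evaluate(stones)      # leaf evaluation, no rollout
        node.children = {m: Node(p) for m, p in priors.items()}  # expansion
    for visited in reversed(path):            # backup, flipping sides each ply
        value = -value
        visited.value_sum += value
        visited.visits += 1

def run_search(stones, num_simulations):
    root = Node(prior=1.0)
    for _ in range(num_simulations):
        simulate(root, stones)
    return root

root = run_search(stones=4, num_simulations=200)
best_move = max(root.children, key=lambda m: root.children[m].visits)
```

With 4 stones, the most-visited root move is to take 1 stone, which leaves the opponent in a losing position. The search finds this even with an uninformative `evaluate`, because terminal results are backed up exactly; a trained network would mainly reduce the number of simulations needed.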

Training

The neural network is trained through self-play:

1. Play games using MCTS (with the current neural network) to select moves
2. Collect training data: for each position, record:
- State $s$
- MCTS search probabilities $π_{MCTS}$ (based on visit counts)
- Game outcome $z \in \{-1, +1\}$, from the perspective of the player to move in $s$
3. Train neural network:
- Policy head: cross-entropy loss, training $π_{θ} (a ∣ s)$ to match the MCTS output $π_{MCTS}$
- Value head: MSE loss, training $V_{θ} (s)$ to predict the game outcome $z$ (Monte Carlo target)
4. Evaluate: the new network replaces the current best only if it wins an evaluation match by a clear margin (more than 55% of games in the 2017 paper)
5. Repeat
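The two training targets can be sketched for a single position as follows. The function names, the dict layout, and the `temperature` parameter are assumptions for illustration; the weight-regularization term used in the paper is omitted.

```python
import math

def search_probabilities(visit_counts, temperature=1.0):
    """pi_MCTS(a) proportional to N(s,a) ** (1 / temperature)."""
    powered = {a: n ** (1.0 / temperature) for a, n in visit_counts.items()}
    total = sum(powered.values())
    return {a: p / total for a, p in powered.items()}

def training_loss(policy, value, pi_mcts, z, eps=1e-12):
    """Cross-entropy of the policy against pi_MCTS plus squared value error."""
    cross_entropy = -sum(pi_mcts[a] * math.log(policy[a] + eps) for a in pi_mcts)
    value_error = (z - value) ** 2
    return cross_entropy + value_error

# One recorded position: the search visited "a" three times as often as "b",
# and the player to move went on to win (z = +1).
pi_mcts = search_probabilities({"a": 30, "b": 10})
uniform_policy = {"a": 0.5, "b": 0.5}
matched_policy = {"a": 0.75, "b": 0.25}
loss_before = training_loss(uniform_policy, 0.0, pi_mcts, z=1.0)
loss_after = training_loss(matched_policy, 1.0, pi_mcts, z=1.0)
```

Here `loss_after` is lower than `loss_before`: the cross-entropy term is minimized when the policy equals $π_{MCTS}$, and the value term vanishes when the prediction equals $z$.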

The Virtuous Cycle

Better neural networks → better MCTS search → better training targets → even better neural networks. The policy network learns to “distill” the improved policy that MCTS computes, and the value network learns from game outcomes to evaluate positions more accurately.

Architecture

A single neural network with two heads:

Input: Board state $s$
Shared body: Deep residual network
Policy head: Outputs $π_{θ} (a ∣ s)$, a probability distribution over moves
Value head: Outputs $V_{θ} (s) \in [-1, 1]$, the expected game outcome from the current player's perspective (+1 for a win, -1 for a loss), not a probability
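The output ranges of the two heads follow from their final nonlinearities: a softmax for the policy and a tanh for the value. The sketch below covers only that last step. The function name `two_head_outputs` and the masking of illegal moves inside the function are assumptions for illustration, and the residual body that would produce the raw head outputs is not modelled.

```python
import math

def two_head_outputs(policy_logits, value_logit, legal_mask):
    """Map raw head outputs to pi_theta(.|s) and V_theta(s).

    Assumes at least one move is legal.
    """
    # Illegal moves get -inf so that their softmax probability is exactly 0.
    masked = [l if ok else -math.inf for l, ok in zip(policy_logits, legal_mask)]
    top = max(masked)
    exps = [math.exp(l - top) for l in masked]  # subtract max for stability
    total = sum(exps)
    policy = [e / total for e in exps]          # non-negative, sums to 1
    value = math.tanh(value_logit)              # squashed into [-1, 1]
    return policy, value

policy, value = two_head_outputs([1.0, 2.0, 3.0], 0.0, [True, False, True])
```

In this call the second move is illegal, so `policy[1]` is 0 and the remaining probability mass is split between the first and third moves.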

Key Properties

No human knowledge: learns entirely from self-play (no expert games)
No random rollouts: neural network evaluation replaces simulation phase
Search still matters: the neural network guides but doesn’t replace MCTS — MCTS provides the actual decision and generates improved training targets
Cost: computationally very expensive (the 2017 paper reports 4.9 million self-play games, with 1,600 MCTS simulations per move, for its initial three-day training run)

Connections

Builds on Monte Carlo Tree Search (MCTS) — the core planning algorithm
Uses Upper Confidence Bound ideas for tree selection
Part of Model-Based Reinforcement Learning — requires a game model (known rules)
Demonstrates the power of combining Deep Reinforcement Learning with planning

Appears In

RL-L12 - Model-Based RL
RL-Book Ch16 - Applications and Case Studies (§16.6)
Silver et al., “Mastering the game of Go without human knowledge” (2017)
