# AlphaGo Zero: An overview of the algorithm In this post I go through the algorithms presented in the groundbreaking AlphaGo Zero paper using pseudocode. The objective is to provide a high-level idea of what the model does.

#### Why AlphaGo Zero matters

Last week, Google DeepMind published their final iteration of AlphaGo, AlphaGo Zero. To say its performance is remarkable is an understatement. AlphaGo Zero made two breakthroughs:

1. It was given no information other than the rules of the game.
1. Previous versions of AlphaGo were given a large number of human games.
2. It took a much shorter period of time to train and was trained on a single machine.
1. It beat AlphaGo after training for three days and beat AlphaGo Master after training for only forty days (vs months).
2. Note that it was trained with much less computational power than AlphaGo Master.

That is, it was able to achieve a level of performance way above current human world champions by training from scratch with no data apart from the rules of the game. Go is considered the most complex board game we’ve got. It’s much harder for machines to do well in Go than in chess (which is already hard).

In this post I will describe three algorithms:

1. The core reinforcement learning algorithm, which makes heavy use of a neural network $f_{\theta}$ guided by Monte Carlo Tree Search,
2. The Monte Carlo Tree Search (MCTS) algorithm, and
3. How they train the neural network $f_{\theta}$.

#### 1. Reinforcement Learning algorithm At its core, the model chooses the move recommended by Monte Carlo Tree Search guided by a neural network: $\text{for each board position s:} \\ \indent\text{MCTS guided by neural network }f_{\theta} \rightarrow \Pr\vec{\pi}\text{ of playing each move} \\ \indent\text{choose move }a = argmax \vec{\pi}$

The Monte Carlo Tree Search serves as a policy improvement operator. That is, the actions chosen with MCTS are claimed to be much better than the direct recommendations of the neural network $f_{\theta}$.

This high-level description abstracts away most of the information – we will now delve into MCTS and the training of the neural network to see how the model learns.

#### 2. Monte Carlo Tree Search (MCTS) This section is more complex, so I will explain the algorithm in words before showing the pseudocode. Here’s the outline:

1. Choose a move that maximises $Q(s,a) + U(s,a)$, where $Q(s,a)$ is the action value and $U(s,a)\propto \frac{P(s,a)}{1+N(s,a)}$.
1. Intuition:
1. $Q(s,a)$ is the mean action value.
2. $U(s,a)$: If two moves have an equal action value, we choose the one that we’ve visited less often than we’d have expected. This encourages exploration.
2. Execute that move. We are now at a different state (board position).
3. Repeat the above until you reach a leaf node. Call this state $s'$.
1. A leaf node is a position we haven’t explored.
2. 4. Input this position to the neural network. The network returns (1) a vector of probabilities and (2) the position’s (estimated) value.
1. The position’s value is higher if the current player seems to have a higher chance of winning from that position (and vice versa).
5. Update parameter values (visit count, action values) for all the edges involved using the neural network’s output.

And here’s the pseudocode: $\text{Start at root state s} \\ \text{while state is not at leaf node:} \\ \indent\text{choose move } a = argmax(Q(s,a) + U(s,a)) \\ \indent\text{append edge (s,a) to visited\_in\_this\_sim} \\ \indent\text{execute move }a \rightarrow \text{new state }s'\\ \text{expand leaf position }s' \\ \text{evaluate }f_{\theta}(s') \rightarrow P(s',\dot), V(s') \\ \text{for edge (s,a) in visited\_in\_this\_sim:} \\ \indent\text{increment visit count }N(s,a)\\ \indent\text{update action value }Q(s,a) = \frac{1}{N(s,a)}\sum_{s'|(s,a\rightarrow s')}V(s')$

In the last line, the action value for each edge $Q(s,a)$ is updated to be the mean evaluation $V(s')$ over simulations where $s'$ was reached after taking move $a$ from position $s$.

After this, the model chooses an action to play from the root state proportionate to its exponentiated visit count $\pi(a|s_0)=N(s_0,a)^{1/\tau}/\sum_bN(s_0,b)^{1/\tau}$.

#### 3. Training the Neural Network Finally, we will look at the neural network that the MCTS uses to evaluate positions and output probabilities.

The neural network comprises ‘convolutional blocks’ and ‘residual blocks’. Convolutional blocks apply (1) convolution layers, (2) batch normalisation and (3) ReLUs sequentially. Residual blocks comprise two convolution layers with a skip connection before the last rectifier nonlinearity (ReLU). If that confused you, you may find this post on convolutional neural networks helpful.

The model does two things in parallel. In the first thread, it gathers data from games it plays against itself. In the second thread, it trains the neural network using data from the game it just played. The processes are synchronised: when game $i$ is being played, the network is being trained on data from game $i-1$. The network parameters are updated at the end of each iteration (each game).

Here’s the pseudocode: $\text{initialise random weights }\theta_0 \\ \text{for each iteration(game) }i: \\ \indent\text{thread 1:} \\ \indent\indent\text{for each timestep (move) }t: \\ \indent\indent\indent\text{execute MCTS }\pi_t = \alpha_{\theta_{i-1}}(s_t)\text{ using }f_{\theta_{i-1}} \\ \indent\indent\indent\text{play move }a=argmax\vec{\pi_t} \\ \indent\indent\indent\text{if game\_end\_conditions\_met:} \\ \indent\indent\indent\indent\text{break} \\ \indent\indent\text{score game to give final reward }r_T\in{-1,1} \\ \indent\indent\text{for each timestep t:} \\ \indent\indent\indent z_t = r_t\text{ from perspective of current player (player playing that move)} \\ \indent\indent\indent\text{store data }(s_t, \pi_t, z_t) \\ \indent\text{thread 2:} \\ \indent\indent\text{train network on data sampled uniformly from all }t\text{ of game }(i-1) \\ \indent\indent\text{loss }l=(z-v)^2 - \pi^T\log p + c||\theta||^2 \\ \indent\indent\text{adjust }\theta\text{ by gradient descent on loss }l \\$

The loss $l$ is a sum over the mean-squared error (MSE) and the cross-entropy loss, where $c$ is an L2 parameter to prevent overfitting.

That’s it! Let me know if you have any questions or suggestions in the comments section. You can read the original AlphaGo Zero blog post and paper here.