In this post I go through the algorithms presented in the groundbreaking AlphaGo Zero paper using pseudocode. The objective is to provide a high-level idea of what the model does.
Why AlphaGo Zero matters
Last week, Google DeepMind published their final iteration of AlphaGo, AlphaGo Zero. To say its performance is remarkable is an understatement. AlphaGo Zero made two breakthroughs:
- It was given no information other than the rules of the game.
- Previous versions of AlphaGo were given a large number of human games.
- It took a much shorter period of time to train and was trained on a single machine.
- It beat AlphaGo after training for three days and beat AlphaGo Master after training for only forty days (vs months).
- Note that it was trained with much less computational power than AlphaGo Master.
That is, it was able to achieve a level of performance way above current human world champions by training from scratch with no data apart from the rules of the game. Go is considered the most complex board game we’ve got. It’s much harder for machines to do well in Go than in chess (which is already hard).
In this post I will describe three algorithms:
- The core reinforcement learning algorithm, which makes heavy use of a neural network guided by Monte Carlo Tree Search,
- The Monte Carlo Tree Search (MCTS) algorithm, and
- How they train the neural network .
1. Reinforcement Learning algorithm
At its core, the model chooses the move recommended by Monte Carlo Tree Search guided by a neural network:
The Monte Carlo Tree Search serves as a policy improvement operator. That is, the actions chosen with MCTS are claimed to be much better than the direct recommendations of the neural network .
This high-level description abstracts away most of the information – we will now delve into MCTS and the training of the neural network to see how the model learns.
2. Monte Carlo Tree Search (MCTS)
This section is more complex, so I will explain the algorithm in words before showing the pseudocode. Here’s the outline:
- Choose a move that maximises , where is the action value and .
- is the mean action value.
- : If two moves have an equal action value, we choose the one that we’ve visited less often than we’d have expected. This encourages exploration.
- Execute that move. We are now at a different state (board position).
- Repeat the above until you reach a leaf node. Call this state .
- A leaf node is a position we haven’t explored.
- Input this position to the neural network. The network returns (1) a vector of probabilities and (2) the position’s (estimated) value.
- The position’s value is higher if the current player seems to have a higher chance of winning from that position (and vice versa).
- Update parameter values (visit count, action values) for all the edges involved using the neural network’s output.
And here’s the pseudocode:
In the last line, the action value for each edge is updated to be the mean evaluation over simulations where was reached after taking move from position .
After this, the model chooses an action to play from the root state proportionate to its exponentiated visit count .
3. Training the Neural Network
Finally, we will look at the neural network that the MCTS uses to evaluate positions and output probabilities.
The neural network comprises ‘convolutional blocks’ and ‘residual blocks’. Convolutional blocks apply (1) convolution layers, (2) batch normalisation and (3) ReLUs sequentially. Residual blocks comprise two convolution layers with a skip connection before the last rectifier nonlinearity (ReLU). If that confused you, you may find this post on convolutional neural networks helpful.
The model does two things in parallel. In the first thread, it gathers data from games it plays against itself. In the second thread, it trains the neural network using data from the game it just played. The processes are synchronised: when game is being played, the network is being trained on data from game . The network parameters are updated at the end of each iteration (each game).
Here’s the pseudocode:
The loss is a sum over the mean-squared error (MSE) and the cross-entropy loss, where is an L2 parameter to prevent overfitting.
That’s it! Let me know if you have any questions or suggestions in the comments section. You can read the original AlphaGo Zero blog post and paper here.