Explaining TensorFlow code for a Multilayer Perceptron

In this post we go through the code for a multilayer perceptron in TensorFlow. We will use Aymeric Damien’s implementation. I recommend you have a skim before you read this post. I have included the key portions of the code below.

1. Code

Here are the relevant network parameters and graph input for context (skim this):
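The block below is reproduced (lightly trimmed) from the linked TensorFlow 1.x example; the values are the ones it uses for MNIST:

```python
import tensorflow as tf

# Network parameters
n_hidden_1 = 256  # 1st hidden layer number of features
n_hidden_2 = 256  # 2nd hidden layer number of features
n_input = 784     # MNIST data input (image shape: 28*28)
n_classes = 10    # MNIST total classes (digits 0-9)

# tf Graph input
x = tf.placeholder("float", [None, n_input])
y = tf.placeholder("float", [None, n_classes])
```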

Here is the model (I will explain this below):
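This is the model function and its weight and bias variables, as they appear in the linked TensorFlow 1.x example:

```python
# Store the layers' weights and biases, initialised from a normal distribution
weights = {
    'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
    'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_hidden_2, n_classes]))
}
biases = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'b2': tf.Variable(tf.random_normal([n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_classes]))
}

# Create the model
def multilayer_perceptron(x, weights, biases):
    # Hidden layer with ReLU activation
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.relu(layer_1)
    # Hidden layer with ReLU activation
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    layer_2 = tf.nn.relu(layer_2)
    # Output layer with linear activation
    out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
    return out_layer
```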

2. Translating the code: What is a multilayer perceptron?

Let’s draw the model the function multilayer_perceptron represents. I’ll assume that we are handling a batch of 100 training examples:

It’s ‘multi-layer’ because there is more than one hidden layer. Here’s the code for a hidden layer again so you can check that it corresponds exactly:
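These two lines (from the linked TensorFlow 1.x example) make up one hidden layer — a linear transform followed by ReLU:

```python
# One hidden layer: linear transform xW + b, then ReLU
layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
layer_1 = tf.nn.relu(layer_1)
```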

In its entirety, the quoted code above does three things:

1. It assigns values to the network parameters: the number of features extracted by each hidden layer (n_hidden_1 and n_hidden_2), the number of input features (n_input), and the number of output classes (n_classes).
2. It defines the model in multilayer_perceptron().
3. It initialises and stores the weights and biases (W and b) for each of the three layers in the model.

What is the xW + b part?
We are giving weights to every feature (in the first layer this is each pixel, in the second layer this is every feature we extracted in the first layer). These weights are a proxy to how important the features are in making up the next set of features. I explain this in more detail in the second section of this post on the classification process.
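To make the shapes concrete, here is a NumPy sketch of the first layer's xW + b (assuming a batch of 100 MNIST images and 256 extracted features, as in the example above):

```python
import numpy as np

x = np.random.randn(100, 784)   # batch of 100 examples, 784 pixels each
W = np.random.randn(784, 256)   # one weight per (input feature, output feature) pair
b = np.random.randn(256)        # one bias per output feature

out = x @ W + b                 # xW + b, broadcast over the batch
print(out.shape)                # (100, 256): 256 features for each of the 100 examples
```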

Often the features in the hidden layers are not easily interpretable by humans.

What is ReLU activation?

$\mathrm{ReLU}(x) = \max(0, x)$.

Plotted with the input on the x-axis and the output on the y-axis, ReLU is flat at zero for negative inputs and equal to the input for positive inputs.
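A one-line NumPy version makes the element-wise behaviour clear:

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x): negative inputs become 0, positive inputs pass through
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))
```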

The important thing to note is that it’s non-linear (as opposed to the xW + b part, which is linear).
Why do we need to add non-linearities? Because the composition of linear functions is itself linear, so without them the entire network would collapse to a single linear layer, no matter how many layers you stack.
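You can verify this collapse numerically (biases omitted for brevity): two stacked linear layers compute exactly the same thing as one linear layer whose weight matrix is the product of the two.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 784))
W1 = rng.standard_normal((784, 256))
W2 = rng.standard_normal((256, 10))

# Two purely linear layers applied one after the other...
two_layers = (x @ W1) @ W2
# ...equal a single linear layer with weights W1 @ W2.
one_layer = x @ (W1 @ W2)

print(np.allclose(two_layers, one_layer))  # True
```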

Activation just means output. The linear activation in the last layer of this model means ‘return the output without doing anything more (like ReLU) to it’.

So the output returned is just `n_classes` numbers, one for each class. In the case of MNIST, that’s 10 numbers because there are 10 possible outcomes (images are digits ranging from 0 to 9).
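For instance, with some hypothetical output scores (made up here for illustration), the index of the largest score is the network's predicted digit:

```python
import numpy as np

# Hypothetical output for one MNIST image: one score per digit class (0-9)
scores = np.array([1.2, -0.3, 0.5, 2.7, -1.0, 0.0, 0.4, 3.1, -0.2, 0.9])
print(scores.argmax())  # 7: the largest score (3.1) is at index 7, so predict the digit 7
```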

Initialisation of the weights and biases: What is `tf.random_normal([a,b])`?

It generates a matrix with shape [a,b] (a rows and b columns) with values randomly drawn from a normal distribution with mean 0 and standard deviation 1 (the defaults).
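A NumPy analogue of the same call:

```python
import numpy as np

a, b = 3, 4
# Equivalent of tf.random_normal([a, b]): mean 0, standard deviation 1
m = np.random.normal(loc=0.0, scale=1.0, size=(a, b))
print(m.shape)  # (3, 4)
```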

How do you choose the number of features extracted from each layer (n_hidden_1 and n_hidden_2)?

An empirically-derived rule of thumb is that ‘the optimal size of the hidden layer is usually between the size of the input and the size of the output layers’. Specifically, one could choose the number of features extracted (often referred to as neurons) in the layer to equal the mean of the number of features in the input and output layers. (Source: doug on Stats StackExchange)

There doesn’t seem to be a theory-based rule for choosing these values.
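Applying that rule of thumb to the MNIST sizes used here:

```python
# Heuristic: mean of the input and output layer sizes
n_input, n_classes = 784, 10
n_hidden = (n_input + n_classes) // 2
print(n_hidden)  # 397
```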

3. How would you adapt this model for different problems?

We have not discussed training the model yet, but if you wanted to use this model for a different problem, you would change the values of n_input and n_classes.

For example, with our traffic sign classifier, n_input would equal 32*32*3 = 3072: our images are 32 pixels by 32 pixels, and each pixel has three colour channels (red, green, blue). Note that n_input is the number of input features each example has, not the number of examples you feed into the model. n_classes would equal 43 because there are 43 possible traffic sign classes.
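The adapted parameter values would look like this:

```python
# Parameters for the traffic sign classifier described above
n_input = 32 * 32 * 3   # 32x32 pixels, 3 colour channels per pixel
n_classes = 43          # 43 possible traffic sign classes

print(n_input)   # 3072
```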