Explaining Tensorflow Code for a Convolutional Neural Network

Jessica Yung · Artificial Intelligence, Highlights, Self-Driving Car ND

In this post, we will go through the code for a convolutional neural network. We will use Aymeric Damien’s implementation. I recommend you have a skim before you read this post. I have included the key portions of the code below.

If you’re not familiar with TensorFlow or neural networks, you may find it useful to read my post on multilayer perceptrons (a simpler neural network) first.

Feature image credits: Aphex34 (Wikimedia Commons)

1. Code

Here are the relevant network parameters and graph input for context (skim this, I’ll explain it below). This network is applied to MNIST data – scans of handwritten digits from 0 to 9 we want to identify.
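In outline, the parameters look roughly like this (values as I recall them from Aymeric Damien's example — double-check against the original gist):

```python
# Sketch of the network parameters (values from Aymeric Damien's example;
# verify against the original gist).
learning_rate = 0.001  # step size for the optimizer
batch_size = 128       # images fed to the model at a time
n_input = 784          # MNIST input size: 28 x 28 pixels, flattened
n_classes = 10         # one class per digit, 0-9
dropout = 0.75         # probability of KEEPING a given activation
```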

Here is the model (I will explain this below):
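As a rough sketch of what conv_net does, here is the sequence of tensor shapes it produces, assuming 5×5 filters, 32 and then 64 feature maps, and 2×2 pooling (the values used in Damien's example):

```python
# Shape bookkeeping for conv_net (plain Python, no TensorFlow needed).
# Shapes are (batch, height, width, depth).
def conv_same(shape, out_depth):
    # Convolution with "SAME" padding and stride 1 keeps height and width.
    n, h, w, _ = shape
    return (n, h, w, out_depth)

def maxpool(shape, k):
    # k x k pooling with stride k divides height and width by k.
    n, h, w, d = shape
    return (n, h // k, w // k, d)

x = (128, 28, 28, 1)        # input batch after the first reshape
x = conv_same(x, 32)        # conv1: (128, 28, 28, 32)
x = maxpool(x, 2)           # pool1: (128, 14, 14, 32)
x = conv_same(x, 64)        # conv2: (128, 14, 14, 64)
x = maxpool(x, 2)           # pool2: (128, 7, 7, 64)
flat = (x[0], x[1] * x[2] * x[3])  # reshape for fc1: (128, 3136)
```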

2. Translating the code

Let’s draw the model the function conv_net represents. The batch size given is 128. That means that each time, at most 128 images are fed into our model.

The big picture:


In more detail:



We can see that there are five types of layers here:

  • convolution layers,
  • max pooling layers,
  • layers for reshaping input,
  • fully-connected layers and
  • dropout layers.

2.1 What is conv2d (convolution layer)?

A convolution layer tries to extract higher-level features by replacing the data for each pixel with a value computed from all the pixels covered by a filter (e.g. a 5×5 filter) centered on that pixel.


Credits: Fletcher Bach

We slide the filter across the width and height of the input and compute the dot products between the entries of the filter and input at each position. I explain this further when discussing tf.nn.conv2d() below.

Stanford’s CS231n course provides an excellent explanation of how convolution layers work (complete with diagrams). Here we will focus on the code.
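To make "dot product between the filter and an input patch" concrete, here is one output value computed by hand in NumPy (a toy 5×5 image and a 3×3 averaging filter, both made up for illustration):

```python
import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)  # a toy 5x5 "image"
filt = np.ones((3, 3)) / 9.0                      # a 3x3 averaging filter

# Value of the output pixel centered at (2, 2): elementwise-multiply the
# filter with the 3x3 patch it covers, then sum.
patch = image[1:4, 1:4]
out_22 = np.sum(patch * filt)
# out_22 == 12.0, the average of the nine covered pixels
```

Sliding the filter to every position and repeating this dot product fills in the whole output map.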

This function comprises three parts:

  1. Conv2D layer from Tensorflow
    1. tf.nn.conv2d()
    2. This is analogous to xW (multiplying input by weights) in a fully connected layer.
  2. Add bias
  3. ReLU activation
    1. This transforms the output like so: ReLU(x) = max(0, x). (See my previous post for more details.)

      X axis: input, y axis: output. Credits: ml4a on GitHub

You can see it is structurally the same as a fully connected layer, except we multiply the input with weights in a different way.
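As a tiny check of step 3, ReLU in plain NumPy:

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x), applied elementwise
    return np.maximum(0.0, x)

out = relu(np.array([-2.0, -0.5, 0.0, 3.0]))
# negative inputs are zeroed; positive inputs pass through unchanged
```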

Conv2D layer

The key part here is tf.nn.conv2d(). Let’s look at each of its arguments.

  • x is the input.
  • W are the weights.
    • The weights have four dimensions:  [filter_height, filter_width, input_depth, output_depth].
    • What this means is that we have  output_depth filters in this layer.
      • Each filter considers information with dimensions [filter_height, filter_width, input_depth] at a time. Yes, each filter goes through ALL the input depth layers.
      • This is like how, in a fully connected layer, we may have ten neurons, each of which interacts with all the neurons in the previous layer.
  • stride is the number of units the filter shifts each time.
    • Why are there four dimensions? This is because the input tensor has four dimensions: [number_of_samples, height, width, colour_channels].
    • strides = [1, strides, strides, 1] thus applies the filter to every image and every colour channel, shifting the filter window by strides pixels at a time in the height and width dimensions.
    • You don’t usually skip entire images or entire colour channels, so those positions are hardcoded as 1 here.
    • E.g. strides=[1, 2, 2, 1] would apply the filter to every other image patch in each dimension. (Image below has width stride 1.)
  • "SAME" padding: the output size is the same as the input size. This requires the filter window to slide beyond the edges of the input map. The portions where the filter window falls outside the input map are the padding (filled with zeros).

      “SAME” padding: output (green) is the same size as the input (blue). Stride = 1.

    • The alternative is "VALID" padding, where there is no padding. The filter window stays inside the input map the whole time (in valid positions), so the output size is smaller than the input.

      ‘VALID’ padding.
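The two padding modes give different output sizes. A small helper (following TensorFlow's documented output-size rules for conv2d) makes the difference concrete:

```python
import math

def out_size(in_size, filter_size, stride, padding):
    # Output-size rules as documented for tf.nn.conv2d.
    if padding == "SAME":
        return math.ceil(in_size / stride)
    if padding == "VALID":
        return math.ceil((in_size - filter_size + 1) / stride)

same = out_size(28, 5, 1, "SAME")    # 28: same as the input
valid = out_size(28, 5, 1, "VALID")  # 24: a 5-wide filter fits in only 24 positions
```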

2.2 What is maxpool2d (max_pool)?

Pooling layers reduce the spatial size of the output by replacing each kernel-sized window of values with a single value computed from them. E.g. in max pooling, you take the maximum of each pool (window) as the new value for that region.

Here the kernel is square and the kernel size is set to be the same as the stride. It resizes the input as shown in the diagram below:


Max Pooling with k = 2. Credits: CS231n
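The same k = 2 pooling, done on a small 4×4 example in plain NumPy:

```python
import numpy as np

# Max pooling with k = 2: split the input into 2x2 blocks and keep
# the maximum of each block.
x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
k = 2
h, w = x.shape
pooled = x.reshape(h // k, k, w // k, k).max(axis=(1, 3))
# pooled == [[6, 8],
#            [3, 4]]
```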

2.3 Layers for reshaping input

We reshape input twice in this model. The first time is at the beginning:

Recall the input was

That is, each sample input to the model was a one-dimensional array: an image flattened into a list of pixels. In other words, whoever preprocessed the MNIST dataset did this to each image:


And now we’re reversing the process:
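In NumPy terms (the TensorFlow reshape behaves the same way):

```python
import numpy as np

# A batch of flattened 784-pixel rows becomes a batch of 28x28
# single-channel images. The -1 tells reshape to infer the batch
# dimension from the data.
batch = np.zeros((128, 784))
images = batch.reshape(-1, 28, 28, 1)
# images.shape == (128, 28, 28, 1)
```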


The second time we reshape the input is right after the last convolution and pooling layers, just before the first fully connected layer:

We’re doing this again to prepare 1D input for the fully connected layer:


2.4 What is a fully connected layer?


Do a Wx + b: For each neuron (number of neurons = number of outputs), multiply each input by a weight, sum all those products up and then add a bias to get your output for that neuron.

See post Explaining TensorFlow code for a Multilayer Perceptron.
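A toy Wx + b with three inputs and two neurons (all numbers made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # input: 3 values
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])      # weights: 3 inputs x 2 neurons
b = np.array([0.5, -0.5])       # one bias per neuron
out = x @ W + b
# out == [4.5, 4.5]: each neuron sums its weighted inputs, then adds its bias
```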

2.5 What is Dropout?

Just before the output layer, we apply Dropout:

Dropout sets a proportion 1 − dropout of the activations (neuron outputs) passed on to the next layer to zero; here the dropout parameter is the probability of keeping an activation. The zeroed-out outputs are chosen at random.

  • What happens if we set the dropout parameter to 0?

This reduces overfitting by checking that the network can provide the right output even if some activations are dropped out.
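A sketch of (inverted) dropout in plain NumPy — note again that in this model the dropout parameter is the probability of keeping an activation, matching TensorFlow's keep_prob convention:

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.75                      # the "dropout" parameter in this model
acts = np.ones(1000)                  # pretend activations
mask = rng.random(1000) < keep_prob   # True where the activation survives
out = np.where(mask, acts / keep_prob, 0.0)
# roughly 25% of the outputs are zeroed; survivors are scaled by 1/keep_prob
# so the expected total activation is unchanged
```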

And that’s a wrap – hope you found this useful! If you enjoyed this or have any suggestions, do leave a comment.

Further reading:

4 Comments on “Explaining Tensorflow Code for a Convolutional Neural Network”

  1. l3v0

    Great! I love your post. Could you explain why we use “-1” in the shape parameter of
    x = tf.reshape(x, shape=[-1, 28, 28, 1])? Thanks.

    1. Jessica Yung

      Hi there, putting -1 means that the length can take any value there. Specifically, the first position refers to the number of examples you feed in. So the -1 in the first position means you can feed in as many or as few examples as you like in one go.

      You want this because you may change your batch size (or your final batch may have a different number of examples if the total number of training examples may not be divisible by your batch size). Hope this helps!

  2. The Anh

    Thank you for your good explanation. If you don’t mind, please explain conv2d in the formulas. For example: We have an input I of shape [batch_size, w, h, i_channels] and weights (filters) W of shape [fw, fh, i_channels, o_channels]. So how does conv2d compute to output O = f(I, W) of shape [batch_size, w, h, o_channels] (in case of padding “SAME”).
    Thank you in advance 🙂

    1. Jessica Yung

      Hi there, thanks for leaving a message!

      The convolution demo in Stanford’s course on CNNs (CS231n) explains this well. Basically you take the dot product of the filter and the input once for every entry of your output volume, shifting the filter (width and height-wise) by stride units around the input to fill up the output.

      For TensorFlow’s implementation of conv2d, you can read the code for the function convolution here.

      For SAME padding, the output dimensions are ceil(input_dimensions/stride). We would pad the input with zeros if the input dimensions are not divisible by the stride.

      Let me know if you have further questions.
