In this post, we will go through the code for a convolutional neural network. We will use Aymeric Damien’s implementation. I recommend you have a skim before you read this post. I have included the key portions of the code below.
If you’re not familiar with TensorFlow or neural networks, you may find it useful to read my post on multilayer perceptrons (a simpler neural network) first.
Feature image credits: Aphex34 (Wikimedia Commons)
1. Code
Here are the relevant network parameters and graph input for context (skim this, I’ll explain it below). This network is applied to MNIST data – scans of handwritten digits from 0 to 9 we want to identify.
    # Parameters
    learning_rate = 0.001
    training_iters = 200000
    batch_size = 128
    display_step = 10

    # Network Parameters
    n_input = 784  # MNIST data input (img shape: 28*28)
    n_classes = 10  # MNIST total classes (0-9 digits)
    dropout = 0.75  # Dropout, probability to keep units

    # tf Graph input
    x = tf.placeholder(tf.float32, [None, n_input])  # input, i.e. pixels that constitute the image
    y = tf.placeholder(tf.float32, [None, n_classes])  # labels, i.e. which digit the image is
    keep_prob = tf.placeholder(tf.float32)  # dropout (keep probability)
Here is the model (I will explain this below):
    # Create model
    def conv_net(x, weights, biases, dropout):
        # Reshape input picture (-1 lets TensorFlow infer the batch size)
        x = tf.reshape(x, shape=[-1, 28, 28, 1])

        # Convolution Layer
        conv1 = conv2d(x, weights['wc1'], biases['bc1'])
        # Max Pooling (down-sampling)
        conv1 = maxpool2d(conv1, k=2)

        # Convolution Layer
        conv2 = conv2d(conv1, weights['wc2'], biases['bc2'])
        # Max Pooling (down-sampling)
        conv2 = maxpool2d(conv2, k=2)

        # Reshape conv2 output to fit fully connected layer input
        fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])

        # Fully connected layer
        fc1 = tf.add(tf.matmul(fc1, weights['wd1']), biases['bd1'])
        fc1 = tf.nn.relu(fc1)

        # Apply Dropout
        fc1 = tf.nn.dropout(fc1, dropout)

        # Output, class prediction
        out = tf.add(tf.matmul(fc1, weights['out']), biases['out'])
        return out

    # Store layers weight & bias
    weights = {
        # 5x5 conv, 1 input, 32 outputs
        'wc1': tf.Variable(tf.random_normal([5, 5, 1, 32])),
        # 5x5 conv, 32 inputs, 64 outputs
        'wc2': tf.Variable(tf.random_normal([5, 5, 32, 64])),
        # fully connected, 7*7*64 inputs, 1024 outputs
        'wd1': tf.Variable(tf.random_normal([7*7*64, 1024])),
        # 1024 inputs, 10 outputs (class prediction)
        'out': tf.Variable(tf.random_normal([1024, n_classes]))
    }

    biases = {
        'bc1': tf.Variable(tf.random_normal([32])),
        'bc2': tf.Variable(tf.random_normal([64])),
        'bd1': tf.Variable(tf.random_normal([1024])),
        'out': tf.Variable(tf.random_normal([n_classes]))
    }

    # Construct model
    pred = conv_net(x, weights, biases, keep_prob)
2. Translating the code
Let’s draw the model the function conv_net represents. The batch size given is 128. That means that each time, at most 128 images are fed into our model.
The big picture:
In more detail:
We can see that there are five types of layers here:
 convolution layers,
 max pooling layers,
 layers for reshaping input,
 fully connected layers, and
 dropout layers.
2.1 What is conv2d (convolution layer)?
A convolution layer tries to extract higher-level features by replacing the data for each pixel with a value computed from all the pixels covered by a filter (e.g. 5×5) centered on that pixel.
We slide the filter across the width and height of the input and compute the dot products between the entries of the filter and input at each position. I explain this further when discussing tf.nn.conv2d() below.
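To make the sliding dot product concrete, here is a minimal pure-Python sketch for a single-channel image and a single filter with no padding. The function name conv2d_valid and the toy 4×4 image are made up for illustration; tf.nn.conv2d performs the same kind of computation over whole batches and multiple channels, far more efficiently.

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` (both lists of rows) without padding."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    output = []
    for top in range(0, img_h - k_h + 1, stride):
        row = []
        for left in range(0, img_w - k_w + 1, stride):
            # Dot product between the filter and the patch it currently covers.
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Hypothetical toy data: a 4x4 "image" and a 2x2 filter.
image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [1, 0],
    [0, 1],
]

result = conv2d_valid(image, kernel)
print(result)  # [[2, 4, 6], [0, 2, 4], [6, 0, 2]]
```

Each output entry summarises one 2×2 patch of the input, which is why the 4×4 input shrinks to 3×3 when no padding is used.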
Stanford’s CS231n course provides an excellent explanation of how convolution layers work (complete with diagrams). Here we will focus on the code.
    def conv2d(x, W, b, strides=1):
        # Conv2D wrapper, with bias and relu activation
        x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
        x = tf.nn.bias_add(x, b)
        return tf.nn.relu(x)
This function comprises three parts:
 Conv2D layer from TensorFlow: tf.nn.conv2d()
   This is analogous to xW (multiplying input by weights) in a fully connected layer.
 Add bias
 ReLU activation
   This transforms the output like so: relu(x) = max(0, x). (See previous post for more details.)
You can see it is structurally the same as a fully connected layer, except we multiply the input with weights in a different way.
Conv2D layer
The key part here is tf.nn.conv2d(). Let’s look at each of its arguments.
 x is the input.
 W are the weights.
   The weights have four dimensions: [filter_height, filter_width, input_depth, output_depth].
   What this means is that we have output_depth filters in this layer.
   Each filter considers information with dimensions [filter_height, filter_width, input_depth] at a time. Yes, each filter goes through ALL the input depth layers.
   This is like how, in a fully connected layer, we may have ten neurons, each of which interacts with all the neurons in the previous layer.
 stride is the number of units the filter shifts each time.
   Why does strides have four dimensions? This is because the input tensor has four dimensions: [number_of_samples, height, width, colour_channels].
   strides=[1, strides, strides, 1] thus applies the filter to every image and every colour channel, and to every strides-th image patch in the height and width dimensions.
   You don’t usually skip entire images or entire colour channels, so those positions are hardcoded as 1 here.
   E.g. strides=[1, 2, 2, 1] would apply the filter to every other image patch in each dimension. (Image below has width stride 1.)

 "SAME" padding: with a stride of 1, the output size is the same as the input size. This requires the filter window to shift out of the input map. The portions where the filter window is outside of the input map are the padding.
   The alternative is "VALID" padding, where there is no padding. The filter window stays inside the input map the whole time (in valid positions), so the output size is smaller than the input.
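The effect of stride and padding on the output size can be sketched with a small helper. The function name conv_output_size is an assumption for illustration only; the two formulas are the ones TensorFlow documents for "SAME" and "VALID" padding along one spatial dimension.

```python
import math


def conv_output_size(input_size, filter_size, stride, padding):
    """Output length along one spatial dimension (height or width)."""
    if padding == "SAME":
        # The input is padded so that only the stride shrinks the output.
        return math.ceil(input_size / stride)
    if padding == "VALID":
        # No padding: the filter must fit entirely inside the input.
        return math.ceil((input_size - filter_size + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")


# A 28-pixel-wide MNIST image with a 5-wide filter:
same_stride_1 = conv_output_size(28, 5, 1, "SAME")    # 28
valid_stride_1 = conv_output_size(28, 5, 1, "VALID")  # 24
same_stride_2 = conv_output_size(28, 5, 2, "SAME")    # 14
valid_stride_2 = conv_output_size(28, 5, 2, "VALID")  # 12
print(same_stride_1, valid_stride_1, same_stride_2, valid_stride_2)
```

In this network both convolutions use stride 1 with "SAME" padding, so they leave the 28×28 height and width untouched; only the pooling layers shrink them.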
2.2 What is maxpool2d (max_pool)?
Pooling layers reduce the spatial size of the output by replacing the values inside each kernel window with a single value computed from them. E.g. in max pooling, you take the maximum out of every pool (kernel window) as the new value for that pool.
    def maxpool2d(x, k=2):
        # MaxPool2D wrapper
        return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1],
                              padding='SAME')
Here the kernel is square and the kernel size is set to be the same as the stride. It resizes the input as shown in the diagram below:
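As a rough illustration of the resizing, here is a pure-Python sketch of max pooling on a single-channel grid where the window size equals the stride. The name maxpool2d_sketch and the toy 4×4 grid are hypothetical; edge windows are simply clipped to the grid, which is an assumption meant to mimic how padded positions never win the maximum.

```python
def maxpool2d_sketch(grid, k=2):
    """Non-overlapping k x k max pooling on a list of rows."""
    height, width = len(grid), len(grid[0])
    pooled = []
    for top in range(0, height, k):
        row = []
        for left in range(0, width, k):
            # Collect the values inside the current window (clipped at the edges).
            window = [
                grid[i][j]
                for i in range(top, min(top + k, height))
                for j in range(left, min(left + k, width))
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical toy input.
grid = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]

pooled = maxpool2d_sketch(grid, k=2)
print(pooled)  # [[6, 4], [8, 9]]
```

With k=2, height and width are each halved, so a 28×28 feature map becomes 14×14.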
2.3 Layers for reshaping input
We reshape input twice in this model. The first time is at the beginning:
    x = tf.reshape(x, shape=[-1, 28, 28, 1])
Recall the input was
    x = tf.placeholder(tf.float32, [None, n_input])  # input, i.e. pixels that constitute the image
That is, each sample fed into the model was a one-dimensional array: an image flattened into a list of pixels. In other words, whoever preprocessed the MNIST dataset did this with each image:
And now we’re reversing the process:
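A minimal pure-Python sketch of this reversal, using a tiny 2×3 "image" instead of 28×28. The helpers unflatten and infer_batch_size are hypothetical names for illustration; the second one mimics what tf.reshape does when a dimension is given as -1, namely working out that dimension from the total number of values.

```python
def unflatten(pixels, height, width):
    """Turn a flat list of pixels back into a list of rows."""
    if len(pixels) != height * width:
        raise ValueError("pixel count does not match height * width")
    return [pixels[r * width:(r + 1) * width] for r in range(height)]


def infer_batch_size(total_values, height=28, width=28, channels=1):
    """Work out the batch dimension that a -1 in tf.reshape would take."""
    per_image = height * width * channels
    if total_values % per_image != 0:
        raise ValueError("values do not divide evenly into images")
    return total_values // per_image


flat = [0, 1, 2, 3, 4, 5]
image = unflatten(flat, height=2, width=3)
print(image)  # [[0, 1, 2], [3, 4, 5]]

# A batch of 128 flattened MNIST images holds 128 * 784 values.
batch = infer_batch_size(128 * 784)
print(batch)  # 128
```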
The second time we reshape input is right after the second convolution layer (and its max pooling), before the first fully connected layer:
    # Reshape conv2 output to fit fully connected layer input
    # latter part gets the first dimension of the shape of weights['wd1'], i.e. the number of rows it has.
    fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])
We’re doing this again to prepare 1D input for the fully connected layer:
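The 7*7*64 rows of weights['wd1'] follow from how each layer changes the shape of one image. The sketch below traces this, assuming (as in the code above) stride-1 "SAME" convolutions that keep height and width, and 2×2 max pooling that halves them; the function name trace_shapes is made up for illustration.

```python
import math


def trace_shapes(height=28, width=28, pool=2, conv_depths=(32, 64)):
    """Return the (height, width, depth) of one image after each layer."""
    shapes = [(height, width, 1)]  # reshaped greyscale input
    for depth in conv_depths:
        # Convolution: size unchanged, depth becomes the number of filters.
        shapes.append((height, width, depth))
        # Max pooling: height and width shrink by the pool factor.
        height = math.ceil(height / pool)
        width = math.ceil(width / pool)
        shapes.append((height, width, depth))
    return shapes


shapes = trace_shapes()
print(shapes)
# [(28, 28, 1), (28, 28, 32), (14, 14, 32), (14, 14, 64), (7, 7, 64)]

final_h, final_w, final_d = shapes[-1]
flattened_size = final_h * final_w * final_d
print(flattened_size)  # 3136, i.e. 7*7*64
```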
2.4 What is a fully connected layer?
Do a Wx + b: For each neuron (number of neurons = number of outputs), multiply each input by a weight, sum all those products up and then add a bias to get your output for that neuron.
See post Explaining TensorFlow code for a Multilayer Perceptron.
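For a single sample, the computation can be sketched in pure Python as below. The helper names and the toy numbers are assumptions for illustration; the weights are laid out as one row per input and one column per neuron, matching the [7*7*64, 1024] shape of weights['wd1'].

```python
def fully_connected(inputs, weights, biases):
    """inputs: n_in values; weights: n_in rows x n_out columns; biases: n_out values."""
    outputs = []
    for j in range(len(biases)):
        # Weighted sum of all inputs for neuron j, plus its bias.
        total = sum(x * row[j] for x, row in zip(inputs, weights))
        outputs.append(total + biases[j])
    return outputs


def relu(values):
    """Replace negative values with zero."""
    return [max(0.0, v) for v in values]


# Hypothetical toy layer: 2 inputs, 2 neurons.
inputs = [1.0, 2.0]
weights = [
    [1.0, -1.0],
    [0.5, -2.0],
]
biases = [0.5, 1.0]

pre_activation = fully_connected(inputs, weights, biases)
activated = relu(pre_activation)
print(pre_activation)  # [2.5, -4.0]
print(activated)       # [2.5, 0.0]
```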
2.5 What is Dropout?
Just before the output layer, we apply Dropout:
    fc1 = tf.nn.dropout(fc1, dropout)
Dropout sets a proportion 1 - dropout of the activations (neuron outputs) passed on to the next layer to zero. The zeroed-out outputs are chosen randomly.
 What happens if we set the dropout parameter to 0?
This reduces overfitting by checking that the network can provide the right output even if some activations are dropped out.
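Here is a minimal pure-Python sketch of the idea, not TensorFlow's implementation. The function name dropout_sketch and the seeded random generator are assumptions for illustration. Like tf.nn.dropout, it scales the surviving activations by 1 / keep_prob so that the expected sum of the layer's outputs stays the same.

```python
import random


def dropout_sketch(activations, keep_prob, rng):
    """Keep each activation with probability keep_prob, scaling the survivors."""
    if not 0 < keep_prob <= 1:
        # This sketch rejects keep_prob = 0 to avoid dividing by zero.
        raise ValueError("keep_prob must be in (0, 1]")
    return [
        a / keep_prob if rng.random() < keep_prob else 0.0
        for a in activations
    ]


rng = random.Random(0)  # seeded so that runs are repeatable
activations = [1.0] * 10000
dropped = dropout_sketch(activations, keep_prob=0.75, rng=rng)

kept_fraction = sum(1 for v in dropped if v != 0.0) / len(dropped)
print(round(kept_fraction, 2))  # close to 0.75
```

At evaluation time you would feed keep_prob = 1.0, which leaves every activation unchanged.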
And that’s a wrap – hope you found this useful! If you enjoyed this or have any suggestions, do leave a comment.
4 Comments on “Explaining Tensorflow Code for a Convolutional Neural Network”
Great! I love your post. Could you explain why we use “-1” in the shape parameter of
x = tf.reshape(x, shape=[-1, 28, 28, 1])? Thanks.
Hi there, putting -1 means that the length can take any value there. Specifically, the first position refers to the number of examples you feed in. So the -1 in the first position means you can feed in as many or as few examples as you like in one go. You want this because you may change your batch size (or your final batch may have a different number of examples if the total number of training examples is not divisible by your batch size). Hope this helps!
Thank you for your good explanation. If you don’t mind, please explain conv2d in the formulas. For example: We have an input I of shape [batch_size, w, h, i_channels] and weights (filters) W of shape [fw, fh, i_channels, o_channels]. So how does conv2d compute to output O = f(I, W) of shape [batch_size, w, h, o_channels] (in case of padding “SAME”).
Thank you in advance 🙂
Hi there, thanks for leaving a message!
The convolution demo in Stanford’s course on CNNs (CS231n) explains this well. Basically you take the dot product of the filter and the input once for every entry of your output volume, shifting the filter (width- and height-wise) by stride units around the input to fill up the output.
For TensorFlow’s implementation of conv2d, you can read the code for the function convolution here.
For SAME padding, the output dimensions are ceil(input_dimensions/stride). We would pad the input with zeros if the input dimensions are not divisible by the stride.
Let me know if you have further questions.