Building A Convolutional Neural Network With TensorFlow
A neural network is a mathematical model that learns to approximate outputs from training examples. It does so by adjusting its internal state, or variables, based on the magnitude of the difference between the true value of each training example and the value the network computes with its current variables.
A deep neural network has at least two hidden layers. Finally, the term convolutional indicates that at least one of the layers applies convolution to the previous layer’s representation. We will talk more about convolution later.
TensorFlow is a library created by Google which allows programmers to define, train, and test neural networks.
The purpose of this post is to explain the function and implementation of a deep convolutional neural network in TensorFlow for classifying images from the popular MNIST data set. MNIST is a labeled data set of hand-written digits.
The Basics #
TensorFlow models a neural network as a data-flow graph. That is to say, we describe a network as a collection of edges along which data of a particular dimensionality, or shape, passes, together with nodes that apply functions to the data or reshape it – such as by converting a [4, 1] vector (a tensor) into a [2, 2] matrix (also a tensor).
Let’s talk tensors.
A tensor is a multi-dimensional array. It is the general term for a bunch of numbers in a particular configuration.
An image can be represented as a tensor with shape [#width, #height, #channels]. For example, a black and white image of size 128x128 would have the shape [128, 128, 1]. An RGB color image of the same size would have the shape [128, 128, 3].
Tensors can represent much more than images, of course. Any data that can be expressed in terms of multi-dimensional numbers can be a tensor.
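As a quick illustration (a minimal sketch, not part of the MNIST model itself), here is how a few tensors of different shapes might be declared with TensorFlow's 1.x-style API, which the rest of this post uses:

import tensorflow as tf

# A [4, 1] column vector reshaped into a [2, 2] matrix; both are tensors.
v = tf.constant([[1.0], [2.0], [3.0], [4.0]])  # shape [4, 1]
m = tf.reshape(v, [2, 2])                      # shape [2, 2]

# Placeholders shaped like images: a 128x128 grayscale image and a 128x128 RGB image.
gray = tf.placeholder(tf.float32, [128, 128, 1])
rgb = tf.placeholder(tf.float32, [128, 128, 3])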
A neural network with a particular architecture can be described in terms of three properties, two of which are tensors (data and variables) and one of which describes the tensors (shapes).
In this section, we examine those three properties more closely.
Shapes represent the dimensionality of a tensor. For example, in MNIST, the data may start off having an input shape of [28, 28, 1] – 28x28, single-channel images – and an output shape of [10, 1] – a 10x1 vector representing the probability of the input image being each particular digit. Both the data and the variables have shapes.
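In the TensorFlow MNIST tutorials these shapes appear as placeholders, roughly like this; the leading None is the batch dimension, which can vary at runtime, and the 784-value input is the 28x28 image flattened into a vector:

# Each input image arrives as a flattened 28*28 = 784 vector; each label is a length-10 vector.
x = tf.placeholder(tf.float32, [None, 784])
y_ = tf.placeholder(tf.float32, [None, 10])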
Data flows through the network. It is the training and testing data that is supplied at runtime. It interacts with variables, is morphed according to the specified shapes of various transformations, and ultimately produces an output. Data is what we commonly refer to as a tensor. Hence, TensorFlow.
Variables are applied to data as it flows through the graph according to certain functions which are part of the network architecture. In neural network world, the quintessential variable is the weight between two nodes in a network. Based on labeled training data and a user-specified process, e.g. back-propagation by gradient descent, the network learns the optimal value for each variable.
We can model vanilla perceptrons with matrix multiplication.
The beginner MNIST tutorial shows how variables are defined and matrix multiplication is used to replicate a layer of a perceptron. The network is then configured to minimize the specified loss function. As the network is trained, it finds increasingly good values for the marked variables using the specified method, in this case gradient descent.
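For reference, that single-layer model boils down to a handful of lines roughly like the following (a sketch of the beginner tutorial's approach, reusing the x and y_ placeholders above):

# The weight matrix maps the 784 input pixels to 10 class scores; the bias offsets each score.
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
# One matrix multiplication followed by softmax: a single perceptron-style layer.
y = tf.nn.softmax(tf.matmul(x, W) + b)
# Cross-entropy loss, minimized with plain gradient descent.
cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)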
Easy to read, easy to implement. Unfortunately, the single-layer neural network is constrained because it can only learn relationships whose inputs are separable by a hyperplane. This makes sense if you consider that a perceptron functions by multiplying each input by a weight in its matrix and summing over the resulting products. The sum can be compared to some threshold value, essentially dividing the input space in two.
To learn more sophisticated functions, we will need more sophisticated transformations. Onwards!
Diving Into Convolutional Neural Networks #
A convolutional neural network is modeled after the mammalian visual cortex. In the context of image recognition, such a network encodes each pixel with additional information about its locality to create a more sophisticated model for understanding the broader image. By having multiple layers, each of which considers multiple parts of the previous layer’s representation, the convolutional neural network is able to learn sophisticated functions over the input data.
The most challenging part of understanding the network is understanding the TensorFlow API function, tf.nn.conv2d. Read this part carefully.
Keep in mind that we applied a matrix multiplication to our input to simulate a fully connected neural network layer. Using the convolution function will be much the same.
As we have learned, convolution in image processing is a piece-wise transformation of an image based on each pixel’s locality. For example, a simple convolution might be setting each pixel’s value to the average value of every nearby pixel (say, within a 3x3 bounding box).
In TensorFlow, we represent a convolution with a filter. The shape of a filter is [#filter_height, #filter_width, #in_channels, #out_channels]. What shape would the transformation described above have for a single-channel image? [3, 3, 1, 1].
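As a concrete, hypothetical example, that 3x3 averaging convolution could be expressed with tf.nn.conv2d as follows; every filter entry is 1/9, so each output pixel is the mean of its 3x3 neighborhood (the image placeholder here is just for illustration):

# A [3, 3, 1, 1] filter: a 3x3 window, one input channel, one output channel.
avg_filter = tf.constant(1.0 / 9.0, shape=[3, 3, 1, 1])
# tf.nn.conv2d expects a 4-D input of shape [batch, height, width, channels].
image = tf.placeholder(tf.float32, [None, 128, 128, 1])
blurred = tf.nn.conv2d(image, avg_filter, strides=[1, 1, 1, 1], padding='SAME')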
For the MNIST classifier, we use a filter with shape [5, 5, 1, 32]. This filter drastically increases the dimensionality of our data. Each pixel will be transformed to have 32 channels, each derived from the surrounding 5x5 (25-pixel) bounding box.
# We define the convolution filter to be a tensor of random numbers drawn from a truncated normal distribution, and our biases (below) to be small constants. The biases are slightly positive to avoid dead neurons.
W_conv1 = tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1))
# The b's denote "bias": a learned offset that helps the network fit the data. Even in simpler models, biases are beneficial because they allow the separating hyperplane to avoid passing through the origin.
b_conv1 = tf.Variable(tf.constant(0.1, shape=[32]))
The convolution is then applied to a tensor.
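The code that follows uses a small helper named conv2d rather than calling tf.nn.conv2d directly. The tutorial defines it roughly like this; fixing the stride to one in every dimension and using 'SAME' padding keeps the output at the input's 28x28 size:

def conv2d(x, W):
    # Slide the filter W across x one pixel at a time; 'SAME' padding preserves height and width.
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')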
# Each input arrives as a flat vector of 784 values. We reshape each data point to 28x28x1 in order to apply a 2D convolution; the -1 lets TensorFlow infer the batch dimension.
x_image = tf.reshape(x, [-1,28,28,1])
# The relu function maps each numerical component of the resulting tensor to max(0, value). Rectified linear units are used as the activation function in many deep neural networks these days instead of sigmoid or tanh because they have shown strong results and are much cheaper to compute during training (their derivative is literally 0 or 1, depending on the input).
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
Each convolutional step considers the area surrounding a pixel and creates a new representation which may have the same or different number of channels. The values of the transformation are learned by the algorithm based on the supplied cost function.
Examining the Architecture of the MNIST Convolutional Neural Network #
Let’s review how perceptrons and convolutional layers fit together to classify the handwritten digits from MNIST.
First Convolutional Layer
The first convolutional layer is what we talked about in the previous section. Our data comes in as MNIST’s standard input, with a shape of [784, 1]. The first layer preprocesses the data by reshaping it to [28, 28, 1] and then runs it through a filter with variable weights using 2D convolution to create a new representation of shape [28, 28, 32]. The result is run through a ReLU activation function. Finally, we downsample our data in order to reduce the complexity further down the chain: for every 2x2 block, we keep only the single data point with the highest magnitude. This is max pooling.
The output for the first layer is a tensor with shape [14, 14, 32].
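In code, the 2x2 downsampling step is handled by another small helper from the tutorial, sketched here:

def max_pool_2x2(x):
    # Keep the maximum value in each non-overlapping 2x2 block, halving height and width.
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

h_pool1 = max_pool_2x2(h_conv1)  # shape goes from [28, 28, 32] to [14, 14, 32]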
Second Convolutional Layer
The second convolutional layer accepts the output of the previous layer and runs it through another 2D convolution with a variable filter of shape [5, 5, 32, 64]. Each 5x5 block with 32 channels is considered to make a 64-channel representation of the input image. Again, we downsample the data.
The output is a [7, 7, 64] tensor.
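The second layer mirrors the first; here is a sketch following the tutorial, reusing the conv2d and max_pool_2x2 helpers from above:

# 5x5 filters that read 32 input channels and produce 64 output channels.
W_conv2 = tf.Variable(tf.truncated_normal([5, 5, 32, 64], stddev=0.1))
b_conv2 = tf.Variable(tf.constant(0.1, shape=[64]))
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)  # [14, 14, 64]
h_pool2 = max_pool_2x2(h_conv2)                           # [7, 7, 64]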
In the final two steps, we want to learn some way to go from the convolved image representation to a ten node layer, where each node will represent the probability of the input image being a particular digit. Remember, the goal of this exercise is to learn to classify 28x28 images as being some digit.
Penultimate Fully Connected Layer
At this point, we have 7 * 7 * 64 (3136) data points representing each of our images. Going straight from such a high-dimensional representation to a ten-node one may discard too much information at once. Thus, we add a fully connected linear layer which converts our three-dimensional representation back to a vector representation with 1024 hidden nodes. More on this later: neural network architecture is not quite a science; we know what tends to work for certain problems. We call this the fully connected (or densely connected) layer.
# We create a variable matrix that maps our 3136-dimensional flattened input to a 1024-dimensional output vector.
W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])
# We flatten each of the previous layer's outputs (h_pool2, the pooled output of the second convolutional layer) into a vector and run it through the weight matrix via matrix multiplication, representing a perceptron-style transformation.
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
What is the shape of the input and output of this layer? [7, 7, 64] and [1024, 1] respectively.
Final Linear Layer
We introduce one final linear layer, to convert our [1024, 1] tensor to the [10, 1] tensor we want. Each unit in the [10, 1] tensor will represent the probability of the input to the neural network being some particular digit.
# We know this step. We are specifying some weight and bias variables.
W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])
# And applying our matrix multiplication, which simulates a fully connected neural network layer.
# The full tutorial applies dropout to h_fc1 before this step (producing h_fc1_drop); here we feed h_fc1 in directly.
y_conv = tf.nn.softmax(tf.matmul(h_fc1, W_fc2) + b_fc2)
Finally, we apply the softmax function, which rescales the output vector so that all of its components sum to one. During training, the labels are supplied as one-hot vectors, where every value is zero except for a single one marking the correct digit; at prediction time, we simply take the digit whose component of the softmax output is largest.
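Concretely, and following the tutorial's evaluation code, the predicted digit is the index of the largest softmax output, which we compare against the index of the one in the one-hot label (a sketch):

# The predicted digit is the index with the highest probability; the true digit is the
# index of the single 1 in the one-hot label vector.
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))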
Training and Conclusion #
Having specified a model for our neural network, as well as the shape and initial values of all of our variables, we can now tell our model to begin learning the variables with respect to some cost function.
In the case of MNIST, the cost function of our model is the cross entropy between the network's predicted probabilities and the true labels.
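Expressed in code, this looks roughly like the following; the deep MNIST tutorial happens to use the Adam optimizer rather than plain gradient descent, but either will work:

# Cross entropy between the one-hot labels y_ and the predicted probabilities y_conv.
cross_entropy = -tf.reduce_sum(y_ * tf.log(y_conv))
# Each run of train_step nudges every variable in the direction that reduces the cost.
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)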
How does the learning of parameters happen?
Well, TensorFlow handles it for us behind the scenes. Due to the types of transformations that are allowed, the cost function is differentiable with respect to each variable. Thus, with every training example, TensorFlow examines how much each variable contributed to the cost of our prediction and modifies the variable accordingly.
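If you want to peek at that machinery, TensorFlow exposes the symbolic gradients directly; this snippet is purely illustrative and not part of the tutorial's training loop:

# One gradient tensor per trainable variable, each holding d(cost)/d(variable).
grads = tf.gradients(cross_entropy, tf.trainable_variables())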
For more information on learning by modifying variables, typically referred to as weights, read about back-propagation on the Internet. I recommend getting a feel for the math behind the process.
One notable drawback of neural networks is that there is little transparency into how a network morphs data through its layers. Visualizing complex transformations over multi-dimensional space is not a solved problem, and it takes a great deal of effort to devise experiments which allow us to interpret intermediate results.
As a result, the configuration of a network’s layers is as much an art as a science. We know what tends to work for some types of problems – e.g. convolution for images, or a pooling layer after convolution steps – and we use that knowledge to guide the layout of our network. Once the configuration is defined, the weights are learned simply because they yield the best results during training relative to some cost function.
Nonetheless, multi-layer neural networks are a powerful tool for learning complex representations of data and TensorFlow makes it easy to experiment with different network configurations.