Deep Learning Frameworks

TensorFlow Conv2D Layers: A Practical Guide

A Convolutional Neural Network (CNN) has three important building blocks:

A convolutional layer that extracts features from the image or parts of an image

A subsampling or pooling layer that reduces the dimensionality of each feature to focus on the most important elements (typically there are several rounds of convolution and pooling)

A fully connected layer that takes a flattened form of the features identified in the previous layers, and uses them to make a prediction about the image.

In TensorFlow, you build a CNN architecture using the following process:

1. Reshape input if necessary using tf.reshape() to match the convolutional layer you intend to build (for example, if using a 2D convolution, reshape it into four-dimensional format: [Batch Size, Height, Width, Channel], as in the example below)

2. Create a convolutional layer using tf.nn.conv1d(), tf.nn.conv2d(), or tf.nn.conv3d(), depending on the dimensionality of the input (you are here – this article explains tf.nn.conv2d() in more detail)

3. Create a pooling layer using tf.nn.max_pool()

4. Repeat steps 2 and 3 for additional convolution and pooling layers

5. Reshape output of convolution and pooling layers, flattening it to prepare for the fully connected layer

6. Create a fully connected layer using the tf.matmul() function, add an activation using, for example, tf.nn.relu() (see all TensorFlow activations, or learn more in our guide to neural network activation functions), and apply dropout using tf.nn.dropout() (learn more about dropout in our guide to neural network hyperparameters)

7. Create the final layer for class prediction, again using tf.matmul()

8. Store weights and biases using TensorFlow variables

These are just the basic steps to create the CNN model; there are additional steps to define training and evaluation, execute the model and tune it – see our full guide to TensorFlow CNN.


What is a 2D Convolution and its Role in CNN Visual Recognition

A convolution layer extracts features from a source image by “scanning” the image with a filter of, for example, 5×5 pixels. For each 5×5 pixel region within the image, the convolution operation computes the dot product between the values of the image pixels and the weights defined in the filter.

A 2D convolution layer means that the input of the convolution operation is three-dimensional. This is a bit confusing, as you’d expect the input to be two-dimensional. But the “2D” in “2D convolution” refers to the movement of the filter, which traverses the image in two dimensions. For example, a color image has a value for each pixel across three channels: red, green and blue. The filter is then run across the image three times, once for each channel.
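
To make this concrete, here is a minimal NumPy sketch (not part of the original example) that slides a 3×3 filter over a 5×5 single-channel image with stride 1 and no padding, computing one dot product per position:

import numpy as np

# Toy example: slide a 3x3 filter over a 5x5 single-channel image
image = np.arange(25, dtype=np.float32).reshape(5, 5)
filt = np.ones((3, 3), dtype=np.float32)  # 3x3 filter of ones

# With stride 1 and no padding, the output is (5-3+1) x (5-3+1) = 3x3
out = np.zeros((3, 3), dtype=np.float32)
for i in range(3):
    for j in range(3):
        patch = image[i:i+3, j:j+3]
        out[i, j] = np.sum(patch * filt)  # dot product of patch and filter

print(out)

For a three-channel RGB image, the patch and the filter each gain a depth dimension, and the dot product also sums across the channels.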

[Image: 2D convolution]

The same convolutional structure is used successively, at first to identify features in the original image, and then to identify sub-features within smaller parts of the image, after downsampling or “pooling” the result of previous convolutions. Eventually, this process is meant to identify the essential features that can help classify the image. Learn more in our guide to Convolutional Neural Networks.  

 

Using tf.nn.conv2d – Code Example and Walkthrough

tf.nn.conv2d() is the TensorFlow function you can use to build a 2D convolutional layer as part of your CNN architecture. tf.nn.conv2d() is a low-level API which gives you full control over how the convolution is structured. To learn about a simpler functional interface called tf.layers.conv2d(), which abstracts these steps, see the following section.

We’ll illustrate how the tf.nn.conv2d() function works using the TensorFlow example by Aymeric Damien, which generates predictions for MNIST handwritten digits. The conv2d()-related code is shown below in the full context of the TensorFlow CNN model (omitting the code for executing model training).

import tensorflow as tf

# Import MNIST data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

# Training Parameters
learning_rate = 0.001
num_steps = 200
batch_size = 128
display_step = 10

# Network Parameters
num_input = 784 # MNIST data input (img shape: 28*28)
num_classes = 10 # MNIST total classes (0-9 digits)
dropout = 0.75 # Dropout, probability to keep units

# tf Graph input
X = tf.placeholder(tf.float32, [None, num_input])
Y = tf.placeholder(tf.float32, [None, num_classes])
keep_prob = tf.placeholder(tf.float32) # dropout (keep probability)


# Create some wrappers for simplicity
def conv2d(x, W, b, strides=1):
    # Conv2D wrapper, with bias and relu activation
    x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
    x = tf.nn.bias_add(x, b)
    return tf.nn.relu(x)


def maxpool2d(x, k=2):
    # MaxPool2D wrapper
    return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1],
                          padding='SAME')


# Create model
def conv_net(x, weights, biases, dropout):
    # MNIST data input is a 1-D vector of 784 features (28*28 pixels)
    # Reshape to match picture format [Height x Width x Channel]
    # Tensor input become 4-D: [Batch Size, Height, Width, Channel]
    x = tf.reshape(x, shape=[-1, 28, 28, 1])

    # Convolution Layer
    conv1 = conv2d(x, weights['wc1'], biases['bc1'])
    # Max Pooling (down-sampling)
    conv1 = maxpool2d(conv1, k=2)

    # Convolution Layer
    conv2 = conv2d(conv1, weights['wc2'], biases['bc2'])
    # Max Pooling (down-sampling)
    conv2 = maxpool2d(conv2, k=2)

    # Fully connected layer
    # Reshape conv2 output to fit fully connected layer input
    fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])
    fc1 = tf.add(tf.matmul(fc1, weights['wd1']), biases['bd1'])
    fc1 = tf.nn.relu(fc1)
    # Apply Dropout
    fc1 = tf.nn.dropout(fc1, dropout)

    # Output, class prediction
    out = tf.add(tf.matmul(fc1, weights['out']), biases['out'])
    return out

# Store layers weight & bias
weights = {
    # 5x5 conv, 1 input, 32 outputs
    'wc1': tf.Variable(tf.random_normal([5, 5, 1, 32])),
    # 5x5 conv, 32 inputs, 64 outputs
    'wc2': tf.Variable(tf.random_normal([5, 5, 32, 64])),
    # fully connected, 7*7*64 inputs, 1024 outputs
    'wd1': tf.Variable(tf.random_normal([7*7*64, 1024])),
    # 1024 inputs, 10 outputs (class prediction)
    'out': tf.Variable(tf.random_normal([1024, num_classes]))
}

biases = {
    'bc1': tf.Variable(tf.random_normal([32])),
    'bc2': tf.Variable(tf.random_normal([64])),
    'bd1': tf.Variable(tf.random_normal([1024])),
    'out': tf.Variable(tf.random_normal([num_classes]))
}

# Construct model
logits = conv_net(X, weights, biases, keep_prob)
prediction = tf.nn.softmax(logits)

# Define loss and optimizer
loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=logits, labels=Y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss_op)


# Evaluate model
correct_pred = tf.equal(tf.argmax(prediction, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
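
The listing above omits model execution. For completeness, a minimal training loop in the same TF 1.x style might look like the following (a sketch of the standard pattern, not the original author's exact code):

# Initialize the variables
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for step in range(1, num_steps + 1):
        batch_x, batch_y = mnist.train.next_batch(batch_size)
        # Run one optimization step, with dropout enabled for training
        sess.run(train_op, feed_dict={X: batch_x, Y: batch_y,
                                      keep_prob: dropout})
        if step % display_step == 0:
            # Report loss and accuracy with dropout disabled (keep_prob=1.0)
            loss, acc = sess.run([loss_op, accuracy],
                                 feed_dict={X: batch_x, Y: batch_y,
                                            keep_prob: 1.0})
            print("Step %d, loss=%.4f, accuracy=%.3f" % (step, loss, acc))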

 

In the implementation above, a utility function is defined which creates a 2D convolutional layer, then adds biases and applies a ReLU activation.

Let’s review the arguments of the conv2d() function:

x is the input – pixel values from the image.

W are the weights defined in the filter. The weights are defined as a four-dimensional tensor: [filter_height, filter_width, input_depth, output_depth].

input_depth represents the number of layers in the image, for example three layers for RGB.

output_depth represents the number of filters that should be applied to the image. Each filter is run through all the input layers, using a filter size defined by filter_height and filter_width, multiplies each input pixel by a weight, and sums up the results.

strides defines how fast the filter moves across the image – the number of pixels it shifts each step. The strides are defined as a 4-D tensor because the input has four dimensions: [number_of_samples, height, width, color_channels]. Setting the strides tensor to [1, strides, strides, 1] applies the filter to every image, every color channel, and every image patch in the height and width dimensions; the 1 at the beginning and end specifies that you won’t skip an image or an entire color channel. For example, strides=[1, 2, 2, 1] moves the filter two pixels at a time, so it visits only every other position along the height and width.

“SAME” padding specifies that the output size should be the same as the input size (at stride 1). To achieve this, TensorFlow zero-pads the border of the image with as many rows and columns as the filter size requires, and the filter slides over this padding area. Alternatively, you can use “VALID” padding, in which the filter stays inside the pixel area of the image, resulting in an output smaller than the input.
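
You can verify how strides and padding affect the output shape with a short sketch (the shapes are chosen to match the MNIST example above):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 28, 28, 1])  # batch of 28x28 images
W = tf.Variable(tf.random_normal([5, 5, 1, 32]))   # 5x5 filter, 1 input, 32 outputs

same_s1 = tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
same_s2 = tf.nn.conv2d(x, W, strides=[1, 2, 2, 1], padding='SAME')
valid_s1 = tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='VALID')

print(same_s1.shape)   # (?, 28, 28, 32) - output size preserved
print(same_s2.shape)   # (?, 14, 14, 32) - stride 2 halves height and width
print(valid_s1.shape)  # (?, 24, 24, 32) - no padding: 28 - 5 + 1 = 24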

 

Using tf.layers.conv2d as an Easier Functional Interface for tf.nn.conv2d

tf.layers.conv2d() creates a convolution filter that produces a tensor of outputs, and takes care of all aspects of the convolutional layer, including bias and activation. This is unlike the low-level tf.nn.conv2d() function, which only performs the convolution operation and requires that you define bias and activation separately.

Here are some of the important arguments of the tf.layers.conv2d() abstraction:

inputs – the input tensor, representing image pixels, which should have been reshaped into a 4-D format such as [batch, height, width, channels]

filters – the number of filters in the convolution (dimensionality of the output space).

kernel_size – the filter size, an integer or tuple of 2 integers, specifying the height and width of the convolution window. Set a single integer to use a filter with identical height and width.

strides – an integer or tuple of 2 integers, specifying how the filter should move along the height and width. Set a single integer to use the same stride value for both dimensions.

padding – "SAME", meaning the image is zero-padded at the borders so the output (at stride 1) keeps the input size, or "VALID", meaning the filter moves within the image with no padding, generating a smaller output.

data_format – specifies ordering of dimensions in the inputs, can be either channels_last (default, inputs with shape [batch, height, width, channels]) or channels_first (inputs with shape [batch, channels, height, width]).

dilation_rate – enables advanced convolutional structures with dilated (expanded) convolutions. Use an integer or tuple of 2 integers to specify the dilation rate.

activation – the activation function you’d like to use. Set to None for linear activation.

use_bias – boolean, specifies whether a bias should be added to the layer.

For all parameters, see the TensorFlow documentation.
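
As an illustration, here is a sketch (not the original article's code) of the same two convolution-plus-pooling blocks built with the higher-level tf.layers API; bias and activation are handled by the layer itself:

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 28, 28, 1])

conv1 = tf.layers.conv2d(inputs=x, filters=32, kernel_size=5,
                         padding='same', activation=tf.nn.relu)
pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=2, strides=2)

conv2 = tf.layers.conv2d(inputs=pool1, filters=64, kernel_size=5,
                         padding='same', activation=tf.nn.relu)
pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=2, strides=2)

print(pool2.shape)  # (?, 7, 7, 64) - matches the 7*7*64 fully connected input above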

Running CNN on TensorFlow in the Real World

In this article, we explained how to create a 2D convolutional layer in TensorFlow. When you start working on CNN projects and running large numbers of experiments, you’ll run into some practical challenges:  


Tracking experiment progress and hyperparameters across multiple experiments—CNNs can have a large number of possible variations which may impact your results. To test each of these, you will need to run and track numerous experiments.


Running experiments across multiple machines—running multiple CNN experiments, especially with large datasets, will require multiple machines or GPUs. Provisioning machines, distributing experiments between them and monitoring progress can become a burden.


Manage training data—if you work on CNN projects with images, video or other rich media, training sets can get very large. Copying this data to each training machine and replacing it for different experiments can be time-consuming. An automated way is needed to manage the data and copy it efficiently to deep learning machines.

MissingLink is a deep learning platform that can help you automate these operational aspects of CNN on TensorFlow, so you can concentrate on building winning experiments. Learn more.