I’ve nursed a side interest in machine learning and computer vision since my time in graduate school. When Google released its Tensorflow framework and Inception architecture, I decided to do a deep dive into both technologies in my spare time.
The Inception model is particularly exciting because it’s been battle-tested, delivering world-class results in the widely acknowledged ImageNet Large Scale Visual Recognition Challenge (ILSVRC). It’s also designed to be computationally efficient, using 12x fewer parameters than AlexNet, the 2012 ILSVRC winner, which allows Inception to be used on less-powerful systems.
I wrote this series because I couldn’t easily bridge the gap between Tensorflow’s tutorials and doing something practical with Inception. Inspired by Inception’s own origins, this series will “go deeper,” presenting a soup-to-nuts walkthrough that uses Inception to train an MNIST (handwritten digits) classifier. While the goal isn’t to get to world-class performance, we’ll get a model that performs at >99% accuracy.
Who Should Use this Tutorial Series?
This is a practical introduction, so it’s not focused on the theories that underlie neural networks, computer vision, or deep learning models (though there will be a few remarks about the general motivation behind Inception in Part 2).
Instead, this tutorial is aimed at folks who have done the basic Tensorflow tutorials and want to “go a little deeper” to apply Inception to their own projects. Think graduate student embarking on a project, or a software engineer who’s been asked to build a training and inference pipeline.
The tutorial is roughly divided into 4 parts:
- Part 1: Using Slim to Build Deep Architectures
Deep architectures are complicated beasts. Slim is a library that can help you tame the complexity.
- Part 2: Introduction to Inception
How is Inception put together? What is its rough architecture?
- Part 3: Training Inception on a Novel Dataset
How do I apply Inception to my dataset? How do I use Tensorboard to monitor the training?
- Part 4: Using Inception for Inference
I’ve trained. How do I predict new cases?
Obviously, you’ll need a workstation with Tensorflow installed. One of the easiest ways I’ve found is to use Docker and just grab the latest image. This works well for smaller projects and experimentation.
Training deep learning models like Inception is so computationally intensive that running it on your laptop is impractical. Instead, you’ll need access to a GPU. A few options:
- Build your own: I’ve done most of my training on Amazon using a p2.xlarge GPU instance that I built from scratch. For that, you’ll need to build Tensorflow with Nvidia’s drivers. Directions on how to do that are here.
- Use a pre-built AMI: Amazon has a pre-built AMI with a variety of pre-packaged frameworks (MXNet, Caffe, Tensorflow, Theano, Torch and CNTK).
- Paperspace: For those who don’t want to futz with building their own box or the hassle of running a server, Paperspace offers a GPU-enabled Linux desktop box in the cloud that is purpose-built for machine learning. Signup is fast and easy and gives you access to a desktop computing environment through your browser.
I’ve also created a Github repository with code samples that we’ll use in the series. Check it out here.
Using Slim to Build Deep Architectures
At the core of Tensorflow is the notion of a computational graph. Operations in our neural network (e.g., convolution, bias adding, dropout, etc.) are all modeled as nodes and edges in this graph. Defining an architecture for a learning task is tantamount to defining this graph.
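To make the graph idea concrete, here is a minimal pure-Python sketch. It is a toy, not how Tensorflow is implemented: the Node class and its evaluate method are hypothetical names invented for this illustration. It only shows the two-phase pattern of first defining nodes and edges, then running the graph by feeding in values.

```python
# Toy computational graph: an illustration of the idea only,
# NOT Tensorflow's actual implementation. `Node` is a hypothetical class.
class Node:
    """One operation in the graph; `inputs` are its incoming edges."""

    def __init__(self, name, op, inputs=()):
        self.name = name
        self.op = op              # None marks an input (placeholder-style) node
        self.inputs = tuple(inputs)

    def evaluate(self, feed):
        # Input nodes read their value from the `feed` dictionary.
        if self.op is None:
            return feed[self.name]
        # Other nodes first evaluate the nodes they depend on.
        args = [node.evaluate(feed) for node in self.inputs]
        return self.op(*args)


# Phase 1: define the graph. Nothing is computed yet.
x = Node("x", None)
w = Node("w", None)
b = Node("b", None)
product = Node("multiply", lambda a, c: a * c, [x, w])
biased = Node("add_bias", lambda a, c: a + c, [product, b])
relu = Node("relu", lambda a: max(a, 0.0), [biased])

# Phase 2: run the graph by feeding values into the input nodes.
result = relu.evaluate({"x": 2.0, "w": 3.0, "b": -1.0})
print(result)  # 5.0
```

Defining an architecture corresponds to Phase 1; training and inference correspond to repeatedly running Phase 2 with different data.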
Tensorflow provides many primitives for defining these graphs and if you’ve run through the introductory tutorials you’ve undoubtedly encountered them when constructing simple neural net architectures. However, as the complexity of our architecture grows, these simple primitives become cumbersome.
Slim is a Tensorflow library that bundles commonly used building blocks like convolution and max pooling. Access it by simply importing it:
import tensorflow.contrib.slim as slim
Using slim makes it simple to chain multiple building blocks together. For instance, the following code will create a neural network with two layers, each with 256 hidden units:
def my_neural_network(input):
    # Two fully-connected layers of 256 hidden units each
    net = slim.fully_connected(input, 256, scope='layer1-256-fc')
    net = slim.fully_connected(net, 256, scope='layer2-256-fc')
    return net

input = load_data()
output = my_neural_network(input)
Slim will do all of the heavy lifting; it defines the appropriate weight and bias variables and links them in the appropriate way. Even more conveniently, Slim does all of this under a named scope that you provide, allowing you to navigate your architecture in Tensorboard.
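As a rough picture of that heavy lifting, the sketch below mimics the bookkeeping in plain Python. It is not Slim’s code: the variables registry, get_variable helper, Gaussian initialization, and 784-value input are all assumptions made for illustration. What it does mirror is the general pattern of creating a weights and a biases variable under the scope name you pass in, then applying a ReLU (the default activation for slim.fully_connected).

```python
import random

variables = {}  # hypothetical registry mapping "scope/name" -> values


def get_variable(name, make):
    """Create a variable on first use, then reuse it by name."""
    if name not in variables:
        variables[name] = make()
    return variables[name]


def fully_connected(inputs, num_outputs, scope):
    """Toy dense layer for one input vector: relu(inputs . W + b)."""
    num_inputs = len(inputs)
    weights = get_variable(
        scope + "/weights",
        # Assumed initializer, chosen only to keep the sketch short.
        lambda: [[random.gauss(0.0, 0.05) for _ in range(num_outputs)]
                 for _ in range(num_inputs)],
    )
    biases = get_variable(scope + "/biases", lambda: [0.0] * num_outputs)
    outputs = []
    for j in range(num_outputs):
        total = biases[j] + sum(x * row[j] for x, row in zip(inputs, weights))
        outputs.append(max(total, 0.0))  # ReLU
    return outputs


random.seed(0)
pixels = [random.random() for _ in range(784)]  # stand-in for one flattened 28x28 image
net = fully_connected(pixels, 256, scope="layer1-256-fc")
net = fully_connected(net, 256, scope="layer2-256-fc")
print(sorted(variables))  # variable names grouped by scope
print(len(net))           # 256
```

The scope-prefixed names are what let a tool like Tensorboard group each layer’s variables and operations into a single collapsible box.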
NielsenNet: A Guided Example
As a simple example of how Slim builds more complicated architectures, consider the MNIST classifier that is presented in Chapter 6 of Michael Nielsen’s wonderful textbook “Neural Networks and Deep Learning.” The neural network, which I’ve christened “NielsenNet,” consists of:
- A 28×28 input representing a monochrome image of a handwritten digit
- A convolution layer with 20 kernels, stride=1, size=5, followed by 2×2 max pooling. The convolution layer is padded to maintain the spatial dimensions.
- Another convolution layer with 40 kernels, stride=1, size=5, again followed by 2×2 max pooling. This time the input is not padded to maintain dimensions.
- A fully-connected layer of 1000 hidden units with dropout
- Another fully-connected layer of 1000 hidden units, again with dropout
- An output layer of 10 units, corresponding to the 10 output classes for the MNIST classification problem.
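Before looking at an implementation, it can help to trace how the tensor shapes change through the layers listed above. The following is a small sketch in plain Python, with no Tensorflow required. The helper names are hypothetical, "SAME" and "VALID" are used as labels for the padded and unpadded convolutions, and the parameter counts assume every layer has a bias term, which the list above does not state.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    if padding == "SAME":    # padded so that stride 1 keeps the size
        return -(-size // stride)  # ceiling division
    if padding == "VALID":   # no padding: the kernel must fit entirely
        return (size - kernel) // stride + 1
    raise ValueError("unknown padding: " + padding)


def trace_nielsennet():
    """Return [(layer name, output shape, parameter count), ...]."""
    layers = []
    size, channels = 28, 1  # 28x28 monochrome input

    # Two stages of 5x5 convolution (stride 1) followed by 2x2 max pooling.
    for name, kernels, padding in [("conv1", 20, "SAME"), ("conv2", 40, "VALID")]:
        params = 5 * 5 * channels * kernels + kernels  # weights + assumed biases
        size = conv_output_size(size, kernel=5, stride=1, padding=padding)
        size //= 2  # non-overlapping 2x2 max pooling halves each side
        channels = kernels
        layers.append((name + "+pool", (size, size, channels), params))

    units = size * size * channels  # flatten before the dense layers
    for name, width in [("fc1", 1000), ("fc2", 1000), ("output", 10)]:
        layers.append((name, (width,), units * width + width))
        units = width
    return layers


layers = trace_nielsennet()
for name, shape, params in layers:
    print(name, shape, params)
print("total parameters:", sum(p for _, _, p in layers))
```

Under these assumptions the second pooling stage produces a 5×5×40 volume, so exactly 1000 values are flattened into the first fully-connected layer, and almost all of the roughly two million parameters live in the two 1000-unit dense layers.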
This architecture is implemented here using slim:
To see how this code is used, I’ve created a Jupyter notebook that trains the NielsenNet against the included MNIST dataset for 100k steps. It ends with an accuracy of approximately 99.51%. Not bad for such a simple network!
Using Tensorboard, we can even visualize the graph that’s created, giving you an overview of your architecture and how all of the major pieces connect.
One nice feature of using slim is that your basic building blocks are automatically associated within a named scope, which makes it easy to visualize the overall structure and connectivity of your neural network architecture.
This will become important as the complexity of the models we tackle grows.
Introduction to Inception
Inception was developed at Google to provide state-of-the-art performance on the ImageNet Large-Scale Visual Recognition Challenge and to be more computationally efficient than its competitor architectures. However, what makes Inception exciting is that its architecture can be applied to a whole host of other learning problems in computer vision.
This tutorial focuses on retraining Inception on our old friend, the MNIST dataset. The goal is not to attain world-class performance on digit classification (in fact, using Inception is probably overkill) but to get experience on a known problem. In other words, it’s a toy problem, but running through the effort will give us good experience for tackling other problems.
High Level Overview of the Inception Architecture
Below is an overview of the Inception architecture that I’ve liberated from the README. I’ve added some annotations which will allow you to get your bearings when you’re looking at Tensorboard histograms or digging into the code.