I’ve recently been trying to wrap my head around deep learning algorithms and the frameworks used to implement them. I’ve done most of my ML work in R, and that’s been great. However, R doesn’t have great support for neural nets, and its design doesn’t handle huge datasets well. There are now several frameworks specifically designed for implementing neural networks and simplifying their computation. Most of these libraries are written in C/C++ for speed, with python (numpy) wrappers. Specifically, I’ve been starting to experiment with TensorFlow and Caffe. Both are widely used, and both expose python APIs.

As a fun project, I decided to teach my computer to tell the difference between pictures of dogs and pictures of cats. As recently as 2007, this was a very difficult computer vision problem, but advances in deep learning have made it relatively trivial. It’s now a common toy example used to demonstrate the setup and execution of convolutional neural nets. In this post, I’ll introduce some of the code I used to build my cat/dog classifier, using TensorFlow, and specifically, a high-level wrapper called TFLearn. TFLearn is designed to make the coding of CNN models easier – and man, it’s pretty easy. The hardest part of developing the model, I found, was keeping track of the dimensions of your numpy arrays. Note: There are a lot of great tutorials out there that I drew heavily on while putting this script together. When I get a bit more time, I’d like to put together a more in-depth tutorial.

In the end, I was able to get a classifier with ~99% accuracy on a validation set. I was pretty psyched about this, since this was my first foray into deep learning. It goes to show that these frameworks (TensorFlow, TFLearn, Caffe) can vastly simplify the process of building and training complex models.

There are four main parts of our program.

1. Get the data into python.
2. Prepare the data for the algorithm.
3. Define the network architecture.
4. Train the model and get the results.

Input: I used the cat/dog dataset available on Kaggle. This gives 25,000 labeled images of cats and dogs to work with. CNNs work with (multidimensional) arrays of pixel values, not raw images, so the first thing to do is to read the images into python. I actually found this first step to be the most challenging, since I haven’t done much with image I/O in python. Most posts I found on the interwebs skipped this part entirely. In the end, the step turned out to be pretty much trivial: the cv2 (OpenCV) and skimage libraries both provide simple methods of reading images into numpy arrays of pixel values.

My function to read the pixel values and parse the label from a filename

Of course, we have an entire directory of labeled training images, all of which we’d like to process.

A function to read a directory of images to a big array of 3-D arrays of pixel values and a big 1-D array of labels

TFLearn has its own image loader. However, it expects your images to be in separate directories for each category. I decided that (1) I liked having all my training examples in a single directory, and (2) I wanted the experience of writing a custom loader.

Preprocessing: The images you feed into a CNN should all be of the same size. The loader I wrote above has an optional resize argument that will resize each image to IMG_SIZE x IMG_SIZE pixels, and save them to another directory, so they can be loaded more quickly (without resizing) later on.

Resize the images to be a consistent square size.

You also often want to scale and center the training data to make it more consistent. TFLearn has built-in utilities for this.

Scale and center the data
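TFLearn’s ImagePreprocessing object (with add_featurewise_zero_center() and add_featurewise_stdnorm(), passed to the input layer) applies this at training time. The transform itself is simple enough to sketch in plain numpy:

```python
import numpy as np

def scale_and_center(X):
    """Zero-center and scale the data to unit standard deviation --
    the same featurewise transform TFLearn's ImagePreprocessing applies
    via add_featurewise_zero_center() and add_featurewise_stdnorm()."""
    X = X.astype(np.float32)
    X -= X.mean()  # center: subtract the global mean
    X /= X.std()   # scale: divide by the global standard deviation
    return X
```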

And finally, preprocessing the labels. TF expects a so-called ‘one-hot’ representation of the labels: a binary vector of length n (for n classes) with a one in the position of the class being represented and zeros for all other classes.

Parse the labels
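TFLearn ships a helper for this (tflearn.data_utils.to_categorical) that works on integer class indices; for my string labels, the encoding looks like this (function name is my own):

```python
import numpy as np

def one_hot(labels, classes=('cat', 'dog')):
    """One-hot encode string labels: a length-n binary vector per example,
    with a 1 for the example's class and 0s everywhere else."""
    index = {c: i for i, c in enumerate(classes)}
    Y = np.zeros((len(labels), len(classes)), dtype=np.float32)
    for row, label in enumerate(labels):
        Y[row, index[label]] = 1.0
    return Y
```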

Defining network architecture: With TensorFlow, and TFLearn in particular, it is quite easy to build complex CNN (and other deep learning) models. The mathematics are all wrapped in tidy functions, so it’s basically just plug-and-play. Yay abstraction!

Blueprint the CNN architecture

I am far from an expert on architecture. To be honest, I just stacked a bunch of layers together in a plausible order. I did a bit of reading, and I know the general sequences (conv -> pool) and (fully-connected -> dropout -> softmax) are pretty standard. Big-name CNNs (AlexNet, etc.) put a ton of thought into the sequence of their layers. I wonder what additional accuracy I could get by reordering the layers I have here.

Train the model: Finally, it is time to train the classifier. Here, I’m telling it to train on the training set X with labels Y. I also tell it to use a validation set for determining how accurate it is. I tell it to train for 12 epochs, meaning that every training example will be seen 12 times.

Train the model

One thing I encountered that turned out to be very useful was saving and loading models. To save a model once it has been fitted,

Save the model