ConvNet: Deep Convolutional Networks

What’s ConvNet?

Convolutional network is a specific artificial neural network topology that is inspired by biological visual cortex and tailored for computer vision tasks by Yann LeCun in early 1990s. See http://deeplearning.net/tutorial/lenet.html for introduction.

The convolutional network implemented in ccv is based on Alex Krizhevsky’s ground-breaking work presented in:

ImageNet Classification with Deep Convolutional Neural Networks, Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, NIPS 2012

The parameters are modified based on Matthew D. Zeiler’s work presented in:

Visualizing and Understanding Convolutional Networks, Matthew D. Zeiler, and Rob Fergus, Arxiv 1311.2901 (Nov 2013)

The multi-GPU implementation was heavily influenced by:

One Weird Trick for Parallelizing Convolutional Neural Networks, Alex Krizhevsky, ICLR 2014

The VGG-D model trained is based on:

Very Deep Convolutional Networks for Large-Scale Image Recognition, Karen Simonyan, Andrew Zisserman, ICLR 2015*

How it works?

With advances in GPGPU programming, we can have very deep convolutional networks (with over 50 million parameters) trained on millions of images. It turns out that once you have both and a bag of tricks (dropout, pooling etc.), the resulted convolutional networks can achieve good image classification results.

> ./cnnclassify ../samples/dex.png ../samples/image-net-2012-vgg-d.sqlite3 | ./cnndraw.rb ../samples/image-net-2012.words ../samples/dex.png output.png

Check output.png, the convolutional networks suggest a few possible relevant classes in the top left chart.

What about the performance?

ConvNet on the very large scale is not extremely fast. There are a few implementations available for ConvNet that focused on speed performance, such as Caffe from Berkeley, or OverFeat from NYU. Although not explicitly optimized for speed (ccv chooses correctness over speed in this preliminary implementation), the ConvNet implementation presented in ccv speed-wise is inline with other implementations.

Therefore, the analysis related to performance is implemented on ImageNet dataset and the network topology followed the exact specification detailed in the paper.

Accuracy-wise:

The test is performed on ILSVRC 2012 validation dataset when use vgg_d_params.

For ILSVRC2012 dataset, the training stopped to improve at around 55 epochs, at that time, the central patch from validation set obtained 31.96% of top-1 missing rate (lower is better) and the training set obtained 28.00% of top-1 missing rate. In Alex’s paper, they reported 40.5% top-1 missing rate when averaging 10 patches. In Matt’s paper, they reported 38.4% top-1 missing rate when using 1 convnet as configured in Fig.3 and averaging 10 patches. In VGG’s paper, they reported 27.0% top-1 missing rate when averaging densely sampled image.

Assuming you have ILSVRC 2012 validation set files ordered in image-net-2012-val.txt, run

> ./cnnclassify image-net-2012-val.txt ../samples/image-net-2012-vgg-d.sqlite3 > image-net-2012-classify.txt

For complete validation set to finish, this command takes 6 hour on one GPU, and if you don’t have GPU enabled, it will take about two days to run on CPU.

Assuming you have the ILSVRC 2012 validation ground truth data in LSVRC2012_val_ground_truth.txt

> ./cnnvldtr.rb LSVRC2012_val_ground_truth.txt image-net-2012-classify.txt

will reports the top-1 missing rate as well as top-5 missing rate.

For 32-bit float point image-net-2012-vgg-d-32bit.sqlite3 on GPU, the top-1 missing rate is 29.61%, 8.56% (in absolute) better than ccv’s previous result with 1 convnet, and 2.3% (in absolute) worse than VGG’s result with 1 convnet and configured with mode D. The top-5 missing rate is 9.9%, 6.3% better than previous and 1.1% worse than VGG’s. For half precision image-net-2012-vgg-d.sqlite3 (the one default downloaded to ./samples/), the top-1 missing rate is 29.61% and the top-5 missing rate is 9.9%.

You can download the 32-bit float point versions with ./samples/download-vgg-d-32bit.sh

See http://www.image-net.org/challenges/LSVRC/2014/results for the current state-of-the-art, ccv’s implementation is still about 3.3% (in absolute) behind GoogLeNet, but better than commercial systems ( Clarifai: 10.7%).

For half-precision image-net-2012.sqlite3 (matt_params), the top-1 missing rate is 38.17%, and the top-5 missing rate is 16.22%.

Speed-wise:

The experiment conducted on a computer with Core i7 5930K, 4-way hybrid of NVIDIA TITAN and NVIDIA TITAN Black graphic card at stock frequency, Samsung MZ-7TE500BW 500GiB SSD with clang, libdispatch, libatlas and GNU Scientific Library.

The CPU version of forward pass (from RGB image input to the classification result) takes about 700ms per image. This is achieved with multi-threaded convolutional kernel computation. Decaf ( the CPU counter-part of Caffe) reported their forward pass at around 0.5s per image with unspecified hardware over 10 patches (the same as ccv’s cnnclassify implementation). I cannot get sensible number off OverFeat on my machine (it reports about 1.4s for forward pass, that makes little sense). Their reported number are 1s per image on unspecified configuration with unspecified hardware (I suspect that their unspecified configuration does much more than the averaging 10 patches ccv or Decaf does).

For AlexNet 12, the GPU version does forward pass + backward error propagate for batch size of 128 in about 0.664s. Thus, training ImageNet convolutional network takes about 186 hours with 100 epochs. Caffe reported their forward pass + backward error propagate for batch size of 256 in about 1.3s on NVIDIA TITAN. In the paper, Alex reported 90 epochs within 6 days on two GeForce 580. In “Multi-GPU Training of ConvNets” (Omry Yadan, Keith Adams, Yaniv Taigman, and Marc’Aurelio Ranzato, arXiv:1312.5853), Omry mentioned that they did 100 epochs of AlexNet in 10.5 days on 1 GPU), which suggests my time is within line of these implementations.

For MattNet, the single GPU version does forward pass + backward error propagate for batch size of 128 in about 0.845s. With 4 GPUs, the MattNet can be trained at batch size of 512 in about 0.909s. Thus, 3.72x speed up. In AlexNet 14 One weird trick, the reported speed up is 3.74x. “Multi-GPU Training of ConvNets” reported 2.2x speed up with hybrid approach on 4 GPUs. The implementation is inline with AlexNet 14 findings.

For VGG-D model, the GPU version does forward pass + backward error propagate for batch size of 192 in about 5.439s with 4 GPUs.

For AlexNet 14, the reported time on one GPU with 90 epochs is 98.05 hours. ccv’s implementation of AlexNet 14 does forward pass + backward error propagate for batch size of 128 in about 0.55s, thus, for 90 epochs, will take 137.6 hours.

As a preliminary implementation, I didn’t spend enough time to optimize these operations in ccv if any at all. For example, cuda-convnet implements its functionalities in about 10,000 lines of code, Caffe implements with 14,000 lines of code, as of this release, ccv implements with about 4,300 lines of code. For the future, the low-hanging optimization opportunities include using cuDNN, doing FFT in densely convolved layers etc.

How to train my own image classifier?

First, you need to figure out your network topology. For all intents and purposes, I will walk you through how to train with ImageNet LSVRC 2012 data.

You need three things: the actual ImageNet dataset (and metadata), a CUDA GPU with no less than 6GiB on-board memory and a sufficient large SSD device to hold ImageNet dataset (otherwise loading data from your rotational disk will take more time than the actual computation).

I downloaded the ImageNet dataset from this torrent:

Assuming you’ve downloaded / bought all these and installed on your computer, get a hot tea, it will take a while to get all the puzzles and riddles in place for the training starts.

The ImageNet metadata for 2012 challenge can be downloaded from http://www.image-net.org/challenges/LSVRC/2012/download-public

Unfortunately, the metadata are stored in Matlab proprietary format, there are some conversion work to be done. Here will demonstrate how to use Octave to do this. Install Octave on Linux-like system is easy, for me on Ubuntu, it is about one line:

> sudo apt-get install octave

Assuming you’ve downloaded devkit-1.0 from the above link, and found meta.mat file somewhere in that tarball, launching Octave interactive environment and run:

octave> file = fopen('meta.txt', 'w+')
octave> for i = 1:1000
octave>	      fprintf(file, "%d %s %d\n", synsets(i).ILSVRC2012_ID, synsets(i).WNID, synsets(i).num_train_images)
octave> endfor
octave> fclose(file)

The newly created meta.txt file will give us the class id, the WordNet id, and the number of training image available for each class.

The ImageNet data downloaded from the torrent puts the training images into directories named by the WordNet ids.

> find <ImageNet dataset>/train/ -name "*.JPEG" > train-file.txt

I use this script to generate format that ccv understands: https://gist.github.com/liuliu/8393461

The test dataset is ordered numerically, thus,

> find <ImageNet dataset>/test/ -name "*.JPEG" > test-file.txt

will generate file list corresponding to ILSVRC2012_test_ground_truth.txt for class ids.

This script: https://gist.github.com/liuliu/8393516 will generate the plain text that ccv understands for tests.

These images need to be first pre-processed to correct size for training.

I partially replaced ./bin/image-net.c with this snippet: https://gist.github.com/liuliu/8906523 to generate files suffixed with .resize.png. Compile and run:

> ./image-net --train-list ~/Fast/imageNet/train-file.txt --test-list ~/Fast/imageNet/test-file.txt --base-dir ~/Fast/imageNet --working-dir image-net.sqlite3

The resize will take about 3 hours, and after that, train.txt and test.txt are generated from train-file.txt and test-file.txt by suffixing .resize.png on every line.

Now, everything is ready. Assuming you have 4 TITAN GPUs as I do, it takes one and half days. And follows Alex procedure, the learn_rate will be decreased three times, for the specific image-net-2012.sqlite3 you see in ./samples, I started with 0.01 learn_rate, decreased to 0.001 at 30th epoch, and then decreased to 0.0001 at 60th epoch, and then decreased to 0.00001 at 80th epoch.

The generated image-net-2012.sqlite3 file is about 600MiB in size because it contains data needed for training and resume. You can either open this file with sqlite command-line tool (it is a vanilla sqlite database file), and do:

sqlite> drop table conv_vary_params;
sqlite> drop table momentum_data;
sqlite> drop table function_state;
sqlite> vacuum;

The file size will shrink to about 200MiB. You can achieve further reduction in file size by rewrite it into half-precision, with ccv_convnet_write and write_param.half_precision = 1. The resulted image-net-2012.sqlite3 is exactly what I included in ./samples.

Can I use the ImageNet pre-trained data model?

ccv is released under FreeBSD 3-clause license, and the pre-trained data models ./samples/image-net-2012.sqlite3 and ./samples/image-net-2012-vgg-d.sqlite3 are released under Creative Commons Attribution 4.0 International License. You can use it, modify it practically anywhere and anyhow with proper attribution. As far as I can tell, this is the first pre-trained data model released under commercial-friendly license (Caffe itself is released under FreeBSD license but its pre-trained data model is “research only” and OverFeat is released under custom research only license).

Differences between ccv’s implementation, Caffe’s AlexNet, Alex’s and Matt’s

Although the network topology of ccv’s implementation followed closely to Matt’s, the reported results diverged significantly enough for me to document the differences in implementation details.

Network Topology:

ccv’s local response normalization layer (if any) followed the convolutional layer, and the pooling layer is after the local response normalization. This is briefly mentioned in Alex’s paper, but in Caffe’s AlexNet, their local response normalization layer followed the pooling layer.

The input dimension to ccv’s implemented network is 225x225, and in Caffe, it is 227x227. Alex’s paper as well as Matt’s mentioned their input size is 224x224. For 225x225, it implies a 1 pixel padding around the input image such that with 7x7 filter and 2 stride size, a 111x111 output will be generated. However, the output of the first convolutional layer in Matt’s paper is 110x110.

Data Preparation:

Caffe’s implementation resizes image into 256x256 size without retaining aspect ratio. Alex’s implementation resizes image into sizes such that the minimal dimension is 256 while retains the aspect ratio (at least as the paper implied) and cropped the image into 256x256 size. ccv’s implementation resizes image into sizes such that the minimal dimension is 257 while retains the aspect ratio (downsamples with CCV_INTER_AREA interpolation and upsamples with CCV_INTER_CUBIC interpoliation if needed). ccv’s implementation obtains the mean image from center cropped 257x257 images.

Data Augmentation:

Caffe’s implementation randomly crops image from 256x256 to 227x227. Alex’s implementation randomly crops image from 256x256 to 224x224 and then applied color augmentation with Gaussian random coefficient sampled with sigma == 0.1. ccv’s implementation randomly crops image from the aspect retained sizes into 257x257, subtract the mean image and then randomly crops it into 225x225, color augmentation is applied with Gaussian random coefficient sampled with sigma == 0.001. Additional image color augmentation is performed to stretch the brightness, the contrast and the saturation between 0.8 to 1.2. All three implementations did horizontal mirroring as a data augmentation technique.

Averaged Classification:

Caffe averages the softmax output of 10 patches from the test image by first resize image into 256x256 without retaining aspect ratio, and then the first 5 patches of size 227x227 cropped from top left, top right, center, bottom left, bottom right of the resized test image, the second 5 patches are the horizontal mirrors of the first 5 patches.

Alex’s implementation averages the softmax output of 10 patches from the test image by first resize image into sizes such that the minimal dimension is 256 while retains the aspect ratio and then center-crops into 256x256. The 10 patches of size 224x224 are sampled from the 256x256 crop the same way as Caffe did.

ccv’s GPU implementation averages the softmax output of 30 patches from the test image by first resize the image into sizes such that the minimal dimension is 257. Then it makes 3 crops from top left, center, and bottom right so that the cropped image is 257x257. The cropped images subtract mean image, and then each cropped from top left, top right, center, bottom left, bottom right into 225x225. This generates 15 patches, and each one of them has its horizontally-mirrored counter-part.

ccv’s CPU implementation for efficiency considerations averages the softmax output of 10 patches from the test image by first resize the image into sizes such that the minimal dimension is 257. The mean image is upsampled into the same size with CCV_INTER_CUBIC and then is subtracted from the resized image. The top left, top right, center, bottom left, bottom right patches of 225x225 is extracted and horizontally mirrored to generate the 10 patches.