TensorFlow ResNet: Building, Training and Scaling Residual Networks on TensorFlow
ResNet won first place in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015. Its residual connections made it one of the first very deep networks to largely overcome the “vanishing gradient” problem. TensorFlow makes it easy to build ResNet models: you can run a pre-trained ResNet-50 model, or build your own custom ResNet implementation. We show how to do this with the ImageNet and CIFAR-10 datasets.
In this article you will learn:
- What is ResNet
- What are Identity Shortcut Connections
- Why it’s difficult to run ResNet without infrastructure automation
- Options for Running ResNet on TensorFlow
- Scaling ResNet on TensorFlow with MissingLink
What is ResNet?
Residual Network (ResNet) is a Convolutional Neural Network (CNN) architecture designed to train very deep neural networks. In theory, a deeper neural network should perform better on the training set, because its additional layers can learn progressively more complex features. In reality, training error increases when a plain network gets too deep. With ResNet, training error actually decreases as the network gets deeper.
ResNet provides a breakthrough solution to the “vanishing gradient” problem. Vanishing gradient is a difficulty encountered when you train artificial neural networks with gradient-based methods like backpropagation. With these methods, the gradients of the loss function approach zero as you add more layers to the network. This makes it hard to learn and tune the parameters of the earlier layers in the network.
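To make the effect concrete, here is a toy numerical sketch (plain Python, not TensorFlow; the function name and the unit-weight assumption are ours): with sigmoid activations, backpropagation multiplies the gradient by the activation’s derivative at every layer, and the sigmoid’s derivative never exceeds 0.25, so an upper bound on the gradient shrinks geometrically with depth.

```python
def sigmoid_grad_bound(depth):
    """Upper bound on the backpropagated gradient after `depth` sigmoid
    layers with unit weights: the sigmoid's derivative is at most 0.25,
    so the product of the layer derivatives is at most 0.25 ** depth."""
    return 0.25 ** depth

# The bound collapses quickly as layers are added:
for depth in (1, 10, 30):
    print(depth, sigmoid_grad_bound(depth))
```

This is why the earliest layers of a deep plain network receive almost no learning signal.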
What are Identity Shortcut Connections?
ResNet is based on “shortcut connections”: connections that skip one or more layers, creating a residual block. Instead of learning a full transformation, the block learns a residual function F(x) and outputs F(x) + x, where x arrives unchanged through the shortcut.
Residual blocks allow you to train much deeper neural networks. ResNet is structured by taking many of these blocks and stacking them together to form a deep network.
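A residual block can be sketched in a few lines. The toy below uses dense layers and NumPy rather than TensorFlow (real ResNet blocks use convolutions and batch normalization); it shows the defining property that the block computes relu(F(x) + x), so when F(x) = 0 the block reduces to the identity and the signal passes through the shortcut unimpeded.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Toy residual block: y = relu(F(x) + x), with F two dense layers.
    (A sketch; real ResNet blocks use convolutions and batch norm.)"""
    f = relu(x @ w1) @ w2   # the "residual" branch F(x)
    return relu(f + x)      # identity shortcut adds the input back

# With zero weights, F(x) = 0 and the block passes x through unchanged:
x = np.array([[1.0, 2.0]])
w = np.zeros((2, 2))
print(residual_block(x, w, w))  # -> [[1. 2.]]
```

Because the shortcut is an identity, gradients also flow straight through it during backpropagation, which is what lets very deep stacks of these blocks train successfully.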
Why it’s Difficult to Run ResNet Yourself and How MissingLink Can Help
ResNet can have between dozens and thousands of convolutional layers, and can take a long time to train and execute — from hours to several weeks in extreme cases. You will need to distribute a ResNet model across multiple GPUs, and if performance is insufficient, scale out to multiple machines.
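The multi-GPU case can be sketched with TensorFlow’s built-in data parallelism (a minimal sketch, assuming TF 2.x; `tf.distribute.MirroredStrategy` replicates the model on every visible GPU and falls back to a single replica on CPU-only machines):

```python
import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU and
# averages gradients across replicas after each training step.
strategy = tf.distribute.MirroredStrategy()
print("replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Build and compile the model inside the strategy scope
    model = tf.keras.applications.ResNet50(weights=None, classes=10)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy")

# model.fit(...) would then split each global batch across the replicas
```

This only scales within one machine, which is exactly the limitation the following sections discuss.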
However, you’ll find that running a deep learning model on multiple machines is difficult:
- On-premises—you will need to set up multiple machines for deep learning, run experiments manually, and keep the hardware well utilized.
- In the cloud—you can spin up machines quickly, but you will need to build and test machine images, and manually run experiments on each machine. You’ll need to “babysit” your machines to ensure an experiment is always running, and to avoid wasting money with expensive GPU machines.
MissingLink solves all that. It’s a deep learning platform that lets you scale out ResNet and other computer vision models automatically across numerous machines.
Just set up jobs in the MissingLink dashboard, define your cluster of on-premise or cloud machines, and the jobs will automatically run on your cluster of machines. You can train a ResNet model in minutes – not hours or days.
To avoid idle time, MissingLink immediately runs another experiment when the previous one ends, and cleanly shuts down cloud machines when all jobs complete.
Learn more about the MissingLink platform.
Options for Running ResNet on TensorFlow
Using a Pre-Trained Model
The TensorFlow official models are a collection of example models that use TensorFlow’s high-level APIs. The official TensorFlow ResNet model contains an implementation of ResNet for the ImageNet and CIFAR-10 datasets, written in TensorFlow.
You can download pre-trained versions of ResNet-50, available in several variants with different numerical precisions and accuracies. You can use transfer learning to speed up training: a pre-trained model serves as the starting point for a new task, so you keep its learned feature extractors and train only new layers on top. In addition, you can freeze all of the layers except the final fully connected layers when fine-tuning your model.
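A transfer-learning setup along those lines might look like the sketch below. Note the hedges: `weights="imagenet"` would load the pre-trained weights but triggers a download, so `weights=None` is used here to keep the sketch self-contained, and the 10-class head is hypothetical.

```python
import tensorflow as tf

# Base: ResNet-50 without its classification head.
# Use weights="imagenet" to load pre-trained weights (downloads them);
# weights=None keeps this sketch runnable offline.
base = tf.keras.applications.ResNet50(
    weights=None, include_top=False,
    input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # freeze all convolutional layers

# New fully connected head for a hypothetical 10-class task;
# only this layer is trained during fine-tuning.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```

Freezing the base drastically reduces the number of trainable parameters, which is why transfer learning converges in a fraction of the time of training from scratch.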
Train TensorFlow ResNet From Scratch
While transfer learning is a powerful technique, you’ll find it valuable to learn how to train ResNet from scratch. This makes you familiar with the full training process, from launching a TensorFlow environment and downloading and preparing ImageNet, to documenting and reporting the training results.
To illustrate the process, here is Exxact’s TensorFlow code on how to train a ResNet model from scratch in TensorFlow:
1. Launch TensorFlow environment with Docker
nvidia-docker run -it -v /data:/datasets tensorflow/tensorflow:nightly-gpu bash
2. Download ImageNet
2.1 Clone the TPU repository
git clone https://github.com/tensorflow/tpu.git
2.2 Install the GCS dependencies
pip install gcloud google-cloud-storage
2.3 Download the files from Image-Net.org (the imagenet_to_gcs.py script is located in the tpu repository under tools/datasets)
python imagenet_to_gcs.py --local_scratch_dir=/data/imagenet --nogcs_upload
3. Download Official TensorFlow models
git clone https://github.com/tensorflow/models.git
4. Export PYTHONPATH
Export PYTHONPATH to point at the models folder on your machine. Be sure to replace /datasets/models with your folder path.
export PYTHONPATH="$PYTHONPATH:/datasets/models"
5. Install Dependencies
pip install --user -r official/requirements.txt
6. Run the training script imagenet_main.py
python imagenet_main.py --data_dir=/data/imagenet/train --num_gpus=2 --batch_size=64 --resnet_size=50 --model_dir=/data/imagenet/trained_model/Resnet50_bs64 --train_epochs=120
Scaling ResNet on TensorFlow with MissingLink
In this article, we learned the basics of ResNet and saw two ways to run TensorFlow ResNet:
- Using a pre-trained model and transfer learning
- Building ResNet components from scratch
Training ResNet is extremely computationally intensive, especially when working with a large number of layers. Don’t wait for hours or days for ResNet to train. Use the MissingLink deep learning platform to:
- Scale out ResNet automatically across numerous machines, either on-premise or in the cloud.
- Define a cluster of machines and automatically run deep learning jobs, with optimal resource utilization.
- Avoid idle time by immediately running experiments one after the other, and shutting down cloud machines cleanly when the jobs are complete.
MissingLink can also help you manage large numbers of experiments, track and share results, and manage large datasets and sync them easily to training machines.
Learn more about the MissingLink deep learning platform.