
TensorFlow ResNet: Building, Training and Scaling Residual Networks on TensorFlow

ResNet won first place in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015. Its shortcut connections made it practical to train networks far deeper than earlier architectures, largely sidestepping the “vanishing gradient” problem. TensorFlow makes it easy to work with ResNet: you can run pre-trained ResNet-50 models or build your own custom ResNet implementation. We’ll show you how to do this with the ImageNet or CIFAR-10 datasets, and how to automate the process through MissingLink’s deep learning platform.

What is ResNet?

Residual Network (ResNet) is a Convolutional Neural Network (CNN) architecture designed to train very deep neural networks. In theory, a deeper neural network should perform at least as well on the training set, because the additional layers can process smaller and smaller features. In practice, the training error of a plain network starts to increase once the network becomes too deep. With ResNet, the training error keeps decreasing as the network gets deeper.

 

ResNet provides a breakthrough solution to the “vanishing gradient” problem. Vanishing gradient is a difficulty encountered when you train artificial neural networks with gradient-based methods like backpropagation. With these methods, the gradients of the loss function with respect to the parameters of the earlier layers approach zero as you add more layers to the network, because each layer multiplies the gradient by another small factor. This makes it hard to learn and tune the parameters of the earlier layers in the network.
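To make the shrinking gradient concrete, here is a minimal sketch in plain Python. It assumes a simplified chain of scalar sigmoid layers that all share the same weight and the same pre-activation value, which is an illustration only and not how a real network behaves layer by layer. The names `sigmoid_derivative` and `gradient_scale` are made up for this example.

```python
import math

def sigmoid_derivative(z):
    """Derivative of the logistic sigmoid at z; its maximum is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def gradient_scale(depth, z=0.0, weight=1.0):
    """Product of `depth` per-layer factors weight * sigmoid'(z) along one path.

    Simplifying assumption: every layer has the same weight and sees the same z.
    """
    return (weight * sigmoid_derivative(z)) ** depth

# Even in the best case for the sigmoid (z = 0), the factor is 0.25 per layer,
# so the gradient reaching the first layer shrinks geometrically with depth.
for depth in (5, 20, 50):
    print(depth, gradient_scale(depth))
```

Under these assumptions the gradient is divided by four for every extra layer, which is why the earliest layers of a very deep plain network barely learn.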


What are Identity Shortcut Connections?

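ResNet is based on “shortcut connections”. A shortcut connection skips over one or more layers and adds the input of those layers directly to their output, so the skipped layers only need to learn the residual: the difference between the desired output and the input. The skipped layers together with their shortcut form a residual block.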

 

Residual blocks allow you to train much deeper neural networks. ResNet is built by taking many of these blocks and stacking them together to form a deep network.
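The following sketch shows the arithmetic of a single residual block in plain Python. It is a simplification under stated assumptions: small dense layers stand in for the convolution and batch normalization layers of a real ResNet block, and the helper names `relu`, `dense`, and `residual_block` are invented for this illustration.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]

def dense(v, weights):
    """Multiply vector v by a square weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]

def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F(x) = w2 * relu(w1 * x).

    The "+ x" term is the identity shortcut connection.
    """
    fx = dense(relu(dense(x, w1)), w2)
    return relu([f + xi for f, xi in zip(fx, x)])

zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, 2.0]

# With all-zero weights F(x) is zero, so the block passes x through unchanged.
print(residual_block(x, zeros, zeros))  # [1.0, 2.0]
```

Because a block with zero weights already behaves like the identity for non-negative inputs, adding more blocks does not have to make the training error worse, which is the intuition behind stacking many of them.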


Why it’s Difficult to Run ResNet Yourself and How MissingLink Can Help

ResNet can have anywhere from dozens to thousands of convolutional layers and can take a long time to train and execute, from hours to several weeks in extreme cases. To keep training times manageable, you will need to distribute a ResNet model across multiple GPUs and, if performance is still insufficient, scale out to multiple machines.

 

However, you’ll find that running a deep learning model on multiple machines is difficult:

  • On-premises: you will need to set up multiple machines for deep learning, manually run experiments, and manage resource utilization yourself.
  • In the cloud: you can spin up machines quickly, but you will need to build and test machine images, and manually run experiments on each machine. You’ll need to “babysit” your machines to ensure an experiment is always running, and to avoid wasting money on expensive GPU machines.

 

MissingLink solves all that. It’s a deep learning platform that lets you scale out ResNet and other computer vision models automatically across numerous machines.


 

Just set up jobs in the MissingLink dashboard, define your cluster of on-premise or cloud machines, and the jobs will automatically run on your cluster of machines. You can train a ResNet model in minutes – not hours or days.

 

To avoid idle time, MissingLink immediately runs another experiment when the previous one ends, and cleanly shuts down cloud machines when all jobs complete.

 

Learn more about the MissingLink platform.


Options for Running ResNet on TensorFlow

Using a Pre-Trained Model

The TensorFlow official models are a collection of example models that use TensorFlow’s high-level APIs. The official TensorFlow ResNet model contains an implementation of ResNet for the ImageNet and CIFAR-10 datasets, written in TensorFlow.

 

You can download pre-trained versions of ResNet-50. Four versions of ResNet-50 are available, with different precision and accuracy. With transfer learning, you reuse a pre-trained model instead of training from scratch, which speeds up training considerably: the weights of the pre-trained model serve as the starting point for training on your new task. When fine-tuning your model, you can also freeze all of the layers except the final fully connected layers, so that only those final layers are updated.
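The effect of freezing layers can be sketched without any framework. The toy model below is an assumption made for illustration: a single "backbone" weight stands in for the pre-trained layers, a single "head" weight stands in for the final fully connected layer, and `fine_tune_step` is a hypothetical helper. In Keras, the corresponding step is setting `trainable = False` on the layers you want to freeze.

```python
def fine_tune_step(params, trainable, x, target, lr=0.1):
    """One SGD step on the loss 0.5 * (head * backbone * x - target) ** 2.

    Only parameters named in `trainable` are updated; the rest stay frozen.
    """
    feature = params["backbone"] * x
    error = params["head"] * feature - target
    grads = {
        "backbone": error * params["head"] * x,
        "head": error * feature,
    }
    return {
        name: value - lr * grads[name] if name in trainable else value
        for name, value in params.items()
    }

# Stand-in for downloaded pre-trained weights (values are made up).
pretrained = {"backbone": 2.0, "head": 0.0}

params = dict(pretrained)
for _ in range(50):
    params = fine_tune_step(params, trainable={"head"}, x=1.0, target=6.0)

# The frozen backbone keeps its pre-trained value; only the head has moved.
print(params)
```

Freezing most of the network means far fewer parameters are updated per step, which is one reason fine-tuning is much cheaper than training from scratch.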

Train TensorFlow ResNet From Scratch

While transfer learning is a powerful technique, you’ll find it valuable to learn how to train ResNet from scratch. Doing so familiarizes you with the full training process, from launching TensorFlow and downloading and preparing ImageNet to documenting and reporting training results.

 

To illustrate the process, here are Exxact’s steps for training a ResNet model from scratch in TensorFlow:

1. Launch TensorFlow environment with Docker 

nvidia-docker run -it -v /data:/datasets tensorflow/tensorflow:nightly-gpu bash

2. Download ImageNet

2.1 Clone the TPU repository

git clone https://github.com/tensorflow/tpu.git

2.2 Install the GCS dependencies

pip install gcloud google-cloud-storage

2.3 Download the files from Image-Net.org

python imagenet_to_gcs.py --local_scratch_dir=/data/imagenet --nogcs_upload

3. Download Official TensorFlow models

git clone https://github.com/tensorflow/models.git

4. Export PYTHONPATH

Add the models folder on your machine to PYTHONPATH. Be sure to replace /datasets/models with your folder path.

export PYTHONPATH="$PYTHONPATH:/datasets/models"
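A quick way to confirm the export worked is to ask Python whether it can locate the package. This is a small sketch using only the standard library; it assumes the top-level package in the cloned models folder is named `official` (inferred from the `official/requirements.txt` path used in the next step), and `is_importable` is a helper invented for this check.

```python
import importlib.util

def is_importable(package_name):
    """Return True if Python can locate `package_name` on the current sys.path."""
    return importlib.util.find_spec(package_name) is not None

# "official" is assumed to be the top-level package inside the models folder.
print("official importable:", is_importable("official"))
```

If this prints `False`, check that the path in the export matches the folder you cloned the models repository into.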

5. Install Dependencies

pip install --user -r official/requirements.txt

6. Run the training script imagenet_main.py

python imagenet_main.py --data_dir=/data/imagenet/train --num_gpus=2 --batch_size=64 --resnet_size=50 --model_dir=/data/imagenet/trained_model/Resnet50_bs64 --train_epochs=120
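Each flag has to be written as `--name=value` with no space after the equals sign, otherwise the shell passes the value as a separate argument. If you launch runs from Python, one way to avoid that mistake is to assemble the command from a dictionary. This is a sketch only: `build_command` is a hypothetical helper, and the flag names and values are copied from the command above rather than checked against the current version of the models repository.

```python
import shlex

def build_command(script, flags):
    """Return an argv list with each flag written as --name=value (no spaces)."""
    return ["python", script] + [f"--{name}={value}" for name, value in flags.items()]

# Flag names and values taken from the training command in this article.
flags = {
    "data_dir": "/data/imagenet/train",
    "num_gpus": 2,
    "batch_size": 64,
    "resnet_size": 50,
    "model_dir": "/data/imagenet/trained_model/Resnet50_bs64",
    "train_epochs": 120,
}

command = build_command("imagenet_main.py", flags)
print(shlex.join(command))

# To actually launch the run, pass the list to subprocess.run(command, check=True).
```

Keeping the settings in one dictionary also makes it easier to record which values were used for each experiment.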

Scaling ResNet on TensorFlow with MissingLink

In this article, we learned the basics of ResNet and saw two ways to run TensorFlow ResNet:

 

  • Using a pre-trained model and transfer learning
  • Training ResNet from scratch

 

Training ResNet is extremely computationally intensive, especially when working with a large number of layers. Don’t wait hours or days for ResNet to train. Use the MissingLink deep learning platform to:

 

  • Scale out ResNet automatically across numerous machines, either on-premise or in the cloud.
  • Define a cluster of machines and automatically run deep learning jobs, with optimal resource utilization.
  • Avoid idle time by immediately running experiments one after the other, and shutting down cloud machines cleanly when the jobs are complete.

 

 

MissingLink can also help you manage large numbers of experiments, track and share results, and manage large datasets and sync them easily to training machines.

Learn more about the MissingLink deep learning platform.

