Deep Learning Frameworks

TensorFlow Distributed Training: Introduction and Tutorials

Introduction to TensorFlow Distributed Training

To run large deep learning models, or a large number of experiments, you will need to distribute them across multiple CPUs, GPUs or machines. TensorFlow can help you distribute your training by splitting models over many devices and carrying out the training in parallel on the devices.


Types of Parallelism in Distributed Deep Learning

There are two main approaches to distributing deep learning model training: model parallelism and data parallelism. Sometimes one approach alone yields better performance, while at other times a combination of the two improves it further.


Model parallelism

In model parallelism, the model is segmented into different parts so they can run in parallel, with each part running on the same data on a different node. This approach can reduce communication between workers, as workers only need to synchronize the shared parameters.


Model parallelism also works well for GPUs in a single server that share a high-speed bus, and for larger models, since per-node hardware constraints are no longer a limitation.
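As an illustration, here is a minimal pure-Python sketch of the idea (no TensorFlow; the toy "layers" and weights are made up): a two-layer model is split so that each simulated device holds only its own layer's parameters, and the activation is handed off between them.

```python
# Minimal pure-Python sketch of model parallelism: a two-layer model is
# split across two simulated devices, each holding only its own layer's
# parameters.

def layer(weights, inputs):
    """A toy 'layer': a dot product of weights and inputs."""
    return sum(w * x for w, x in zip(weights, inputs))

# Device 0 holds layer 1's weights; device 1 holds layer 2's weights.
device0_weights = [0.5, -1.0, 2.0]
device1_weights = [3.0]

def forward(inputs):
    # Computed on device 0:
    hidden = layer(device0_weights, inputs)
    # The activation is sent over the interconnect to device 1,
    # which computes the final output:
    return layer(device1_weights, [hidden])

print(forward([1.0, 2.0, 3.0]))  # 0.5 - 2.0 + 6.0 = 4.5, then 4.5 * 3.0 = 13.5
```

In a real framework the hand-off between devices happens over a hardware interconnect, which is why a shared high-speed bus inside one server matters.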


Data parallelism

In this mode, the training data is divided into multiple subsets, and each subset runs through the same replicated model on a different node. Because the prediction errors are computed independently on each machine, you must synchronize the model parameters at the end of each batch computation to ensure every replica is training a consistent model.


Each device must therefore communicate its parameter updates to the model replicas on all of the other devices.
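A pure-Python sketch (toy one-parameter model and made-up data, no framework) shows why this synchronization matters: replicas that average their gradients before every update stay identical, while replicas that skip the synchronization drift apart.

```python
# Two replicas of a one-parameter model y = w * x train on different data
# subsets. Averaging gradients after each batch keeps the replicas
# identical; skipping the sync lets them diverge.

def grad(w, batch):
    """Mean gradient of squared error 0.5 * (w*x - y)**2 over a batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

subsets = [[(1.0, 2.0)], [(4.0, 8.0)]]   # both subsets drawn from y = 2x
lr = 0.1

# Without synchronization: each replica applies only its own gradient.
unsynced = [0.0, 0.0]
for _ in range(10):
    unsynced = [w - lr * grad(w, s) for w, s in zip(unsynced, subsets)]

# With synchronization: gradients are averaged before every update.
synced = [0.0, 0.0]
for _ in range(10):
    avg = sum(grad(w, s) for w, s in zip(synced, subsets)) / 2
    synced = [w - lr * avg for w in synced]

print(unsynced[0] != unsynced[1])  # True: unsynced replicas diverge
print(synced[0] == synced[1])      # True: synced replicas stay identical
```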


Distributed Training in TensorFlow

Some large neural network models cannot fit in the memory of a single GPU device. Such models must be split across multiple devices, with training carried out in parallel on those devices.


TensorFlow is designed to run on multiple machines to distribute training workloads. Distributed TensorFlow can also reduce experiment time by running many experiments in parallel on many GPUs and servers.


Distributed TensorFlow using tf.distribute.Strategy

The DistributionStrategy API is an easy way to distribute training workloads across multiple machines. It is designed to work with existing models and code, so users only need to make minimal changes to enable distributed training.


Supported types of distribution strategies


MirroredStrategy: Supports synchronous distributed training on multiple GPUs on one machine. It creates one replica per GPU device, and each variable in the model is mirrored across all the replicas. Together, these variables form a single conceptual variable called a MirroredVariable. MirroredVariables are kept in sync with each other by applying identical updates.


MultiWorkerMirroredStrategy: This is a version of MirroredStrategy for multi-worker training. It implements synchronous distributed training across multiple workers, each of which can have multiple GPUs. It creates copies of all variables in the model on each device across all workers.


ParameterServerStrategy: Supports parameter servers. It can be used for multi-GPU synchronous local training or asynchronous multi-machine training. When used to train locally on one machine, variables are not mirrored; instead, they are placed on the CPU, and operations are replicated across all local GPUs.


Make the Most of TensorFlow Multi-GPU Machines with MissingLink

When running distributed experiments on multi-GPU machines, it is very important to ensure high utilization. MissingLink can help you better utilize expensive GPU resources by scheduling multiple experiments, keeping track of TensorFlow hyperparameters and datasets, and showing exactly what works.



Quick Tutorial #1: Distribution Strategy API with Estimator

In this example, we will use the MirroredStrategy with the Estimator class to train and evaluate a simple TensorFlow model. This example is based on the official TensorFlow documentation.


def model_fn(features, labels, mode):
  # A single dense layer produces the model's output (logits).
  layer = tf.layers.Dense(1)
  logits = layer(features)

  if mode == tf.estimator.ModeKeys.PREDICT:
    predictions = {"logits": logits}
    return tf.estimator.EstimatorSpec(mode, predictions=predictions)

  # Mean squared error between the label and the scalar prediction.
  loss = tf.losses.mean_squared_error(
      labels=labels, predictions=tf.reshape(logits, []))

  if mode == tf.estimator.ModeKeys.EVAL:
    return tf.estimator.EstimatorSpec(mode, loss=loss)

  if mode == tf.estimator.ModeKeys.TRAIN:
    train_op = tf.train.GradientDescentOptimizer(0.2).minimize(loss)
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)


Define a simple input function to feed data for training this model.


def input_fn():
  features = tf.data.Dataset.from_tensors([[1.]]).repeat(100)
  labels = tf.data.Dataset.from_tensors(1.).repeat(100)
  return tf.data.Dataset.zip((features, labels))


Now that we have a model function and input function defined, we can define the estimator. To use MirroredStrategy, all we need to do is:


distribution = tf.contrib.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=distribution)
classifier = tf.estimator.Estimator(model_fn=model_fn, config=config)


This change will now configure Estimator to run on all GPUs on your machine.


Distributed TensorFlow using Horovod

Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet.

The data-parallel distributed training paradigm under Horovod is straightforward:


1. Run multiple copies of the training script and each copy:


  • Reads a chunk of the data
  • Runs it through the model
  • Computes model updates (gradients)


2. Average gradients among those multiple copies


3. Update the model


4. Repeat (from Step 1)
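The four steps above can be sketched in pure Python (toy one-parameter model and made-up data; this is not the Horovod API, just the loop it implements):

```python
# Several copies of the "training script" each read a chunk of data and
# compute a gradient; the gradients are averaged (the allreduce step) and
# every copy applies the same update, so all replicas stay in sync.

def grad(w, chunk):
    """Mean gradient of squared error for the toy model y = w * x."""
    return sum((w * x - y) * x for x, y in chunk) / len(chunk)

chunks = [[(1.0, 3.0), (2.0, 6.0)],    # worker 0's data (drawn from y = 3x)
          [(3.0, 9.0), (4.0, 12.0)]]   # worker 1's data
weights = [0.0, 0.0]                   # each worker's replica of w
lr = 0.05

for step in range(200):
    grads = [grad(w, c) for w, c in zip(weights, chunks)]  # step 1
    avg = sum(grads) / len(grads)                          # step 2: allreduce
    weights = [w - lr * avg for w in weights]              # step 3: update
                                                           # step 4: repeat

assert weights[0] == weights[1]        # replicas never diverge
print(round(weights[0], 3))            # converges toward 3.0
```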


Quick Tutorial #2: Use Horovod in TensorFlow with Estimators


This tutorial is based on an article by Jordi Torres.


1. Import Horovod:


import horovod.tensorflow as hvd


2. Horovod must be initialized before starting:


hvd.init()

3. Pin a server GPU to be used by this process


With the typical setup of one GPU per process, this can be set to local rank. In that case, the first process on the server will be allocated the first GPU, the second process will be allocated the second GPU and so forth.


config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    config=tf.estimator.RunConfig(session_config=config))


4. Wrap optimizer in hvd.DistributedOptimizer.


The distributed optimizer delegates gradient computation to the original optimizer, averages gradients using allreduce or allgather, and then applies those averaged gradients.
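The pattern can be sketched in pure Python; the class and method names below are illustrative stand-ins, not Horovod's actual API:

```python
# Sketch of the hvd.DistributedOptimizer pattern: the wrapper delegates
# the update rule to a base optimizer, averages gradients across workers
# (standing in for allreduce), then applies the averaged gradient.

class SGD:
    """A toy base optimizer with a plain gradient-descent update."""
    def __init__(self, lr):
        self.lr = lr

    def apply(self, w, g):
        return w - self.lr * g

class DistributedOptimizer:
    """Wraps a base optimizer; averages worker gradients before applying."""
    def __init__(self, base):
        self.base = base

    def step(self, w, worker_grads):
        avg = sum(worker_grads) / len(worker_grads)  # allreduce average
        return self.base.apply(w, avg)               # delegate the update

opt = DistributedOptimizer(SGD(lr=0.1))
w = opt.step(1.0, worker_grads=[4.0, 2.0])  # average gradient is 3.0
print(w)  # roughly 0.7, i.e. 1.0 - 0.1 * 3.0
```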


5. Add hvd.BroadcastGlobalVariablesHook(0)


This is necessary to ensure consistent initialization of all workers when training is started with random weights or restored from a checkpoint. Alternatively, if you’re not using MonitoredTrainingSession, you can simply execute the hvd.broadcast_global_variables op after global variables have been initialized.
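The broadcast step can also be sketched in pure Python; the function below is an illustrative stand-in for hvd.broadcast_global_variables(0), not the real API:

```python
# Before training begins, every worker overwrites its (randomly
# initialized or restored) weights with rank 0's copy, so all replicas
# start from identical state.

import random

num_workers = 4
# Each worker starts with its own random initialization.
weights = [[random.random() for _ in range(3)] for _ in range(num_workers)]

def broadcast_from_rank0(weights):
    """Stand-in for the broadcast op: copy rank 0's weights to every rank."""
    return [list(weights[0]) for _ in weights]

weights = broadcast_from_rank0(weights)
assert all(w == weights[0] for w in weights)  # all replicas now agree
```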


6. Modify code


Modify your code to save checkpoints only on worker 0 to prevent other workers from corrupting them.


# Save checkpoints only on worker 0; './checkpoints' is an example path.
model_dir = None if hvd.rank() != 0 else './checkpoints'


7. Run Horovod


To run on a machine with 4 GPUs, use mpirun to launch 4 copies of the training script (called train.py here):


mpirun -np 4 \
 -bind-to none -map-by slot \
 -mca pml ob1 -mca btl ^openib \
 python train.py


To run on 4 machines with 4 GPUs each (16 GPUs), list the hosts and their GPU counts with the -H option:


mpirun -np 16 \
 -H server1:4,server2:4,server3:4,server4:4 \
 -bind-to none -map-by slot \
 -mca pml ob1 -mca btl ^openib \
 python train.py


Improving GPU Efficiency for Deep Learning with MissingLink

MissingLink + TensorFlow: Make the Most of Distributed Training in TensorFlow

Start using MissingLink's platform to manage your deep learning experiments and create a more efficient multi-GPU cluster setup with little to no idle time.

Learn More About Deep Learning Frameworks