Important announcement: MissingLink has shut down.



TensorFlow Distributed Training: Introduction and Tutorials

Large-scale deep learning models take a long time to run and can benefit from distributing the work across multiple resources. TensorFlow can help you distribute training across multiple CPUs or GPUs.


We’ll explain how TensorFlow distributed training works and show brief tutorials to get you oriented. We’ll also show how to take distributed resources to the next level, scaling up experiments on hundreds of machines on and off the cloud with the MissingLink deep learning platform.

Introduction to TensorFlow Distributed Training

To run large deep learning models, or a large number of experiments, you will need to distribute them across multiple CPUs, GPUs or machines. TensorFlow can help you distribute your training by splitting models over many devices and carrying out the training in parallel on the devices.

Types of Parallelism in Distributed Deep Learning

There are two main approaches to distributing deep learning model training: model parallelism and data parallelism. Sometimes a single approach yields the best performance; sometimes a combination of the two works better.


Model parallelism

In model parallelism, the model is segmented into different parts so you can run it in parallel. You can run each part on the same data in different nodes. This approach may decrease the need for communication between workers, as workers only need to synchronize the shared parameters.


Model parallelism also works well for larger models, and for multiple GPUs in a single server that share a high-speed bus, since per-node hardware constraints are less of a limitation.
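To make the idea concrete, here is a minimal sketch of model parallelism in plain numpy, simulating two devices (it is an illustration of the concept, not a TensorFlow API):

```python
import numpy as np

# A sketch of model parallelism: a two-layer network whose layers live
# on different (simulated) devices. In a real setup the activations
# would cross a device or network boundary; here each "device" is just
# a dict holding that layer's weights.
rng = np.random.default_rng(0)
device0 = {"W": rng.standard_normal((4, 8))}   # layer 1 lives on device 0
device1 = {"W": rng.standard_normal((8, 2))}   # layer 2 lives on device 1

def forward(x):
    # Device 0 computes the first layer ...
    h = np.maximum(x @ device0["W"], 0.0)      # ReLU on device 0
    # ... then ships the activations (not the weights) to device 1.
    return h @ device1["W"]                    # output computed on device 1

batch = rng.standard_normal((3, 4))
out = forward(batch)
print(out.shape)  # (3, 2)
```

Only the activations travel between devices; each device keeps (and updates) its own slice of the parameters.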


Data parallelism

In this mode, the training data is divided into multiple subsets, and you run each subset through the same replicated model on a different node. Because the prediction errors are computed independently on each machine, you must synchronize the model parameters at the end of each batch computation to ensure all replicas are training a consistent model.


Each device must therefore communicate its parameter updates to the model replicas on all the other devices.
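The key property of data parallelism can be verified in a few lines of numpy: with equal-sized shards, the average of the per-worker gradients equals the gradient computed on the full batch (a sketch of the concept, not a TensorFlow API):

```python
import numpy as np

# Two workers each compute the gradient of a mean-squared-error loss on
# their own data shard; the averaged gradient matches what a single
# worker would compute on the full batch.
rng = np.random.default_rng(1)
w = rng.standard_normal(3)                  # replicated model parameters
X, y = rng.standard_normal((8, 3)), rng.standard_normal(8)

def grad(Xs, ys, w):
    # d/dw mean((Xs @ w - ys)^2) = 2 * Xs.T @ (Xs @ w - ys) / len(ys)
    return 2.0 * Xs.T @ (Xs @ w - ys) / len(ys)

# Each worker runs on its own shard of the batch ...
g0 = grad(X[:4], y[:4], w)
g1 = grad(X[4:], y[4:], w)
# ... and the replicas are synchronized with the averaged gradient.
g_avg = (g0 + g1) / 2.0
print(np.allclose(g_avg, grad(X, y, w)))  # True
```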


Distributed TensorFlow using tf.distribute.Strategy

The DistributionStrategy API is an easy way to distribute training workloads across multiple machines. It is designed to let users distribute their existing models and training code with only minimal changes.


Supported types of distribution strategies


MirroredStrategy: Supports synchronous distributed training on multiple GPUs on one machine. It creates one replica per GPU device. Each variable in the model is mirrored across all the replicas. Together, these variables form a single conceptual variable called MirroredVariable. MirroredVariables are kept in sync with each other by applying identical updates.


MultiWorkerMirroredStrategy: This is a version of MirroredStrategy for multi-worker training. It implements synchronous distributed training across multiple workers, each of which can have multiple GPUs. It creates copies of all variables in the model on each device across all workers.


ParameterServerStrategy: Supports training with parameter servers. It can be used for multi-GPU synchronous local training or asynchronous multi-machine training. When used to train locally on one machine, variables are not mirrored; instead, they are placed on the CPU and operations are replicated across all local GPUs.
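For the multi-machine strategies, each process discovers the cluster through the TF_CONFIG environment variable. A minimal sketch of that configuration follows; the hostnames and ports are placeholders for your own machines:

```python
import json
import os

# TF_CONFIG describes the cluster and this process's role in it.
# MultiWorkerMirroredStrategy reads it at startup.
tf_config = {
    "cluster": {
        # Placeholder addresses -- replace with your own worker hosts.
        "worker": ["worker0.example.com:12345", "worker1.example.com:12345"]
    },
    "task": {"type": "worker", "index": 0}  # this process is worker 0
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)
print(json.loads(os.environ["TF_CONFIG"])["task"]["index"])  # 0
```

Each worker gets the same "cluster" section but a different "task" index.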

Make the Most of TensorFlow Multi-GPU Machines with MissingLink

When running distributed experiments on multi-GPU machines, it is very important to ensure high utilization. MissingLink can help you better utilize expensive GPU resources by scheduling multiple experiments, keeping track of TensorFlow hyperparameters and datasets, and showing exactly what works.



Quick Tutorial #1: Distribution Strategy API with Estimator

In this example, we will use the MirroredStrategy with the Estimator class to train and evaluate a simple TensorFlow model. This example is based on the official TensorFlow documentation.

def model_fn(features, labels, mode):
  layer = tf.layers.Dense(1)
  logits = layer(features)

  if mode == tf.estimator.ModeKeys.PREDICT:
    predictions = {"logits": logits}
    return tf.estimator.EstimatorSpec(mode, predictions=predictions)

  loss = tf.losses.mean_squared_error(
      labels=labels, predictions=tf.reshape(logits, []))

  if mode == tf.estimator.ModeKeys.EVAL:
    return tf.estimator.EstimatorSpec(mode, loss=loss)

  if mode == tf.estimator.ModeKeys.TRAIN:
    train_op = tf.train.GradientDescentOptimizer(0.2).minimize(loss)
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)


Define a simple input function to feed data for training this model.


def input_fn():
  features = tf.data.Dataset.from_tensors([[1.]]).repeat(100)
  labels = tf.data.Dataset.from_tensors(1.).repeat(100)
  return tf.data.Dataset.zip((features, labels))


Now that we have a model function and input function defined, we can define the estimator. To use MirroredStrategy, all we need to do is:


distribution = tf.contrib.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=distribution)
classifier = tf.estimator.Estimator(model_fn=model_fn, config=config)


This change configures the Estimator to run on all GPUs on your machine. You can then call classifier.train(input_fn=input_fn) and classifier.evaluate(input_fn=input_fn) as usual.


Distributed TensorFlow using Horovod

Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet.

The data-parallel distributed training paradigm under Horovod is straightforward:


1. Run multiple copies of the training script; each copy:


  • Reads a chunk of the data
  • Runs it through the model
  • Computes model updates (gradients)


2. Average gradients among those multiple copies


3. Update the model


4. Repeat (from Step 1)
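The loop above can be simulated in plain Python (a conceptual sketch; real Horovod performs the averaging with an MPI/NCCL allreduce, not a local loop):

```python
import numpy as np

# Simulate Horovod's data-parallel loop: several copies of the
# "training script", each reading its own chunk of data, computing a
# gradient, then averaging gradients before every copy applies the
# same update.
rng = np.random.default_rng(2)
X = rng.standard_normal((64, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 0.01, 64)   # true slope is ~3.0
w = np.zeros(1)                                # replicated parameter
n_workers, lr = 4, 0.1

for step in range(100):
    # Step 1: each copy reads its chunk and computes a gradient.
    grads = []
    for rank in range(n_workers):
        Xs, ys = X[rank::n_workers], y[rank::n_workers]
        grads.append(2.0 * Xs.T @ (Xs @ w - ys) / len(ys))
    # Step 2: average the gradients among the copies (the "allreduce").
    g = np.mean(grads, axis=0)
    # Step 3: every copy applies the identical averaged update.
    w = w - lr * g
    # Step 4: repeat.

print(abs(float(w[0]) - 3.0) < 0.05)  # True: all replicas converge to ~3.0
```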


Quick Tutorial #2: Use Horovod in TensorFlow with Estimators

This tutorial is based on an article by Jordi Torres.


1. Import Horovod:

import horovod.tensorflow as hvd


2. Horovod must be initialized before starting:

hvd.init()

3. Pin a server GPU to be used by this process

With the typical setup of one GPU per process, this can be set to local rank. In that case, the first process on the server will be allocated the first GPU, the second process will be allocated the second GPU and so forth.

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    config=tf.estimator.RunConfig(session_config=config))


4. Wrap the optimizer in hvd.DistributedOptimizer:

opt = hvd.DistributedOptimizer(opt)

The distributed optimizer delegates gradient computation to the original optimizer, averages the gradients using allreduce or allgather, and then applies the averaged gradients.


5. Add hvd.BroadcastGlobalVariablesHook(0)

This is necessary to ensure consistent initialization of all workers when training is started with random weights or restored from a checkpoint. Alternatively, if you’re not using MonitoredTrainingSession, you can simply execute the hvd.broadcast_global_variables op after global variables have been initialized.
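What the broadcast accomplishes can be sketched in plain Python (a simulation of the idea, not the Horovod API): rank 0's initial state is copied to every other worker so that all replicas start identically.

```python
import numpy as np

# Simulate BroadcastGlobalVariablesHook(0): each worker starts with its
# own random initialization, then rank 0's weights are broadcast so all
# replicas begin training from identical state.
rng = np.random.default_rng(3)
workers = [{"w": rng.standard_normal(4)} for _ in range(4)]  # differing inits
root = workers[0]["w"].copy()        # rank 0 is the broadcast root
for wkr in workers[1:]:
    wkr["w"][:] = root               # overwrite with rank 0's weights
print(all(np.array_equal(wkr["w"], root) for wkr in workers))  # True
```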


6. Modify code

Modify your code to save checkpoints only on worker 0 to prevent other workers from corrupting them.

model_dir = './checkpoints' if hvd.rank() == 0 else None


7. Run Horovod

To run on a machine with 4 GPUs, use mpirun to launch the Python training script (here called train.py as a placeholder):

mpirun -np 4 \
 -bind-to none -map-by slot \
 -mca pml ob1 -mca btl ^openib \
 python train.py

To run on 4 machines with 4 GPUs each (16 GPUs total), add a host list (replace the server names with your own):

mpirun -np 16 \
 -H server1:4,server2:4,server3:4,server4:4 \
 -bind-to none -map-by slot \
 -mca pml ob1 -mca btl ^openib \
 python train.py


Train Deep Learning Models 20X Faster

Let us show you how you can:

  • Run experiments across hundreds of machines
  • Easily collaborate with your team on experiments
  • Reproduce experiments with one click
  • Save time and immediately understand what works and what doesn’t

MissingLink is the most comprehensive deep learning platform to manage experiments, data, and resources more frequently, at scale and with greater confidence.

Request your personal demo to start training models faster
