September 21, 2018 | Joe Salomon

Using Data Locality for Deep Learning in AWS

Training complex models and prototyping new models are resource-intensive tasks. Therefore, it is critical to know which options in AWS can help reduce training times and to understand the tradeoffs of each. Due to the iterative nature in which the data is processed (with the same batches of data accessed repeatedly), it’s critical to reduce or eliminate as many performance bottlenecks as possible, including network latency and storage I/O time and throughput. The way to achieve this is by processing data locally with the computation host—in other words, ensuring data locality.

When it comes to processing workloads with GPU-based compute instances, data locality becomes even more crucial. GPUs support massive parallelization of data computation and often prioritize higher memory bandwidth over faster CPUs. This allows instances to operate on larger sets of data more efficiently. Data local to the processing instance eliminates the network bottleneck and keeps GPUs more active, which ensures lower batch completion times.

In AWS, storage services such as Amazon EFS and Amazon S3 hold data remotely, and Amazon EC2 compute nodes access it over AWS’s high-speed network. Technically, Amazon EBS also sits on the network: it presents block-level storage devices to Amazon EC2 through a SAN-style infrastructure. However, its latency is low enough that EBS is generally treated as “local” to the Amazon EC2 instance. In the past, legacy instance types also included ephemeral instance store disks, which were physically attached to the underlying hosts. Today, this practice of attaching high-speed disks is reserved for specific instance types, typically to take advantage of NVMe (Non-Volatile Memory Express) solid-state drives.

This article will cover in depth the storage options in AWS, as well as what you should consider when processing workloads for data locality on Amazon EC2 instance types.

Deep Learning in the Cloud

Deep Learning involves training models over thousands of iterations to produce the most accurate model. If a full set of training samples consisted of only 1TB of data, for example, processing 10 epochs (full iterations of the training set) would require 10TB of I/O to be performed. For computer vision use cases, processing high-resolution images means that input dataset sizes are very large. Removing network latency from the read path gives the best I/O performance for source data and greatly reduces batch processing times. As dataset size increases, it is important either to size storage to handle that load at scale or to scale the processing out across nodes. For instance, Amazon EBS with sustained performance disks (known as Provisioned IOPS) still has maximums per volume and per instance. If a workload cannot be processed in a reasonable time within those limits, scaling horizontally is an effective way to lower processing time.
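The epoch arithmetic above can be turned into a rough read-time estimate. The sketch below is illustrative only: the function name is hypothetical, it assumes reads are the sole bottleneck, and it assumes the volume sustains its full rated throughput for the whole run. The 160 and 500 MiB p/s figures are the GP2 and PIOPS maximums listed in the comparison table in this article.

```python
from typing import Tuple


def epoch_io_estimate(dataset_tb: float, epochs: int,
                      throughput_mib_s: float) -> Tuple[float, float]:
    """Return (total TB read, hours spent reading) for a training run.

    Assumes every epoch reads the full dataset once and that the storage
    device sustains `throughput_mib_s` for the whole run.
    """
    total_tb = dataset_tb * epochs
    total_mib = total_tb * 1e12 / 2**20  # decimal TB converted to MiB
    hours = total_mib / throughput_mib_s / 3600
    return total_tb, hours


# 1TB dataset, 10 epochs, at the GP2 and PIOPS throughput maximums
for label, mib_s in (("GP2", 160.0), ("PIOPS", 500.0)):
    total, hours = epoch_io_estimate(1.0, 10, mib_s)
    print(f"{label}: {total:.0f}TB read in about {hours:.1f} hours")
```

Under these assumptions, the 10TB of reads takes roughly 16.6 hours at 160 MiB p/s and about 5.3 hours at 500 MiB p/s, which is why storage throughput deserves as much attention as GPU count.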

Ultimately, there are tradeoffs between data locality and centrally managed (network) storage. Let’s dive into some of the pros and cons of local and network storage types and the few ideal instance types.

High-Performance Amazon EBS

The advantage of populating or copying data to Amazon EBS volumes is data locality, along with the performance gained from processing batches locally. The disadvantages are the data priming time (copying the data locally) and the fact that scaling horizontally will likely require dataset partitioning (which typically necessitates a management system to track and maintain the data partitions).

Amazon EC2 Instance Storage

Older instance types, such as the M1 or C1 series, included instance storage for free and were the best high-speed options at the time. This storage was directly attached to the underlying hypervisors and provided SSD speeds. Amazon moved away from attaching instance storage to all series types and now utilizes instance storage in more focused workloads, such as the I3 series, covered later.

Amazon EBS General Purpose SSD (GP2)

GP2 is an excellent choice for standard usage, providing a baseline of 3 IOPS per 1GB of provisioned space. GP2 also offers burst credits that temporarily raise performance to roughly 10x the baseline IOPS rate. It is not recommended for sustained workloads: if your batch processing outruns your credit balance, your IOPS drop back to the flat baseline of 3 IOPS per GB. For spiky processing, this can be an inexpensive and effective option.
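As a rough illustration of the baseline rule, the sketch below computes the sustained IOPS a GP2 volume of a given size would provide, and the size needed to hit a target. The function names are hypothetical. The 3 IOPS per GB rate comes from the text above and the 10,000 IOPS ceiling from the comparison table in this article; the 100 IOPS floor for small volumes is an added assumption based on AWS's published GP2 behavior.

```python
import math

GP2_IOPS_PER_GB = 3     # baseline rate quoted in this article
GP2_MAX_IOPS = 10_000   # per-volume ceiling from the article's comparison table
GP2_MIN_IOPS = 100      # assumption: floor applied to very small volumes


def gp2_baseline_iops(size_gb: int) -> int:
    """Sustained (non-burst) IOPS for a GP2 volume of the given size."""
    return max(GP2_MIN_IOPS, min(size_gb * GP2_IOPS_PER_GB, GP2_MAX_IOPS))


def gp2_size_for_iops(target_iops: int) -> int:
    """Smallest GP2 volume (in GB) whose baseline meets the target."""
    if target_iops > GP2_MAX_IOPS:
        raise ValueError("target exceeds the GP2 per-volume ceiling")
    return max(1, math.ceil(target_iops / GP2_IOPS_PER_GB))


print(gp2_baseline_iops(500))   # sustained rate for a 500GB volume
print(gp2_size_for_iops(3000))  # GB needed for 3,000 sustained IOPS
```

Sizing a volume for its baseline rather than its burst rate is one way to keep long training runs from slowing down once credits are spent.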

Amazon EBS Provisioned IOPS (PIOPS or io1)

Provisioned IOPS provides 99% consistent performance at an increased cost. Individual volumes can reach up to 32k IOPS, while most instances cap at 80k IOPS in total. Striping multiple PIOPS volumes together with RAID0 can provide additional speed and is ideal for sustained batch workloads. This volume type is meant for big data, databases, high transaction rates, and heavy performance requirements.
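The interaction between the per-volume and per-instance limits can be sketched with a simple model. This is an idealized estimate, not a benchmark: it assumes RAID0 spreads I/O evenly across members and ignores stripe size, queue depth, and instance bandwidth. The function names are hypothetical; the 32k and 80k figures are the ones quoted above.

```python
import math
from typing import List

PIOPS_MAX_PER_VOLUME = 32_000  # per-volume ceiling quoted above
INSTANCE_MAX_IOPS = 80_000     # per-instance ceiling quoted above


def striped_iops(volume_iops: List[int],
                 instance_cap: int = INSTANCE_MAX_IOPS) -> int:
    """Idealized aggregate IOPS of a RAID0 stripe set.

    With even striping, the slowest member bounds what each volume
    contributes, and the instance limit bounds the total.
    """
    if not volume_iops:
        raise ValueError("at least one volume is required")
    slowest = min(volume_iops)
    return min(slowest * len(volume_iops), instance_cap)


def volumes_needed(target_iops: int,
                   per_volume: int = PIOPS_MAX_PER_VOLUME,
                   instance_cap: int = INSTANCE_MAX_IOPS) -> int:
    """Number of max-provisioned volumes needed to reach a target."""
    if target_iops > instance_cap:
        raise ValueError("target exceeds the instance limit; scale out instead")
    return math.ceil(target_iops / per_volume)


print(striped_iops([32_000, 32_000]))      # two volumes, below the instance cap
print(striped_iops([32_000] * 3))          # three volumes hit the instance cap
print(volumes_needed(80_000))
```

In this model a third volume adds only 16k IOPS before the instance limit takes over, so striping beyond three max-provisioned volumes buys capacity but not speed.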

Note: Magnetic, Cold HDD (sc1), and Throughput Optimized HDD (st1) volumes are not covered here, as they are low-performance. Additional AWS performance information can be found here.

Network Solutions

Data locality is great for performance, but workloads are typically copied from a network location. Amazon S3 and Amazon EFS provide object and NFS (POSIX)-style storage respectively. While Amazon EFS is far more performant than Amazon S3, it has lower durability and flexibility and comes at a high cost.

Amazon Simple Storage Service (Amazon S3)

Amazon S3 is an excellent resource for storing virtually unlimited amounts of data. It offers very low-cost storage options at multiple tiers, as well as the capability to lifecycle data into Amazon Glacier for cheap long-term storage. However, it is not a solution for serving datasets directly to high-performance training. The storage type is object-based, so services must be able to query and work with objects rather than with a POSIX or SMB-style file system. Also, storage access speeds are low and not intended for processing at a high rate. The primary use case for Amazon S3 is to store datasets in a low-cost, high-durability, high-availability location, and then to pull them to local storage for higher performance processing.

Amazon Elastic File System (EFS)

Amazon EFS can create virtual “file systems” that can be mounted with NFS. There are performance tips for using Amazon EFS, but considering its performance limits, it is best used as another type of seeding repository.

If it is more important to run batch processing against a central location, Amazon EFS is a better option than Amazon S3, but Amazon EFS has many inherent limitations. For instance, backing up Amazon EFS is your responsibility: you must duplicate the data yourself, either with rsync or with data transformation jobs (there is no snapshotting). Also, Amazon EFS is not cheap storage, starting at 3x the cost of Amazon EBS. Even though it boasts a maximum throughput of up to 3GB p/s, reaching that rate would be incredibly expensive, as performance starts closer to 5MB p/s. Its best use case is as a file server rather than for high-performance dataset processing, since it is high cost, lower performance, and suffers when accessed for parallel processing.

Storage | Max IOPS | Max Throughput | Cost | Size Limit | Use Cases
GP2 (Amazon EBS) | 10,000 | 160 MiB p/s | $0.10 per GB p/m | 16T | Low cost, decent performance, good for data locality
PIOPS (Amazon EBS) | 32,000 | 500 MiB p/s | $0.125 per GB p/m + $0.065 per IOP p/m | 16T | Moderate cost, high performance, excellent for data locality
Amazon S3 | - | 5GB p/s | $0.01 – $0.023 per GB p/m | *Unlimited | Storage/archival
Amazon EFS | - | 3GB p/s (only in some regions) | $0.30 per GB p/m (additional for Provisioned Throughput) | **Scales but many limits | File services, copying data down to local EBS
* By default, each account can create 100 Amazon S3 buckets. Buckets have no fixed capacity limit, although a single object is capped at 5TB.
** A detailed list of Amazon EFS storage limits can be found here.
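To compare the options on cost, the prices in the table can be plugged into a small calculator. This is a sketch using only the per-GB and per-IOPS figures listed above (with the upper S3 price); the function and key names are hypothetical, and real bills also include request, transfer, and throughput charges that are ignored here.

```python
# Per-GB monthly prices taken from the table above (S3 at its upper figure)
PRICE_PER_GB_MONTH = {
    "gp2": 0.10,
    "io1": 0.125,
    "s3": 0.023,
    "efs": 0.30,
}
IO1_PRICE_PER_IOPS_MONTH = 0.065  # PIOPS surcharge per provisioned IOPS


def monthly_cost(storage: str, size_gb: float, provisioned_iops: int = 0) -> float:
    """Approximate monthly storage cost in USD for one dataset copy."""
    cost = PRICE_PER_GB_MONTH[storage] * size_gb
    if storage == "io1":
        cost += IO1_PRICE_PER_IOPS_MONTH * provisioned_iops
    return round(cost, 2)


# A 1,000GB dataset on each option; io1 provisioned at 10,000 IOPS
for name in ("gp2", "io1", "s3", "efs"):
    iops = 10_000 if name == "io1" else 0
    print(name, monthly_cost(name, 1000, provisioned_iops=iops))
```

For a 1,000GB dataset this works out to about $100 on GP2, $775 on PIOPS at 10,000 provisioned IOPS, $23 on Amazon S3, and $300 on Amazon EFS per month. On PIOPS, the IOPS charge rather than the capacity charge dominates the bill.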

Amazon EC2 instance memory can be used to create an in-memory cache using tools like vmtouch, which can pin a set of files into the filesystem cache on Linux. By sharding the input data and using a distributed library such as PyTorch, GPU-based Deep Learning workloads can be scaled horizontally using data parallelism, while still retaining the benefits of data locality. This comes at no additional cost but is constrained by instance memory limits. For very large workloads that do not fit in cache, using Amazon EBS is a better option.
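The sharding idea can be sketched in plain Python. The helper names and file names below are hypothetical, and the 20% memory headroom is an arbitrary assumption; a real pipeline would more likely rely on the sharding built into its training framework (PyTorch, for example, ships a DistributedSampler for this purpose).

```python
from typing import Dict, List


def shard_files(files: List[str], num_workers: int) -> List[List[str]]:
    """Assign files to workers round-robin so each node caches only its shard.

    Sorting first makes the assignment deterministic across nodes.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    shards: List[List[str]] = [[] for _ in range(num_workers)]
    for index, path in enumerate(sorted(files)):
        shards[index % num_workers].append(path)
    return shards


def fits_in_cache(shard: List[str], sizes_bytes: Dict[str, int],
                  memory_bytes: int, headroom: float = 0.2) -> bool:
    """Check whether a shard fits in instance memory, leaving some headroom
    for the training process itself (the 20% default is an assumption)."""
    needed = sum(sizes_bytes[path] for path in shard)
    return needed <= memory_bytes * (1.0 - headroom)


# Hypothetical dataset files split across two workers
files = ["img_c.rec", "img_a.rec", "img_e.rec", "img_b.rec", "img_d.rec"]
shards = shard_files(files, num_workers=2)
print(shards)
```

Each worker would then pin only its own shard into the filesystem cache, and any shard that fails the memory check is a candidate for Amazon EBS instead.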

Amazon EC2 Instance Types

Choosing the right Amazon EC2 instance means weighing many factors, including GPU, CPU, memory, storage speed, networking speed, and cost. Most instance families are purpose-specific, such as the C series (which is CPU-optimized) and the R series (for memory-based applications). Although there are over a dozen instance types, many are poorly suited to Deep Learning because they do not strike the right trade-off between storage performance, GPUs, and cost. The following three instance types specialize in extreme high-speed storage, extreme GPU processing, or a blend in the middle at a reasonable cost.

Amazon EC2 I3 Series Instances: Extreme Storage I/O

The Amazon EC2 I3 series provides high-performance instances with NVMe storage optimized for extreme IOPS requirements. AWS boasts speeds of up to 3.3M IOPS and storage up to 15TB per instance. These instances come with a very high cost per hour, so they are best suited for on-demand processing followed by termination. This instance type is primarily for batch data processing that does not rely on GPU-style Deep Learning, but it is worth mentioning due to the pure IOPS power of NVMe. It is excellent for high-transaction, low-latency processing.

Amazon EC2 P3 Series Instances: Extreme GPU

The Amazon EC2 P3 series instances are GPU-optimized for Deep Learning and are the better choice, offering up to 8 Tesla V100 GPUs, 128GB of GPU memory, 64 CPUs, and 488GiB of RAM. The Amazon EC2 P3 series is ideal for parallel processing, takes advantage of Amazon’s Elastic Network Adapter (ENA), and is designed to scale out to handle scientific workloads, modeling, machine learning, and extreme computational batches. These servers are the most suited for computer vision training, but the per-hour rate is very high. To keep costs from ballooning, provision instances for a training run and terminate them when it finishes.

Amazon EC2 M5 Series Instances: Balanced

Amazon’s M series, its general-purpose compute family, is typically a good choice for varying or unpredictable workloads. The series strikes a balance between cost, performance, and options. The M5 series deserves special mention for data locality processing since the instances are fairly inexpensive and can be launched with NVMe SSD instance storage attached (the M5d variants), which can be loaded with modeling data. The largest instance type boasts 384GiB of RAM, 96 CPUs, and 3.5TB of NVMe storage. It’s a perfect middle-road platform choice.

Conclusion

As discussed in this article, data locality is vital for improving Deep Learning performance. AWS offers multiple options that vary in speed, size, cost, and capabilities. As shown, for optimal Deep Learning, using Amazon EC2 P3 instances with PIOPS volumes provides the greatest capability of achieving high-speed data locality. This configuration allows for heavy computational and scalable power to process large datasets in AWS.

Joe Salomon
VP Product