X Datasets, Y Labels, Z Versions, ONE Shared Data Storage, NO Duplicates

We just shipped a feature called “Shared Data Storage” and it’s pretty useful if you find yourself attaching different metadata to the same underlying data, or saving multiple datasets with a lot of overlapping content.

For example, imagine storing 100,000 chest x-ray images weighing in at 43 GB zipped. That number isn’t a big deal today considering a 1 TB drive costs about $100. But storing this amount of data efficiently can be challenging when you consider that a whole team needs access to it, that the data changes over time, and that you’d like to keep track of these changes. In a computer vision company, datasets are always evolving, with data streaming in from production, labeling and analysis. Companies that aren’t yet tracking their dataset mutations have reason to worry: regulation is coming. In the medical, industrial, flight and automotive fields, companies will soon be required by law to track the evolution of their datasets.

While many engineers and data scientists agree it’s valuable to version control data, it is often regarded as impractical for large numbers of changes in large datasets. This post explains the state-of-the-art techniques in version control used to tackle these challenges. For a more detailed how-to, check out our recent post on data volumes.

Data Evolution Tracking Causes Storage Explosion

If your dataset is a 1 GB folder full of files, which is a common occurrence in computer vision, and you want to modify a small 1 KB file in it, you might find yourself copying the entire folder and renaming it to “version 2”. Now you have 2 GB, and you’ll experience this growth for every version or copy of your dataset. While this growth is linear, it becomes unwieldy when the number of versions is high or the folder is large. Still, this is arguably the most widely employed technique for version control. Some improvements to this scheme can be achieved using soft and hard links, which allow multiple files on a drive to share a single location on a storage system, but they can get difficult to manage and to communicate across different operating systems.
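
To put numbers on that growth, here is a back-of-the-envelope sketch. It assumes every new version changes exactly one 1 KB file, and that a deduplicating store would only need to keep the changed file; the function names are made up for this example.

```python
GB = 10**9
KB = 10**3


def copy_per_version_bytes(dataset_bytes: int, versions: int) -> int:
    """Total storage when every version is a full copy of the folder."""
    return dataset_bytes * versions


def changed_files_only_bytes(dataset_bytes: int, versions: int,
                             changed_bytes_per_version: int) -> int:
    """Total storage when each new version stores only the files that changed."""
    return dataset_bytes + (versions - 1) * changed_bytes_per_version


print(copy_per_version_bytes(1 * GB, 100))            # 100000000000 (100 GB)
print(changed_files_only_bytes(1 * GB, 100, 1 * KB))  # 1000099000 (about 1 GB)
```

With 100 versions the full-copy approach needs roughly a hundred times the space, which is the gap the rest of this post is about closing.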

The Data Structure from Version Control Heaven – Content Addressable Storage (CAS)

In systems where changes are small and a large amount of content is shared between versions, the current state-of-the-art solution is Content Addressable Storage (CAS). This is the technology underlying:

  • Git, the most popular version control system.
  • Blizzard Entertainment’s game storage system.
  • Dropbox cloud storage.
  • Many other version control and asset management systems.

The idea is simple. To store a file:

  1. Calculate the hash of the file’s content.
  2. Store the file in the location of that hash.

For example, if the hash of image1.jpg is ea5becb579edd9d14dc5902024cf0d92, then the contents of that file would be stored at ./ea5becb579edd9d14dc5902024cf0d92. Often the path is sliced into folders to avoid a large number of files in a single directory, so the actual location would be ./ea/5b/ecb579edd9d14dc5902024cf0d92.
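
The two steps can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product implements it: the function names are invented here, and SHA-256 is an assumed choice of hash, since the idea works with any content hash.

```python
import hashlib
import tempfile
from pathlib import Path


def cas_path(root: Path, digest: str) -> Path:
    """Slice the digest into two folder levels: ab/cd/rest-of-digest."""
    return root / digest[:2] / digest[2:4] / digest[4:]


def store(root: Path, content: bytes) -> str:
    """Store content at the location named by its hash; return the hex digest."""
    digest = hashlib.sha256(content).hexdigest()
    target = cas_path(root, digest)
    if not target.exists():  # identical content is only ever written once
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
    return digest


def load(root: Path, digest: str) -> bytes:
    """Read content back using nothing but its hash."""
    return cas_path(root, digest).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = store(root, b"pixel data")
    second = store(root, b"pixel data")  # same bytes, e.g. from a renamed copy
    print(first == second)                                 # True
    print(sum(1 for p in root.rglob("*") if p.is_file()))  # 1
```

Storing the same bytes twice yields the same digest and leaves a single file on disk.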

This has a few interesting properties. The original name of the file doesn’t matter; its storage location only changes if the contents change. No matter how many duplicate references to the same underlying content exist, the storage system will only store it once. Each version of the whole system is then essentially a list of pairs, each matching a file’s name to the hash of its content.
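
As a rough sketch of what such a version could look like, the snippet below keeps an in-memory dictionary as a stand-in for the storage and records each version as a list of (path, hash) pairs. The file names, the `commit` helper and the use of SHA-256 are all assumptions made for illustration.

```python
import hashlib

blobs: dict[str, bytes] = {}  # stand-in for the CAS: hash -> content


def commit(files: dict[str, bytes]) -> list[tuple[str, str]]:
    """Store every file by content hash and return the version as (path, hash) pairs."""
    version = []
    for path, content in sorted(files.items()):
        digest = hashlib.sha256(content).hexdigest()
        blobs.setdefault(digest, content)  # stored once, however often referenced
        version.append((path, digest))
    return version


v1 = commit({"scans/001.jpg": b"image one", "scans/002.jpg": b"image two"})
# Version 2 changes one file and adds a renamed duplicate of another.
v2 = commit({
    "scans/001.jpg": b"image one",
    "scans/002.jpg": b"image two, edited",
    "copies/first.jpg": b"image one",
})

print(len(v1) + len(v2))  # 5 file references across both versions
print(len(blobs))         # 3 distinct contents actually stored
```

Two versions with five file references between them cost only three stored blobs, because unchanged and duplicated files point at content that is already there.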

Storage can be reduced further by breaking large files into separate blocks and saving every hashed block separately. This is useful because the list of files itself can then be stored efficiently in content addressable block storage, with only small parts updated for each new version. But that adds complexity to the implementation, and many solutions only use full-file hashing.
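
Here is a small sketch of block-level hashing. It assumes fixed-size blocks, which is the simplest variant, and a deliberately tiny block size so the effect is visible; the helper names are invented for this example.

```python
import hashlib

BLOCK_SIZE = 4  # tiny on purpose; real systems use much larger blocks

blocks: dict[str, bytes] = {}  # stand-in for block storage: hash -> block


def store_blocks(content: bytes) -> list[str]:
    """Split content into fixed-size blocks, store each by hash, return the hash list."""
    recipe = []
    for start in range(0, len(content), BLOCK_SIZE):
        block = content[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        blocks.setdefault(digest, block)
        recipe.append(digest)
    return recipe


def restore(recipe: list[str]) -> bytes:
    """Rebuild the original content by concatenating its blocks in order."""
    return b"".join(blocks[digest] for digest in recipe)


old = store_blocks(b"AAAABBBBCCCCDDDD")
new = store_blocks(b"AAAABBBBXXXXDDDD")  # only the third block differs

print(restore(new) == b"AAAABBBBXXXXDDDD")  # True
print(len(blocks))                          # 5 blocks stored for two 4-block files
```

One caveat with fixed-size blocks: inserting a byte near the start of a file shifts every later block, so none of them match anymore. That is part of the extra complexity mentioned above.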

Storing many versions and querying with MissingLink

Now that you understand the underlying idea – let’s see how you can utilize the wonders of content addressable storage with MissingLink. You can follow the docs to create a data volume which organizes your dataset files in content addressable storage on your own cloud or local storage. By using the MissingLink command line tool you get the benefits of such a system – efficient storage of many versions of many large files. Every version of data you add to a data volume will store its files in your CAS.

MissingLink data volumes also offer querying abilities based on the metadata you provide. For example, in a self-driving dataset you can query for all the images with pedestrians, ask for 10% of the data, and divide it into 60% for training, 20% for validation and 20% for testing. Our users utilize these queries for deep learning analysis and training.
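
To make the shape of such a query concrete, here is a plain-Python sketch of "filter, sample, split". It is not MissingLink’s query syntax or API; the records, field names and the `query` function are invented purely for illustration.

```python
import random

# Hypothetical metadata records; every fourth frame is tagged as containing a pedestrian.
records = [{"path": f"frames/{i:04d}.jpg", "pedestrian": i % 4 == 0} for i in range(1000)]


def query(records, predicate, sample_pct=100, split_pct=(60, 20, 20), seed=0):
    """Filter by metadata, keep a percentage, then divide into train/validation/test."""
    matched = [r for r in records if predicate(r)]
    random.Random(seed).shuffle(matched)  # fixed seed keeps the result reproducible
    picked = matched[: len(matched) * sample_pct // 100]
    train_end = len(picked) * split_pct[0] // 100
    val_end = train_end + len(picked) * split_pct[1] // 100
    return picked[:train_end], picked[train_end:val_end], picked[val_end:]


train, val, test = query(records, lambda r: r["pedestrian"], sample_pct=10)
print(len(train), len(val), len(test))  # 15 5 5
```

Of the 250 matching frames, 10% is 25, which splits into 15 for training, 5 for validation and 5 for testing.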

Different Datasets With The Same Data

Sometimes a single image could be flipped, labeled differently, or have a different subset of information attached to it for the purpose of experimentation. MissingLink supports this scenario of multiple different data volumes, each with its own distinct metadata to query. To avoid double spending, we also support sharing the underlying data storage. This feature is called Shared Data Storage. To use it, create another data volume and, instead of choosing a bucket for it to rely on for storage, choose the “Shared storage with existing data volume” option, which lets you choose a “parent” volume in which this volume’s data will be persisted.

The end result is one CAS which contains any number of completely separate data volumes.
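
Conceptually, and only as a toy model rather than a description of MissingLink’s internals, you can picture each volume as its own metadata index on top of a blob store that a child volume borrows from its parent. The `Volume` class and its methods below are hypothetical.

```python
import hashlib


class Volume:
    """Toy data volume: private metadata, optionally sharing a parent's blob store."""

    def __init__(self, name, parent=None):
        self.name = name
        # A child volume reuses its parent's storage instead of creating its own.
        self.blobs = parent.blobs if parent is not None else {}
        self.metadata = {}  # hash -> metadata, never shared between volumes

    def add(self, content: bytes, **metadata) -> str:
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)  # no second copy if already stored
        self.metadata[digest] = metadata
        return digest

    def find(self, **criteria):
        """Return the hashes whose metadata matches every given key/value."""
        return [d for d, m in self.metadata.items()
                if all(m.get(k) == v for k, v in criteria.items())]


originals = Volume("originals")
relabeled = Volume("provider-b-labels", parent=originals)

originals.add(b"image bytes", label="cat")
relabeled.add(b"image bytes", label="dog")  # same image, different label

print(len(originals.blobs))               # 1 blob stored for both volumes
print(len(originals.find(label="dog")))   # 0, metadata stays separate
print(len(relabeled.find(label="dog")))   # 1
```

The image is stored once, while each volume answers queries only from its own metadata.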

Use case

We built this feature for a customer that wanted to have one data volume with their original images, another data volume with a derivative set of images, and a third data volume with all of the original images, derivative images and resulting output data files. The value of having every single artifact under the sun in one data volume is that analyzing them is one query away, much like a denormalized SQL database. Another potential use case would be to isolate datasets and labeling sets when a company needs to, for example, evaluate different labeling providers. In both of these use cases there would be a lot of duplication of the original data files if we didn’t use a CAS.

Two datasets for the price of one

Overlapping data ends up in exactly one place on the cloud, so you get a copy for free. The metadata which MissingLink stores for you to query will help you navigate the datasets based on your needs, and all your files are neatly stored in content addressable and versioned storage. If you’re paying through the nose for cloud storage and would like to consolidate some duplicates, hit us up and request a demo.