Most of the data scientists we speak to store their data in folders on the same computer they run experiments on. Either the folder names represent the labels or metadata is sprinkled throughout. While this works for smaller projects, it poses serious challenges when you try to scale. Using a subset of the data means copying files by hand while carefully maintaining the folder structure. Over time you’ll end up with duplicates of the original data, run into challenges training on a previous version, and struggle to share these one-off datasets. Even worse, if you’re copying these large datasets between hosted computers, you are also paying for the bandwidth. This is where MissingLink Data Volumes come in.
A Data Volume is a smart, immutable data lake that allows queryable data exploration, versioning, and automated curation. Data Volumes not only store large datasets ideal for computer vision experiments, they also provide a reproducible way of retrieving data via queries that ensure consistency across experiments. Because they are versioned, you can even access data from a specific point in time without fear of corruption when new data is added. Let’s take a look at how to set up a Data Volume with a large medical image dataset.
Working with an Example Dataset
For this post, we’ll be using a publicly available chest x-ray dataset called ChestXray14. The dataset was released in 2017 by the NIH Clinical Center. It includes data from over 30,000 patients, including many with advanced lung disease. Normally this kind of data is not available to the public, so it presents a unique opportunity to practice working with very large, realistic medical image datasets.
You can download the dataset from https://nihcc.app.box.com/v/ChestXray-NIHCC. The ChestXray14 dataset includes a folder of images, labels and bounding boxes.
We are going to focus on the following items:
- Images Folder – which contains 12 zipped files of x-ray images
- Data_Entry_2017.csv – a CSV file which contains the labels for each image
The bulk of the dataset consists of the 12 folders full of x-ray images. In the root of the images folder is a Python script to help download all of the images, but you can also download each file by hand, since the host will not allow you to download the entire project at once due to its size. After all of the image folders are unzipped, you’ll have about 45 GB of data comprising 112,120 individual images. With everything downloaded, we can begin to prepare the data for MissingLink’s Data Volume.
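The unpacking step can be scripted rather than done by hand. Here is a minimal sketch, assuming the downloaded archives all sit in one local folder. The function names and example paths are hypothetical, and `shutil.unpack_archive` is used because it handles both `.zip` and `.tar.gz` files, so the exact archive format does not matter.

```python
import shutil
from pathlib import Path

# Assumption: the x-ray images use one of these extensions.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def extract_archives(download_dir, output_dir):
    """Unpack every archive found in download_dir into its own subfolder
    of output_dir and return the names of the archives that were unpacked."""
    download_dir, output_dir = Path(download_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(download_dir.iterdir()):
        if not archive.is_file():
            continue
        # "images_001.tar.gz" -> "images_001"
        target = output_dir / archive.name.split(".")[0]
        try:
            shutil.unpack_archive(str(archive), str(target))
        except shutil.ReadError:
            # Not an archive format shutil recognizes (e.g. a CSV file); skip it.
            continue
        extracted.append(archive.name)
    return extracted


def summarize_images(root):
    """Return (number of image files, total size in bytes) under root."""
    files = [
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    ]
    return len(files), sum(p.stat().st_size for p in files)


# Example usage (hypothetical paths):
#   extract_archives("downloads/ChestXray14", "data/ChestXray14")
#   count, size = summarize_images("data/ChestXray14")
#   print(f"{count} images, {size / 1e9:.1f} GB")
```

Checking the image count and total size after extraction is a cheap way to confirm that no archive was missed before you sync anything.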
Creating a Data Volume
In the MissingLink.ai dashboard, there is a dedicated Data Volume tab on the left of the projects panel.
When you create a new Data Volume, the wizard will walk you through installing MissingLink’s CLI tools in your project. You can also set up a new Data Volume via the CLI by following these instructions.
You’ll want to follow these steps and authorize MissingLink so you can run the Data Volume sync command once everything is configured. It’s also important to note that your data never leaves your premises. When you perform a sync, the data is stored in the storage bucket you created during this process. MissingLink never has access to the data you are syncing.
Up next, you’ll need to configure the Data Volume itself. It will need a name, an optional description, and a storage location.
Data Volumes are versatile: they work with storage buckets from several common providers, such as AWS, Azure, and Google Cloud, and even with local storage.
After you have set up a storage bucket, it’s time to sync your data. The wizard will show you the command you need to run to get started. It consists of the ml data sync call, the ID of the Data Volume, and the path to the data you want to sync.
Let’s walk through how this is done.
Syncing data allows MissingLink’s SDK to copy the files to the storage bucket attached to the Data Volume. Each sync moves the files over to your storage bucket, indexes them, stages the new changes, and preserves the folder structure, which is useful if your code requires data to be organized in a specific way. When you run the command, you’ll see the progress of the sync in the terminal.
Once the data is synced, you’ll be able to view it before committing it to the Data Volume. This staging process lets you review what will change before you accept the sync, and it gives you the opportunity to add a comment clarifying what changes are taking place.
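When reviewing staged changes, it helps to know what you expected to upload. The sketch below is one way to build a local inventory (file count and total bytes per top-level folder) to compare against the staging view. It is plain Python and does not call any MissingLink API; the function name and example path are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def inventory(data_root):
    """Return {top-level folder: (file count, total bytes)} for data_root."""
    data_root = Path(data_root)
    totals = defaultdict(lambda: [0, 0])
    for path in data_root.rglob("*"):
        if not path.is_file():
            continue
        rel = path.relative_to(data_root)
        # Files sitting directly in data_root are grouped under ".".
        folder = rel.parts[0] if len(rel.parts) > 1 else "."
        totals[folder][0] += 1
        totals[folder][1] += path.stat().st_size
    return {folder: tuple(values) for folder, values in sorted(totals.items())}


# Example usage (hypothetical path):
#   for folder, (count, size) in inventory("data/ChestXray14").items():
#       print(f"{folder}: {count} files, {size / 1e6:.1f} MB")
```

If the counts in the staging view differ from this inventory, you can catch a partial or misdirected sync before committing a new version.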
When you are happy with the staged changes, commit them and a new version of your data will be created. Every time you perform a sync and commit, the new data is added to the Data Volume and is assigned a unique ID which you can use to query later on. Think of this like Git for your data. Over time you can easily track any changes to the data by viewing the Data Volume history.
At this point, we’ve only discussed uploading the images, but most datasets contain a lot of metadata that adds context to each data point.
While you can upload the images as is to the Data Volume, we want to convert the labels inside the Data_Entry_2017.csv file into JSON that describes each image. MissingLink can use these JSON data points to help query and filter the dataset later on. For example, the ChestX-ray dataset comprises 112,120 frontal-view X-ray images of 30,805 unique patients, with fourteen distinct disease labels covering common thoracic pathologies. The labels were text-mined from the associated radiological reports using natural language processing, and each image can carry multiple labels.
The circular diagram, taken from the paper “ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases”, shows the proportion of multi-label images in each of 8 pathology classes along with the labels’ co-occurrence statistics.
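Label frequencies and co-occurrence can also be computed directly from the label strings. This is a minimal sketch, assuming that an image with several findings stores them in one field separated by a `|` character (for example `Cardiomegaly|Effusion`). The function name is hypothetical, and the separator is a parameter in case your copy of the CSV differs.

```python
from collections import Counter
from itertools import combinations


def label_statistics(label_strings, separator="|"):
    """Count how often each label appears and how often each pair of
    labels appears together on the same image."""
    label_counts = Counter()
    pair_counts = Counter()
    for raw in label_strings:
        # De-duplicate and sort so each pair has one canonical ordering.
        labels = sorted({part.strip() for part in raw.split(separator) if part.strip()})
        label_counts.update(labels)
        pair_counts.update(combinations(labels, 2))
    return label_counts, pair_counts


# Example usage with a few made-up rows:
#   counts, pairs = label_statistics(["Cardiomegaly|Effusion", "Effusion", "No Finding"])
#   print(counts.most_common(3))
```

Sorting the labels before pairing them means `("Cardiomegaly", "Effusion")` and `("Effusion", "Cardiomegaly")` are counted as the same pair.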
All of this data is stored in a CSV file which contains 10 labels:
So our goal is to convert this data into a single JSON file for each image. To do this, I wrote a simple script, available here on GitHub, that processes the CSV file and outputs JSON files that MissingLink can use. Here’s what a sample JSON file looks like:
After you have generated the metadata for each of these data points, you can re-run the sync command, and MissingLink will process the JSON files and associate them with each image.
You can also do this as part of the initial sync if your data already has the associated meta.json files to go with each data point. For this example, I started with a few sets of raw images and then went back and added the metadata to illustrate Data Volume versioning. Sometimes you don’t have all of the data you need when getting started, or you get additional metadata back after the data has been cleaned up or annotated, so Data Volumes support either workflow.
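For readers who want to see the shape of the CSV-to-JSON conversion described earlier, here is a minimal sketch. It is not the script linked from this post. It assumes the CSV has an `Image Index` column holding each image’s file name, and the `.metadata.json` output suffix is a placeholder, so check which naming MissingLink expects before syncing.

```python
import csv
import json
from pathlib import Path


def csv_to_json_files(csv_path, output_dir, suffix=".metadata.json"):
    """Write one JSON file per CSV row, named after the row's image file.

    Assumptions: the "Image Index" column name and the default suffix are
    illustrative, not confirmed MissingLink conventions.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Drop empty keys and values, e.g. from a trailing comma in the header.
            record = {k.strip(): v.strip() for k, v in row.items() if k and v}
            image_name = record.pop("Image Index")
            target = output_dir / (image_name + suffix)
            target.write_text(json.dumps(record, indent=2), encoding="utf-8")
            written += 1
    return written


# Example usage (hypothetical paths):
#   csv_to_json_files("Data_Entry_2017.csv", "data/ChestXray14/images_001")
```

All values are kept as plain strings so the output stays a flat key-value document. Converting ages or IDs to numbers is a reasonable next step if your queries need numeric comparisons.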
At any point, you can view the contents of a Data Volume by selecting a version from the list and clicking the Run button to perform a query. This will return all of the items in the Data Volume. Here you can see the first sync I did with 100 images and no metadata.
MissingLink automatically adds a few labels based on what can be inferred from the data such as the path, version number, and size. But let’s take a look at the results after I synced the corresponding JSON metadata:
Here you can see that all of the labels from the CSV file have been correctly attached to each image and you have access to any point of data you may need.
This is just a quick example of how I was able to begin uploading some of the ChestXray14 dataset to my own Data Volume. I’d suggest syncing the images two folders at a time while you generate the metadata with the script I provided, then doing a final sync to complete the dataset in your new Data Volume.
What We’ve Learned
We looked at a publicly available dataset, ChestXray14, for reference on how to do the following with MissingLink Data Volumes:
- Create a new Data Volume to store the X-ray images.
- Add images to the Data Volume with the sync command.
- Stage, commit, and view Data Volume changes.
- Correctly correlate the images to corresponding metadata in the Data Volume.
In a future post, we’ll dig into MissingLink’s powerful query system, how to clone the data locally for testing, and how to create iterators to stream the data while running your experiments.
To learn more about MissingLink’s other services, please request a demo.