
Query, Clone, and Stream 45 gigs of X-ray Images with MissingLink Data Volumes

In a previous post, we discussed how to create MissingLink Data Volumes and sync the ChestXray14 dataset. In this post, we're going to dig a bit deeper into the features that make Data Volumes instrumental for working with large and complex datasets. Let's take a look at a new Data Volume I created that contains 45 gigs of X-ray images and their corresponding labels.

As you can see, I synced two image folders at a time, and in the final sync I included the metadata for all of the images. Each time I synced new data to the volume, it was versioned, allowing me not only to see a record of how the dataset has changed over time, but also to work with a specific snapshot in time. Versioning alone only gets you so far, though; simply storing the data in a Data Volume isn't very helpful unless you can also get at exactly the pieces you need. That's why we created a unique query system to help you slice the data and access precisely what you need to run your experiments.

Working With Data Volume Queries

The MissingLink query syntax uses a subset of the Lucene Query Syntax, similar to the search syntax you'd find in something like Gmail. It's relatively easy to learn since the structure is straightforward. For example, a simple query looks like this:
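    Finding_Labels:Pneumonia

Here, Finding_Labels is one of the metadata fields synced alongside the images, and Pneumonia is the value we want to match.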

When you access a version of a Data Volume in the dashboard, you'll be presented with a command line at the top of the page where you can enter your own custom queries.

You can return all of the data of the current version by hitting the Run button or pressing Ctrl + Enter without a query. Each result will be an item and its associated metadata. Labels for the data are displayed as column headers, and clicking on an item will give you more detailed information.

But let’s say you need to find a specific subset of data, such as all of the X-ray images that contain the label for Pneumonia. Well, you can enter a query for Finding_Labels:Pneumonia and hit Ctrl + Enter. To make things easier, you’ll get auto-completion for labels as you begin typing them:

After running this query, we’ll get a new set of results:

That was just a simple example, though. Let's say we need something a bit more complicated: a sample of all female patients aged 18 to 55 with a single-finding value of false and fewer than five follow-ups.
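A query along these lines would do it; the field names below are illustrative rather than the exact ones from my metadata, since they depend on how the columns were named when they were synced:

    Patient_Gender:F AND Patient_Age:[18 TO 55] AND Single_Finding:false AND Follow_Up:[0 TO 4]

The range syntax follows standard Lucene, so [18 TO 55] is inclusive and [0 TO 4] covers fewer than five follow-ups.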

At this point, you can begin to see the real power behind the query system, but this is just the beginning. The real fun happens when you want to split the data into train, test, and validation subsets; you can add @split:0.6:0.2:0.2 to any query to do that. You can also alter the sample size by adding @sample:0.1.
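For instance, appending both directives to the query above (with the same illustrative field names):

    Patient_Gender:F AND Patient_Age:[18 TO 55] AND Single_Finding:false AND Follow_Up:[0 TO 4] @sample:0.1 @split:0.6:0.2:0.2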

Notice how we now have 1,419 results instead of the 13,814 from the previous query. These are just a few examples of how to leverage queries to create new datasets from an existing Data Volume. Now let's take a look at how to get this data out so we can run an experiment.

Cloning Data

MissingLink Data Volumes allow you to clone data to a new location based on your experiment’s needs. For example, let’s say you want to use one of the previous query examples to create a sub-dataset for local testing. The easiest way to do this is to click the icon to the left of the “Run” button to get a clone command.

This will copy the clone command to your clipboard; it looks like this:

As you can see, we are using the MissingLink CLI to perform a data clone and passing in the query we just used. The command also passes in a reference to the version of the data being used. This is important because you can use the same query on different versions of the dataset to compare experiment results.

When you run this on the command line, the data will be downloaded to your local drive at the destination folder path you defined. By default, it will be added to the root of the current project. For this clone example, I've changed the destination folder path to a temporary directory in my project.

While this is useful for one-off experiments, there are better ways of getting at the same data than cloning it to a local machine.

Streaming Data from a Data Volume With Iterators

Up until this point, we've been working with the data by hand. Having access to the query system to slice the data is a considerable step up from manually copying files around on your computer, but Data Volumes really shine when you use Iterators to stream the data directly to the experiment you are working on.

At a high level, a Data Iterator allows you to stream the data while the experiment is running. This can be the difference between waiting hours for local files to be processed and an experiment starting instantly, since only the data it needs is loaded. Even better, the requested data is cached locally, which can considerably speed up re-running an experiment. MissingLink supports Iterators for several frameworks, but we'll take a look at a Keras example.

To start out, we need to import and configure MissingLink.
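Here's a rough sketch of what that setup looks like. The volume ID and the version token at the end of the query are placeholders, and the @version notation is shown only to illustrate pinning the query to a snapshot:

    import missinglink

    # The Data Volume we want to stream from (placeholder ID).
    DATA_VOLUME_ID = 1234567890

    # The query to run against the volume. The version token at the end pins the
    # query to a specific snapshot of the data; both values here are placeholders.
    QUERY = 'Finding_Labels:Pneumonia @split:0.6:0.2:0.2 @version:a1b2c3'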

As you can see, we are simply importing the MissingLink SDK, setting a variable for the Data Volume we want to use, and saving the query we want to run. You'll also notice that the query contains the version number at the end; you can use the same query against different versions as you run concurrent experiments.

Up next, we need to create the callback itself:
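(The snippet below is a sketch; the exact keyword argument KerasCallback accepts for the project is an assumption, and the batch size is a placeholder.)

    # Batch size we'll use later when binding the data generator (placeholder value).
    BATCH_SIZE = 32

    # Create the MissingLink Keras callback. Passing the project identifier up front
    # means the SDK won't prompt us to pick a project in the terminal on every run.
    missinglink_callback = missinglink.KerasCallback(project='my-project-id')  # argument name is an assumption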

Here you can see we are defining a batch size, which we'll use later, and passing the project ID into the KerasCallback so we don't have to select the project from the terminal window each time we run the experiment, which is handy if you have multiple projects on MissingLink.

Once everything is configured, we can create a function to deserialize the data returned by the Iterator:
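(The callback's exact signature is an assumption; the idea is that the Iterator hands each item and its metadata to this function as the data streams in.)

    def deserialization_callback(file_name, file_data, metadata):
        # For this example, just print the file name, the size of the raw data,
        # and one of the image's metadata fields.
        print(file_name, len(file_data), metadata.get('Finding_Labels'))

        # A real callback would decode the image here and return the arrays the
        # model expects; for now, pass the raw data and metadata straight through.
        return file_data, metadata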

For this example, we are simply printing the file name, size, and some of the associated metadata as each image is retrieved. If this were a real callback, it would process the data and get it ready for the model.

The last part of the code generates and executes the iterators:
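(The method names here, bind_data_generator and flow, are assumptions about the SDK's iterator API; treat this as a sketch of the overall flow rather than the exact calls.)

    # Bind a data generator to the volume and query, using the deserialization
    # callback and batch size defined above.
    data_generator = missinglink_callback.bind_data_generator(
        DATA_VOLUME_ID, QUERY, deserialization_callback, batch_size=BATCH_SIZE)

    # Because the query includes @split, we get train, test, and validation generators.
    train_generator, test_generator, validation_generator = data_generator.flow()

    for generator in (train_generator, test_generator, validation_generator):
        for item in generator:
            print(item)
            break  # each generator streams indefinitely, so stop after the first item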

As you can see, we are defining the data generator and creating train, test, and validation generators from the query we described earlier. Once we have the iterators, we loop through them and print each item, breaking after the first iteration since the generators would otherwise continue indefinitely.

In just a few lines of code, we are now able to stream any query directly to the model and remove the bottleneck of managing the data locally. MissingLink will now manage loading each data point and its corresponding metadata, which in turn should significantly speed up the start time of your experiments. There is a lot more you can do with Iterators, especially as you customize them for your needs, which I'll cover in a future post.

What We’ve Learned

After uploading all of the ChestXray14 data to a MissingLink Data Volume, we explored how to access the data in the following ways:

  • Creating queries to slice the data.
  • Using queries to create train, test and validation datasets.
  • Cloning data to a local computer.
  • Streaming data from the Data Volume via an iterator.

In a future post, we'll dig into a real-world example of how to tie all of this together by leveraging MissingLink's Data Volumes, the ChestXray14 dataset, custom queries, and an iterator to train an actual model.