Syncing Data Points

Once you have created a data volume, you can sync data to it using the MissingLink CLI.

During a sync, MissingLink’s SDK copies the files to the storage bucket attached to the data volume, indexes them, stages the new changes, and preserves the folder structure. Preserving the structure is useful if your code requires data to be organized in a specific way.

Preparing for sync

A data point can consist of one or more files. For example, in the VOC2007 dataset, each data point can have up to four files: the raw image, an XML file with annotations, a segmentation JPG file, and a classification JPG file.

To query and filter the dataset using MissingLink, add a file that has the same name as the original file with a .metadata.json extension appended. This file contains the attributes on which you want to query the dataset and filter the data points. We refer to this file as queryable metadata.

For example, if you have a file named myfile.jpg, the queryable metadata file name will be myfile.jpg.metadata.json.

Note

  1. If your data point consists of more than one file, create a queryable metadata file for each of the files.
  2. Create the .metadata.json files in the same folder as your current dataset. Each JSON file contains a dictionary of attributes whose values can be of basic or complex types (for example, strings, numbers, or JSON objects).
  3. We recommend adding an attribute to the queryable metadata and assigning it the same value for all the files that constitute the data point. For examples that show how doing so can be useful, see the @group_by and @datapoint_by operators in Query Syntax.
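
The notes above can be sketched in Python. This is a minimal illustration and not part of the MissingLink SDK: the helper name write_metadata, the attribute names datapoint_id and kind, and the file names are all hypothetical. It writes one .metadata.json file next to each file of a data point and gives every file the same datapoint_id value so the files can later be grouped.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(data_file: Path, attributes: dict) -> Path:
    """Write <name>.metadata.json next to data_file and return its path."""
    # myfile.jpg -> myfile.jpg.metadata.json, in the same folder
    meta_path = data_file.with_name(data_file.name + ".metadata.json")
    meta_path.write_text(json.dumps(attributes, indent=2))
    return meta_path


# Hypothetical data point made of two files: an image and its annotations.
# A throwaway temp folder stands in for your dataset folder.
dataset_dir = Path(tempfile.mkdtemp())
files = [dataset_dir / "000001.jpg", dataset_dir / "000001.xml"]
for f in files:
    f.touch()

# "datapoint_id" is an assumed attribute name; any name works as long as
# every file of the data point carries the same value.
written = [
    write_metadata(f, {"datapoint_id": "000001", "kind": f.suffix.lstrip(".")})
    for f in files
]
print([p.name for p in written])
```

Each data file ends up with its own metadata file in the same folder, and the shared datapoint_id value is what an operator such as @datapoint_by could group on.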

Syncing data

Use the following CLI command:

ml data sync yourDataVolumeID --data-path pathToYourData
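
If you drive syncs from a script, the command above can be assembled programmatically. The sketch below only builds the argument list shown above; the function name build_sync_command is hypothetical, the values are placeholders, and actually running the result assumes the ml CLI is installed and on your PATH.

```python
import shlex
from typing import List


def build_sync_command(volume_id: str, data_path: str) -> List[str]:
    """Return the argv for: ml data sync <volume_id> --data-path <data_path>."""
    return ["ml", "data", "sync", str(volume_id), "--data-path", data_path]


# Placeholder values; substitute your own data volume ID and folder.
cmd = build_sync_command("yourDataVolumeID", "pathToYourData")
print(shlex.join(cmd))

# To actually run it (requires the MissingLink CLI to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems when the data path contains spaces.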

You can obtain the full command, including the ID of your particular data volume, from the Wizard screen in the MissingLink web dashboard.

To display and run the sync command:

  1. Select Wizard from the menu at the right end of the data volume to which you wish to sync data.

  2. A screen appears, displaying the full ml data sync command.

    Click the Copy Command icon.

  3. Paste the command at the command prompt and run it.

    MissingLink recursively adds every file found in the provided path and its subdirectories.

    The progress of the sync is displayed in the terminal.
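
To preview which files a recursive sync of a folder would pick up, you can walk the folder yourself before running the command. This sketch only mirrors the "every file in the path and its subdirectories" behavior described above using the Python standard library. It does not call MissingLink, and the helper name list_files_recursively and the sample folder layout are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List


def list_files_recursively(data_path: Path) -> List[Path]:
    """Return every regular file under data_path, relative to it, sorted."""
    return sorted(
        (p.relative_to(data_path) for p in data_path.rglob("*") if p.is_file()),
        key=lambda p: p.as_posix(),
    )


# Build a tiny throwaway folder tree to demonstrate the walk.
root = Path(tempfile.mkdtemp())
(root / "train" / "cats").mkdir(parents=True)
(root / "train" / "cats" / "1.jpg").touch()
(root / "train" / "cats" / "1.jpg.metadata.json").touch()
(root / "labels.csv").touch()

found = list_files_recursively(root)
for rel in found:
    print(rel.as_posix())
```

The relative paths in the listing show the folder structure that the sync is described as preserving.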

Once the data is synced, you can view it before committing it to the data volume. Staging lets you review the pending changes before you accept the sync and gives you the opportunity to add a comment that clarifies what is changing.

Dashboard shows data that is synced and staged

When you are satisfied with the staged changes, commit them to create a new version of your data.

For more information on staging and versioning, see Data Version Control.

Note

  1. The ml data sync command syncs only the changes that are not yet in the data volume. For example, if you sync a directory once and then change one file and sync again, only the changed file is uploaded to the data volume.
  2. If you add a metadata field that was already synced with a different casing, the original casing is used. For example, if a data volume already had a metadata field Dog, and a subsequent sync contains a metadata field dog, the new metadata is saved using the original Dog.
  • For a full description of the ml data sync command and the flags available, see the CLI reference.
  • There are more advanced examples of sync commands here.
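
The casing rule in note 2 can be illustrated with a short sketch. This is not MissingLink's implementation, only a model of the behavior described above: field names are matched case-insensitively, and the casing that was synced first is kept. The function name apply_original_casing and the sample fields are hypothetical.

```python
from typing import Dict, Iterable


def apply_original_casing(new_metadata: Dict[str, object],
                          existing_fields: Iterable[str]) -> Dict[str, object]:
    """Rename keys in new_metadata to match the casing of existing fields."""
    # Map lowercase name -> casing already stored in the data volume.
    canonical: Dict[str, str] = {}
    for name in existing_fields:
        canonical.setdefault(name.lower(), name)
    # Keys with no case-insensitive match keep the casing they arrived with.
    return {
        canonical.get(key.lower(), key): value
        for key, value in new_metadata.items()
    }


# The data volume already has a field "Dog"; a later sync sends "dog".
result = apply_original_casing({"dog": True, "breed": "beagle"}, ["Dog"])
print(result)
```

Running a check like this before a sync can help you spot fields whose casing differs from what is already in the data volume.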

Performing additional syncs

After the first data sync described here, you will typically perform additional syncs.

For every successive sync, MissingLink appends the new files it finds to the existing data and assigns the resulting data volume a unique ID. The folder being synced may or may not contain previously synced files; the mechanism is the same in either case.