Syncing Data Points
Once you have created a data volume, you can sync data to it using the MissingLink CLI.
Syncing data allows MissingLink’s SDK to copy the files to the storage bucket attached to the data volume. Each sync moves the files over to your storage bucket, indexes them, stages the new changes, and preserves the folder structure, which is useful if your code requires data to be organized in a specific way.
Preparing for sync
A data point can consist of one or more files. For example, in the VOC2007 dataset, each data point can have up to four files: the raw image, an XML file with annotations, a segmentation JPG file, and a classification JPG file.
To query and filter the dataset using MissingLink, add a companion file that shares the same name as the original file but with a .metadata.json extension. This file contains the attributes on which you wish to query the dataset and filter the data points. We'll refer to this file as queryable metadata.
For example, if you have a file named
myfile.jpg, the queryable metadata file name will be
- If your data point consists of more than one file, you will need to create a queryable metadata file for each one of the files.
- Ensure that you create the .metadata.json files in the same folder as your current dataset. The JSON file contains a dictionary of attributes that can have basic and complex type values (for example, strings, numbers, and JSON objects).
- It is recommended to add an attribute to the queryable metadata and assign it the same value for all the files that constitute the data point. For examples that show how doing so can be useful, see the @datapoint_by operators in Query Syntax.
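The preparation steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not MissingLink's own tooling: it assumes the queryable metadata file is named by appending .metadata.json to the full file name, and the helper names and the datapoint_id attribute are hypothetical choices made for illustration.

```python
import json
from pathlib import Path
from typing import Dict, List


def write_queryable_metadata(data_file: Path, attributes: Dict) -> Path:
    """Write a metadata file next to data_file and return its path."""
    # Assumption: the metadata file name is the full file name
    # plus ".metadata.json", placed in the same folder as the data file.
    sidecar = data_file.with_name(data_file.name + ".metadata.json")
    sidecar.write_text(json.dumps(attributes, indent=2))
    return sidecar


def annotate_data_point(files: List[Path], datapoint_id: str, extra: Dict) -> List[Path]:
    """Create one metadata file per file of a multi-file data point."""
    # Every file gets the same value for the shared attribute, so the
    # files can later be grouped back into one data point.
    # "datapoint_id" is a hypothetical attribute name.
    return [
        write_queryable_metadata(f, {**extra, "datapoint_id": datapoint_id})
        for f in files
    ]
```

For a data point made of an image and its annotation file, calling annotate_data_point with both paths writes two metadata files that carry the same datapoint_id value.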
Use the following CLI command:
ml data sync yourDataVolumeID --data-path pathToYourData
You can obtain the full command, including the ID of your particular data volume, from the Wizard screen in the MissingLink web dashboard.
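If you prefer to launch the sync from a script, the command can be wrapped as in the following sketch. It uses only the arguments shown in the command above; the function names are hypothetical, and it assumes the ml executable is installed and on your PATH.

```python
import subprocess


def build_sync_command(volume_id, data_path):
    """Return the argument list for the sync command shown in this section."""
    return ["ml", "data", "sync", str(volume_id), "--data-path", str(data_path)]


def run_sync(volume_id, data_path):
    """Run the sync; raises CalledProcessError if the CLI exits with an error."""
    # Assumption: the "ml" executable is installed and available on PATH.
    return subprocess.run(build_sync_command(volume_id, data_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the data path contains spaces.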
To display and run the sync command:
Select Wizard from the menu at the right end of the data volume to which you wish to sync data.
The following screen appears, displaying the full ml data sync command.
Click the Copy Command icon.
Paste the command at the command prompt and run it.
MissingLink recursively adds every file that is found in the directory or subdirectories of the path provided.
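To preview locally which files a recursive walk of the data path would pick up, a sketch like the following can help. It is only a local approximation: the function name is hypothetical, and any filtering that MissingLink itself applies is not described here.

```python
from pathlib import Path


def list_files_recursively(data_path):
    """Return every file under data_path as a sorted list of relative paths."""
    root = Path(data_path)
    # rglob("*") visits the directory and all of its subdirectories;
    # directories themselves are skipped so only files are listed.
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
```

The relative paths show the folder structure that the sync preserves.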
The progress of the sync is displayed in the terminal.
Once the data is synced, you can view it before committing it to the data volume. This staging step lets you review the pending changes before you accept the sync and add a comment that describes them.
When you are satisfied with the staged changes, commit them to create a new version of your data.
For more information on staging and versioning, see Data Version Control.
- The ml data sync command syncs only the changes that are not yet in the data volume. For example, if you sync a directory once and then change one file and sync again, only the changed file is uploaded to the data volume.
- If you add a metadata field that was already synced with a different casing, the original casing is used. For example, if a data volume already had a metadata field Dog, and a subsequent sync contains a metadata field dog, the new metadata is saved using the original casing, Dog.
- For a full description of the ml data sync command and the flags available, see the CLI reference.
- More advanced examples of sync commands are also available.
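The casing rule in the notes above can be illustrated with a short sketch. It only models the described behavior and is not MissingLink's implementation; the function name is hypothetical.

```python
def merge_metadata_fields(existing_fields, new_metadata):
    """Rename keys in new_metadata to the casing already used in the data volume."""
    # Map each lowercased field name to the casing that was seen first.
    canonical = {}
    for name in existing_fields:
        canonical.setdefault(name.lower(), name)

    merged = {}
    for key, value in new_metadata.items():
        # Reuse the original casing if the field exists; otherwise keep the new one.
        name = canonical.setdefault(key.lower(), key)
        merged[name] = value
    return merged
```

With an existing field Dog, a new record containing dog is stored under Dog, while a field that has not been seen before keeps the casing it arrived with.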
Performing additional syncs
After the first data sync described here, you are likely to perform additional syncs.
For every successive sync, MissingLink appends the new files it finds to the existing data and assigns the resulting data volume a unique ID. The folder being synced may or may not contain previously synced files; the mechanism is the same in all cases.
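Before a repeat sync, you may want to estimate locally which files are new or changed since the last one. The sketch below compares content hashes between two snapshots of the folder. How MissingLink itself detects changes is not described here, so the hashing approach and all function names are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(data_path):
    """Map each file's relative path to its content digest."""
    root = Path(data_path)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in root.rglob("*")
        if p.is_file()
    }


def changed_files(previous, current):
    """Return paths that are new or whose content differs from the previous snapshot."""
    return sorted(
        name for name, digest in current.items() if previous.get(name) != digest
    )
```

Take one snapshot after a sync and another before the next; the paths returned by changed_files are the ones you would expect to see uploaded.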