Syncing Data Points
This topic shows you how to sync data to a data volume on MissingLink.ai.
A data point can consist of one or more files. For example, in the VOC2007 dataset, each data point can have up to four files: the raw image, an XML file with annotations, a segmentation JPG file, and a classification JPG file.
To query and filter the dataset using MissingLink, you'll need to add an additional file that shares the same name as the original file but with a .metadata.json extension. This file contains the attributes on which you wish to query the dataset and filter the data points. On this page, we'll refer to this file as queryable metadata.
For example, if you have a file named "myfile.jpg", the queryable metadata file name will be "myfile.jpg.metadata.json".
- If your data point consists of more than one file, you will need to create a queryable metadata file for each one of the files.
- It is recommended to add an attribute named data_point_id to the queryable metadata file and give it the same value for all the files that constitute the data point. For more information, see the @datapoint_by operators in our Query Syntax.
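As an illustration of the naming rule and the data_point_id recommendation above, here is a minimal Python sketch that writes one queryable metadata file per data file. The helper names (write_metadata, write_data_point_metadata) and the example attribute "split" are hypothetical and not part of MissingLink; only the .metadata.json suffix and the data_point_id attribute come from this page.

```python
import json
from pathlib import Path


def write_metadata(data_file, attributes):
    """Write <data_file>.metadata.json next to data_file and return its path."""
    data_file = Path(data_file)
    # "myfile.jpg" -> "myfile.jpg.metadata.json" (suffix is appended, not replaced)
    sidecar = data_file.with_name(data_file.name + ".metadata.json")
    sidecar.write_text(json.dumps(attributes, indent=2))
    return sidecar


def write_data_point_metadata(files, data_point_id, extra=None):
    """Create one metadata file per data file, all sharing one data_point_id."""
    attributes = dict(extra or {})
    attributes["data_point_id"] = data_point_id
    return [write_metadata(f, attributes) for f in files]
```

For a data point made of 000001.jpg and 000001.xml, calling write_data_point_metadata([jpg_path, xml_path], "000001") would produce 000001.jpg.metadata.json and 000001.xml.metadata.json in the same folder, each carrying the same data_point_id value.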
Sync Data to a Data Volume With MissingLink CLI
You can copy the command from the wizard screen of MissingLink's web console:
ml data sync yourDataVolumeID --dataPath pathToYourData
There are more examples of sync commands here.
- Don't forget to create the .metadata.json files in the same folder as your current dataset. The JSON file contains a flat dictionary of attributes that can have only basic type values (string, number, boolean).
- MissingLink will recursively add every file that is found in the directory or subdirectories of the path provided.
- The ml data sync command syncs only the changes that are not yet in the data volume. For example, if you sync a directory once, then change one file and sync again, only the changed file is uploaded to the data volume.
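Before syncing, it can help to check that every metadata file really is a flat dictionary of basic values. The following Python sketch is one possible local check, not a MissingLink tool: the function names are hypothetical, and treating null, lists, and nested objects as invalid is an assumption based on the type list above.

```python
import json
from pathlib import Path

# Types allowed as attribute values: string, number, boolean.
BASIC_TYPES = (str, int, float, bool)


def invalid_attributes(metadata):
    """Return the keys whose values are not a string, number, or boolean."""
    if not isinstance(metadata, dict):
        raise TypeError("metadata must be a JSON object")
    return [k for k, v in metadata.items() if not isinstance(v, BASIC_TYPES)]


def check_metadata_files(root):
    """Map each *.metadata.json under root (recursively) to its invalid keys."""
    problems = {}
    for path in sorted(Path(root).rglob("*.metadata.json")):
        bad = invalid_attributes(json.loads(path.read_text()))
        if bad:
            problems[str(path)] = bad
    return problems
```

An empty dictionary returned by check_metadata_files(pathToYourData) means no nested or unsupported values were found under that path.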
After syncing data to the data volume with the MissingLink CLI, you will be able to see the data in the dashboard under the data volume in the staging section.
Flags for adding data
Run the following command to view the flags available for the sync command:
ml data sync --help
dataPath: The path to the data that should be added.
commit: Indicates that after the sync is complete, the new data points should be committed to a new version.
processes: The number of processes that should be used to add the files. The default is the number of cores multiplied by four.
no_progressbar: Hides the progress bar during the add process.
enable_progressbar (default): Shows the progress bar during the add process.
resume: Resumes the sync in case it failed before completing.
Resuming a failed sync command
The sync command might fail before completing, for example because of connectivity issues between the local machine and the cloud. To resume the sync, use the MissingLink CLI.
Whenever the sync command fails, the MissingLink CLI prints a resume token.
Running the following command from the same machine continues the sync from the point at which it failed:
ml data sync yourDataVolumeID --dataPath pathToYourData --resume yourResumeToken
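If you drive the CLI from a script, the sync and resume commands shown on this page can be assembled as an argument list. This is only a sketch: build_sync_command is a hypothetical helper, and it uses only the --dataPath and --resume flags that appear in the commands above.

```python
def build_sync_command(volume_id, data_path, resume_token=None):
    """Build the argument list for `ml data sync`, optionally resuming."""
    cmd = ["ml", "data", "sync", str(volume_id), "--dataPath", str(data_path)]
    if resume_token:
        # Token printed by the MissingLink CLI when a previous sync failed.
        cmd += ["--resume", str(resume_token)]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True) on the same machine on which the original sync failed.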