Cloning Data Volumes

You can clone data volumes either through the MissingLink dashboard or by using MissingLink CLI commands.

Another method of supplying data to the training machine during training is to stream it using iterators. For more information, see Streaming Data Directly Into the Training Process Using Native Iterators.

You can use the MissingLink dashboard to clone data to a new location, based on the needs of your experiment.

For example, suppose you want to turn a query that you ran on your data into a sub-dataset for local testing. Once the query returns the results you want, click the Copy Clone icon to the left of the Run button to get the corresponding clone command.

Suppose the query you ran is:

Follow_up_Number:<5 and Patient_Age:>18 and Patient_Age:<55 and Patient_Gender:F @sample:0.1 @split:0.6:0.2:0.2

then the resulting clone command will be:

ml data clone 5681801511043072 --query "(@version:7469f908300e688a9bcbf1f37cf3fa44d576798c) AND (Follow_up_Number:<5 and Patient_Age:>18 and Patient_Age:<55 and Patient_Gender:F @sample:0.1 @split:0.6:0.2:0.2) @seed:1337" --dest-folder ./
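
The structure of the generated command can be sketched as follows: the dashboard pins the data version, wraps your query in parentheses, and appends a seed. This is a minimal illustration that only reassembles the example above. The variable names are hypothetical, and the script prints the command for review rather than running it.

```shell
#!/bin/sh
# Values copied from the example above; the variable names are illustrative only.
volume_id="5681801511043072"
version="7469f908300e688a9bcbf1f37cf3fa44d576798c"
user_query="Follow_up_Number:<5 and Patient_Age:>18 and Patient_Age:<55 and Patient_Gender:F @sample:0.1 @split:0.6:0.2:0.2"
seed="1337"

# Pin the version, wrap the dashboard query in parentheses, then append the seed.
full_query="(@version:${version}) AND (${user_query}) @seed:${seed}"

# Assemble the command as text so it can be inspected before it is executed.
clone_cmd=$(printf 'ml data clone %s --query "%s" --dest-folder %s' \
    "$volume_id" "$full_query" "./")
printf '%s\n' "$clone_cmd"
```

Keeping the version and seed in the query means that rerunning the printed command on another machine should select the same data points.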

Note

The clone command must be executed on the specific machine where you wish to access the cloned data. If you move to another machine, you must execute the command again on the other machine to gain access to the cloned data.

You can also clone data by using the following MissingLink command:

ml data clone

For more information on the command and the flags available, see the CLI reference.

The MissingLink CLI clone command automatically translates several MissingLink variables. You can use these variables in the values of the --dest-folder and --dest-file flags.

For more information, see MissingLink variables with special meaning for cloning.

Examples

For the purpose of the following examples, suppose the dataset contains data points with a single attribute in the metadata named type_of_animal that has the values: Dog, Cat, and Fish.

1) Run the following command:

ml data clone  --query '@version:versionID
    AND @sample:0.2 AND @split:0.5:0.25:0.25 @seed:1337' \
    --dest-folder '/destinationPath/$@/'

to create three folders under destinationPath named train, test, and validation, and copy the data points into each folder according to the @split ratio.

For more information on using the @sample, @split, and @seed operators in CLI commands, see Special Query Operators.
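
As a rough guide to how many data points to expect in each folder, the arithmetic can be sketched as below. This assumes that @sample:0.2 keeps 20% of the matching data points, that @split then divides the sampled subset in the order train, test, validation, and that both ratios apply exactly. The total of 1000 is a hypothetical figure, and the counts the CLI actually produces may differ slightly.

```shell
#!/bin/sh
# Hypothetical number of data points matching the query.
total=1000

# @sample:0.2 -> keep 20% of the matching data points (assumed exact).
sampled=$(( total * 20 / 100 ))

# @split:0.5:0.25:0.25 -> assumed order: train, test, validation.
train=$(( sampled * 50 / 100 ))
test_count=$(( sampled * 25 / 100 ))
validation=$(( sampled * 25 / 100 ))

printf 'sampled=%d train=%d test=%d validation=%d\n' \
    "$sampled" "$train" "$test_count" "$validation"
# prints: sampled=200 train=100 test=50 validation=50
```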

2) Run:

ml data clone  --query '@version:versionID
    AND @sample:0.2 AND @split:0.5:0.25:0.25 @seed:1337' \
    --dest-folder '/destinationPath/$@/' --dest-file '$name'

to save each copied data point under its original filename.

3) Run:

ml data clone  --query '@version:versionID
    AND @sample:0.2 AND @split:0.5:0.25:0.25 @seed:1337' \
    --dest-folder '/destinationPath/$@/$dir' --dest-file '$name'

to recreate, under the train, test, and validation folders, the original folder structure that the data points had when they were added with the sync command.

4) Run:

ml data clone  --query '@version:versionID
    AND @sample:0.2 AND @split:0.5:0.25:0.25 @seed:1337' \
    --dest-folder '/destinationPath/$@/$type_of_animal' --dest-file '$name'

to create Dog, Cat, and Fish subfolders under the train, test, and validation folders and copy each data point into the subfolder that matches its type_of_animal attribute.
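
To compare the destination templates from examples 2 to 4 side by side, the sketch below resolves them for a single data point. The resolve_path helper is hypothetical and only mimics the substitution for illustration; it is not how the CLI is implemented. It also assumes that the --dest-folder and --dest-file values are joined with a single slash. The sample data point (rex.jpg in images/batch1, type_of_animal Dog, assigned to train) is invented.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only (not the CLI's implementation).
# $1 = --dest-folder template   $2 = --dest-file template
# $3 = split phase              $4 = original directory
# $5 = original filename        $6 = type_of_animal value
# Works for simple values that contain no '|', '&', or backslash characters.
resolve_path() {
    printf '%s/%s\n' "${1%/}" "$2" | sed \
        -e 's|\$@|'"$3"'|g' \
        -e 's|\$dir|'"$4"'|g' \
        -e 's|\$name|'"$5"'|g' \
        -e 's|\$type_of_animal|'"$6"'|g'
}

# Example 2: original filename directly under the split folder.
resolve_path '/destinationPath/$@/' '$name' train images/batch1 rex.jpg Dog
# prints: /destinationPath/train/rex.jpg

# Example 3: original folder structure preserved under the split folder.
resolve_path '/destinationPath/$@/$dir' '$name' train images/batch1 rex.jpg Dog
# prints: /destinationPath/train/images/batch1/rex.jpg

# Example 4: grouped by the type_of_animal metadata attribute.
resolve_path '/destinationPath/$@/$type_of_animal' '$name' train images/batch1 rex.jpg Dog
# prints: /destinationPath/train/Dog/rex.jpg
```

The templates are single-quoted, as in the commands above, so that the shell passes $@, $dir, and $name through unchanged instead of expanding them itself.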