Specifying Inputs and Outputs for Jobs

This section describes how to specify various inputs and outputs for running jobs.

Getting logs from the job

When a job is submitted, you can see its progress and output in real time by clicking on the job in the MissingLink dashboard or by running the job with the --attach flag. When a job is submitted with the --attach flag, instead of exiting after the submission, the command will wait and print job logs in the command line:

$ ml run xp --attach
[...]
2018-01-31T15:43:32+03:00: [Run Code INFO] Hello, World!
[...]

You can also attach to the logs of a job after it has been submitted by running the ml run logs command.
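For example, assuming you have a job identifier from the dashboard or from the submission output (the placeholder below is illustrative; check the built-in help of ml run logs for the exact arguments your CLI version expects):

$ ml run logs <job-id>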

Providing code to the job

If required, when submitting a job, you can also provide a reference to your source code.

Use MissingLink's source tracking to snapshot the current state of your code and automatically provide the required parameters for the job submission, or manually provide the Git repository and (optionally) the target branch or tag.

When using source tracking, MissingLink commits your current changes to a side repository in your Git hosting provider. This means that all of your uncommitted changes are available to the job and are versioned, so you can go back and see the exact code that was provided to the job. If you are not using the source tracking feature, it is your responsibility to commit and push the desired changes, and to point the job at them by adding the --git-repo and --git-tag flags to the command.

  • The code must be provided in Git SSH format; HTTPS repositories are not supported.
  • If not provided, --git-tag defaults to master.
  • If the default organization credentials don't have access to the repo, pass --git-identity PATH to specify an alternative path to Git credentials (SSH key).

Note

As with any other sensitive data, the key is encrypted before leaving your computer and can't be decrypted without your organization key. For more information regarding MissingLink's encryption policy, see Using confidential data.

When using source tracking, you can use the --source-dir PATH flag to provide the path to the directory that holds the code files and tracking repository configuration.

So, assuming you don't have other instructions in your recipe file,

$ ml run xp 

is equivalent to:

$ ml run xp --source-dir . 

and

$ ml run xp --git-repo [email protected]:missinglinkai/empty.git

is equivalent to:

$ ml run xp --git-repo [email protected]:missinglinkai/empty.git --git-tag master

In addition, you can specify all of the arguments explicitly:

$ ml run xp --git-repo [email protected]:missinglinkai/empty.git --git-tag my_tests_branch --git-identity ~/.ssh/id_rsa

Note

Git LFS and submodules are supported as long as the submodules are in SSH format or hosted on GitHub or Bitbucket (where MissingLink converts HTTPS URLs to SSH format automatically).

Providing data to your jobs

You can use the --data-volume and --data-query flags with the ml run xp command to specify data to be provided to the job. The data can be cloned into the /data folder of the job host or made available as a Python iterator object for your experiment.

Using data volumes is not mandatory for resource management. Still, using them enables many advanced features, such as:

  • Consistent queries: Data queries include a version and a random seed value. As a result, the same query is guaranteed to return the same results in the same order every time it is executed.
  • Caching: If a file was downloaded during a previous execution of the job (regardless of the query used), it will not be fetched again.
  • Iterator: Using data iterators allows you to start training while the data is still downloading.
  • Data reproducibility: Just as with source tracking, you can always see the data query used by a job or an experiment from the MissingLink dashboard, compare the data sets used in different jobs, and reproduce the data on your computer using ml data clone.

In the following example, data is provided as command line arguments, but all of these values can also be saved to your recipe file.

$ ml run xp --data-query '@version:123 @split:0.6:0.2:0.2 @datapoint_by:bucket AND (type:Annotation OR type:Image) @seed:132' --data-volume 312

You can use the MissingLink dashboard to generate and test the queries. You can also clone the data locally to test it and use the --data-dest flag to produce a custom folder structure based on the metadata of the query files.

Exporting data from jobs

If you use MissingLink's Data Management feature, you can use the --output-paths parameter to specify a folder that will be exported at the end of the job.

$ ml run xp --command 'ls -al > /output/ls.txt' 

The default output path is /output. You can specify more than one path.

$ ml run xp --command 'ls -al > /results/ls.txt' --output-paths '/results'

Once the job is completed (or has been stopped), any files found in your output paths are saved to Artifact Management.

As the files in the output paths are treated as data volume files, you can also create FILENAME.metadata.json files containing metadata for your files. For more information, see Advanced metadata editing.
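For example, building on the earlier ls example, a hedged sketch of a job that writes an output file together with a metadata file for it (the metadata keys shown are hypothetical, and the exact file-naming convention and supported fields are described in Advanced metadata editing):

$ ml run xp --command 'ls -al > /output/ls.txt && echo "{\"stage\": \"debug\", \"rows\": 42}" > /output/ls.txt.metadata.json'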

Working with preloaded data and persistent paths

MissingLink supports the import of preloaded data.

While the MissingLink Data Management feature is very useful for doing the heavy lifting of data, some users might already have their data preloaded on their resource management servers, whether through a custom AMI in AWS, a local NFS mount, or another method.

Use the --persistent-path SOURCE TARGET flag in such cases. A persistent path is a mount from the SOURCE path on the server hosting the job execution to the TARGET path in the Docker container that executes the job.

The command:

$ ml run xp --persistent-path '/mnt/external/input' '/input' --persistent-path '/mnt/external/results' '/results' 

exposes two paths from your external storage:

  • /mnt/external/input becomes available as /input inside the container.
  • /mnt/external/results becomes available as /results inside the container.

This feature simply gives you access to the path. It is up to you to:

  • Make sure you don't override any files.
  • Track changes made to the files.
  • Track what data was available to every job.

Getting data from other sources

Because you can provide multiple commands, scripts, and arbitrary executables, you are free to load and persist your data to and from various sources:

  • For FTP or HTTP: You can use wget, ftp, and similar commands to download the data (see the sketch after this list).
  • For SFTP or rsync: The default organization key, or the certificate specified using --git-identity PATH, will be available as the default SSH key.
  • For preloaded data in the Docker image: When working with custom images, you can add arbitrary data to the Docker image, and that data will be available to all jobs started with that image.
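For example, a hedged sketch of a job that downloads and extracts a dataset over HTTP before running a training script; the URL and train.py are placeholders chosen for illustration, not part of the MissingLink documentation:

$ ml run xp --command 'wget -P /data https://example.com/dataset.tar.gz && tar -xzf /data/dataset.tar.gz -C /data && python train.py'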

Environment variables

When submitting a job, you can provide environment variables in standard or encrypted mode:

  • Standard: Environment variables are visible to MissingLink (to your organization only). They are shown as part of the job and experiment information and in comparisons, and when running in AWS the cleartext environment variables are used as instance tags. Unless an environment variable contains sensitive data such as passwords or tokens, use this option, as it gives you clarity and comparability. Standard environment variables can also be stored in the recipe file. Provide them using the --env KEY VALUE parameter.
  • Encrypted: Encrypted environment variables are encrypted before leaving your server and are unreadable by MissingLink. They cannot be seen in your dashboard or in the dashboards of others in your organization, and they are not used as instance tags. Submit them using the --secure-env KEY VALUE parameter, which can be provided more than once (see the example after this list).
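For example, a hedged sketch that combines both modes (the variable names and values are placeholders chosen for illustration):

$ ml run xp --env LEARNING_RATE 0.01 --env BATCH_SIZE 64 --secure-env DB_PASSWORD $DB_PASSWORD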

For more information regarding MissingLink's encryption procedures, see Using confidential data.

Note

Encrypted environment variables cannot be stored in the recipe file. To avoid having secrets stored in your Bash history file, it is recommended that you pass them as shell environment variables, for example: ml run xp --secure-env password $PASSWORD.