Specifying Inputs and Outputs for Jobs

This section describes how to specify various inputs and outputs for running jobs.

Getting logs from the job

When a job is submitted, you can see its progress and output in real time by selecting Queues > job_name > Logs in the MissingLink dashboard or by running the job with the --attach flag. With --attach, instead of exiting after submission, the command waits and streams the job logs to the command line:

ml run xp --attach
[...]
2018-02-28T15:43:32+03:00: [Run Code INFO] Hello, World!
[...]

You can also attach to the logs of a job after it has been submitted by running the ml run logs command.
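For example, assuming the job ID reported at submission time is 1234 (the exact argument form may vary between CLI versions):

ml run logs 1234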

Providing code to the job

When submitting a job, you can also provide a reference to your source code. The steps for configuring your environment to support code tracking are provided in Setting up Code Tracking in MissingLink.

Providing data to your jobs

You can use the --data-volume and --data-query flags with the ml run xp command to specify data to be provided to the job. The data can be cloned into the /data folder of the job host or be made available as a Python iterator object for your experiment.

Using data volumes is not mandatory for resource management, but they enable several advanced features, such as:

  • Consistent Queries: Data queries include a version and a random seed value. As a result, the same query is guaranteed to return the same results in the same order every time it is executed.
  • Caching: If a file was downloaded during a previous execution of the job (regardless of the query used), it will not be fetched again.
  • Iterator: Using data iterators allows you to start training while the data is still downloading.
  • Data reproducibility: Just as with source tracking, you can always see the data query used by a job or an experiment from the MissingLink dashboard, compare the datasets used in different jobs, and reproduce the data in your computer using ml data clone.

In the following example, the data query is provided as command-line arguments, but all of these values can also be saved in your recipe file.

ml run xp --data-query '@version:123 @split:0.6:0.2:0.2 @datapoint_by:bucket AND (type:Annotation OR type:Image) @seed:132' --data-volume 312

You can use the MissingLink dashboard to generate and test the queries. You can also clone the data locally to test it, and use the --data-dest flag to produce a custom folder structure based on the metadata of the files returned by the query.
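As a minimal, illustrative sketch, a destination pattern can place each file in a folder derived from its metadata; the $phase and $name placeholders below are assumptions and depend on the metadata fields available in your query results:

ml run xp --data-volume 312 --data-query '@version:123 @seed:132' --data-dest '/data/$phase/$name'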

Exporting data from jobs

You can generate experiment artifacts during the execution of a job. These artifacts are attached to the experiments themselves and can be inspected or queried in the same way as data on a Data Volume.

Because the files in the output paths are treated as data volume files, you can also create FILENAME.metadata.json files containing metadata for your files. For more information, see Advanced metadata editing.
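For example, a job that writes predictions.csv to an output path could also write a sidecar file next to it; the file name and metadata keys here are illustrative:

cat > predictions.csv.metadata.json <<'EOF'
{"epoch": 12, "split": "validation"}
EOF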

For more detailed information, see Experiment Artifacts.

Working with preloaded data and persistent paths

MissingLink supports the import of preloaded data.

While the MissingLink Data Management feature does the heavy lifting of moving data for you, some users already have their data preloaded on their resource management servers, whether through a custom AMI in AWS, a local NFS mount, or another method.

Use the --persistent-path SOURCE TARGET flag in such a case. A persistent path is a mount from the SOURCE path on the server hosting the job execution to the TARGET path inside the Docker container that executes the job.

The command:

ml run xp --persistent-path '/mnt/external/input' '/input' --persistent-path '/mnt/external/results' '/results'
exposes two paths from your external storage:

  • /mnt/external/input becomes available as /input inside the container.
  • /mnt/external/results becomes available as /results inside the container.
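Inside the container, the mounted paths behave like ordinary directories. A minimal sketch of a job command using them (train.py and its arguments are hypothetical):

python train.py --train-dir /input --output-dir /results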

This feature simply gives you access to the path. It is up to you to:

  • Make sure you don't overwrite any files.
  • Track changes made to the files.
  • Track what data was available to every job.

Getting data from other sources

Because a job can run multiple commands, scripts, and any executable, you are free to load and persist your data to and from various sources:

  • For FTP or HTTP: You can use wget, ftp, and similar commands to download the data (see the sketch after this list).
  • For SFTP or rsync: The default organization key, or the certificate specified using --git-identity PATH, will be available as the default SSH key.
  • For preloaded data in the Docker image: When working with custom images, you can add arbitrary data to the Docker image, and that data will be available to all jobs started with that image.
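For example, the commands you submit with the job can fetch and unpack a dataset before training starts; the URL and paths below are placeholders:

wget -O /tmp/dataset.tar.gz http://example.com/dataset.tar.gz
tar -xzf /tmp/dataset.tar.gz -C /data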

Environment variables

When submitting a job, you can provide environment variables in standard or encrypted mode:

  • Standard: Standard environment variables are available to MissingLink (and visible to your organization only). They are shown as part of job and experiment information and comparisons, and when running in AWS they are used as instance tags in clear text. Unless an environment variable contains sensitive data such as passwords or tokens, use this option, as it gives you clarity and comparability. Standard environment variables can also be stored in the recipe file. Provide them using the --env KEY VALUE parameter.
  • Encrypted: Encrypted environment variables are encrypted before leaving your server and are unreadable by MissingLink. They cannot be seen in your dashboard or in the dashboards of others in your organization, and they are not used as instance tags. Submit them using the --secure-env KEY VALUE parameter, which can be provided more than once.
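For example (the key names are illustrative, and DB_PASSWORD is assumed to already be set in your shell environment):

ml run xp --env DATASET_VERSION 123 --secure-env DB_PASSWORD "$DB_PASSWORD"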

For more information regarding MissingLink's encryption procedures, see Using confidential data.

Note

Encrypted environment variables cannot be stored in the recipe file. It is recommended that you pass them as environment variables to avoid having them stored in your bash history file, for example: ml run xp --secure-env password "$PASSWORD".