Specifying Inputs and Outputs for Jobs
This section describes how to specify various inputs and outputs for running jobs.
Getting logs from the job
When a job is submitted, you can see its progress and output in real time by selecting Queues > job_name > Logs in the MissingLink dashboard or by running the job with the `--attach` flag. When a job is submitted with the `--attach` flag, instead of exiting after the submission, the command waits and prints the job logs in the command line:

```
ml run xp --attach
[...]
2018-02-28T15:43:32+03:00: [Run Code INFO] Hello, World!
[...]
```

You can also attach to the logs of a job after it has been submitted by running the `ml run logs` command.
Providing code to the job
When submitting a job, you can also provide a reference to your source code. The steps for configuring your environment to support code tracking are provided in Setting up Code Tracking in MissingLink.
Providing data to your jobs
You can use the `--data-query` flag with the `ml run xp` command to specify data to be provided to the job. The data can be cloned into the `/data` folder of the job host or be made available as a Python iterator object for your experiment.
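The iterator behavior can be pictured with a minimal, self-contained sketch (plain Python, not the MissingLink SDK): a background thread keeps "downloading" datapoints into a queue while the consumer starts iterating over them immediately.

```python
import queue
import threading

def prefetching_iterator(fetch_one, count):
    """Yield items as soon as they arrive, while a background
    thread keeps downloading the rest (a sentinel marks the end)."""
    q = queue.Queue(maxsize=8)
    done = object()

    def producer():
        for i in range(count):
            q.put(fetch_one(i))  # simulate downloading one datapoint
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item

# Training can start on item 0 before the last item has been "downloaded".
items = list(prefetching_iterator(lambda i: {"id": i}, 100))
```

The real iterator hides the download details in the same way: your training loop only sees ready datapoints, in order.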
Using data volumes is not mandatory for resource management. Still, using them enables many advanced features, such as:
- Consistent Queries: Data queries include version and random seed value. As a result, the same query is guaranteed to return the same results in the same order every time it is executed.
- Caching: If a file was downloaded during a previous execution of the job (regardless of the query used), it will not be fetched again.
- Iterator: Using data iterators allows you to start training while the data is still downloading.
- Data reproducibility: Just as with source tracking, you can always see the data query used by a job or an experiment in the MissingLink dashboard, compare the datasets used in different jobs, and reproduce the data on your computer using `ml data clone`.
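The guarantee behind consistent queries can be illustrated with a small sketch (plain Python, not the actual query engine): the same seed always produces the same shuffle, and therefore the same train/validation/test split.

```python
import random

def split(ids, ratios=(0.6, 0.2, 0.2), seed=132):
    """Deterministically shuffle and split a list of datapoint ids."""
    ids = sorted(ids)
    random.Random(seed).shuffle(ids)  # same seed -> same order, every run
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

first = split(range(100))
second = split(range(100))  # re-running the "query" gives identical results
```

This is why a query such as `@split:0.6:0.2:0.2 ... @seed:132` is reproducible: the randomness is fixed by the recorded seed, not by when the job happens to run.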
In the following example, the data query is provided as command line arguments, but all of these values can also be saved to your recipe file:

```
ml run xp --data-query '@version:123 @split:0.6:0.2:0.2 @datapoint_by:bucket AND (type:Annotation OR type:Image) @seed:132' --data-volume 312
```

You can use the `--data-dest` flag to produce a custom folder structure based on the metadata of the query files.
- For more information on the query syntax, see Query Syntax.
- For more information on cloning data, see Cloning Data.
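The idea behind a metadata-driven folder structure can be sketched as a template filled in from each file's metadata (the template syntax and field names here are invented for illustration):

```python
def dest_path(template, metadata):
    """Build a destination path from a metadata dict,
    e.g. '$type/$name' -> 'Image/cat_01.png'."""
    path = template
    for key, value in metadata.items():
        path = path.replace("$" + key, str(value))
    return path

p = dest_path("$type/$split/$name",
              {"type": "Image", "split": "train", "name": "cat_01.png"})
```

Each query file would then be placed under a folder derived from its own metadata rather than in one flat directory.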
Exporting data from jobs
You can generate experiment artifacts during the execution of a job. These artifacts are attached to the experiments themselves and can be inspected or queried in the same way that data is on a Data Volume.
As the files in the output paths are treated as data volume files, you can also create `FILENAME.metadata.json` files containing metadata for your files. For more information, see Advanced metadata editing.
For more detailed information, see Experiment Artifacts.
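A metadata sidecar is simply a JSON file written next to the artifact, named `FILENAME.metadata.json`. A minimal sketch of producing one (the metadata keys are arbitrary examples):

```python
import json
import tempfile
from pathlib import Path

def write_metadata(artifact_path, metadata):
    """Write FILENAME.metadata.json next to the artifact file."""
    sidecar = Path(str(artifact_path) + ".metadata.json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    return sidecar

out_dir = Path(tempfile.mkdtemp())  # stand-in for the job's output path
artifact = out_dir / "model.ckpt"
artifact.write_bytes(b"fake checkpoint")
sidecar = write_metadata(artifact, {"epoch": 12, "val_acc": 0.93})
```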
Working with preloaded data and persistent paths
MissingLink supports the import of preloaded data.
While the MissingLink Data Management feature does much of the heavy lifting for you, some users might already have their data preloaded on their resource management servers, whether by using a custom AMI in AWS, a local NFS mount, or another method. You can use the `--persistent-path SOURCE TARGET` flag in such a case. A persistent path is a mount from the `SOURCE` path on the server that hosts the job execution to the `TARGET` path of the Docker container that executes the job.
```
ml run xp --persistent-path '/mnt/external/input' '/input' --persistent-path '/mnt/external/results' '/results'
```

In this example:

- `/mnt/external/input` becomes available as `/input` inside the container.
- `/mnt/external/results` becomes available as `/results` inside the container.
This feature simply gives you access to the path. It is up to you to:
- Make sure you don't overwrite any files.
- Track changes made to the files.
- Track what data was available to every job.
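One simple way to handle the change-tracking point yourself is to record a checksum manifest of the persistent path before and after the job runs. A sketch (not a built-in feature; the directory here is a temporary stand-in for the mount):

```python
import hashlib
import tempfile
from pathlib import Path

def manifest(root):
    """Map each file under root to the SHA-256 of its contents."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(root).rglob("*"))
        if p.is_file()
    }

root = Path(tempfile.mkdtemp())        # in practice: the persistent path
(root / "a.txt").write_text("hello")
before = manifest(root)                # snapshot before the job
(root / "a.txt").write_text("changed")  # the job modifies a file
after = manifest(root)                 # snapshot after the job
changed = [f for f in before if before[f] != after.get(f)]
```

Comparing the two manifests tells you exactly which files the job touched.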
Getting data from other sources
Because you can provide multiple commands, scripts, and any executable, you are free to load and persist your data to and from various sources:
- For FTP or HTTP: You can use `wget`, `ftp`, and similar commands to download the data.
- For SFTP or rsync: The default organization key, or the certificate specified using `--git-identity PATH`, will be available as the default SSH identity inside the container.
- For preloaded data in the Docker image: When working with custom images, you can add arbitrary data to the Docker image, and the data will be available to all jobs started with that image.
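For the FTP/HTTP case, a download step can also be done with the Python standard library instead of `wget`. A minimal sketch (the URL is a placeholder, and the actual fetch is commented out):

```python
import posixpath
from urllib.parse import urlparse
from urllib.request import urlretrieve  # would perform the actual download

def local_name(url):
    """Derive a local filename from a download URL."""
    return posixpath.basename(urlparse(url).path) or "download"

name = local_name("https://example.com/datasets/train.tar.gz")
# Inside the job, the fetch itself would be:
# urlretrieve("https://example.com/datasets/train.tar.gz", name)
```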
Providing environment variables to jobs
When submitting a job, you can provide environment variables in standard or encrypted mode:
- Standard: Standard environment variables are available to MissingLink (to your organization only!). This means we show them as part of the job and experiment information and in comparisons, and, when running in AWS, the clear environment variables are used as instance tags. Unless an environment variable contains sensitive data such as passwords or tokens, you should use this option, as it gives you clarity and comparability. Standard environment variables can also be stored in the recipe file. You can provide such variables using the `--env KEY VALUE` parameter.
- Encrypted: Encrypted environment variables are encrypted before leaving your server and are unreadable by MissingLink. Secure environment variables cannot be seen in your dashboard or in the dashboards of others in your organization, and they are not used as instance tags. Secure environment variables are submitted using the `--secure-env KEY VALUE` parameter, which can be provided more than once.
For more information regarding MissingLink's encryption procedures, see Using confidential data.
Encrypted environment variables cannot be stored in the recipe file. It is recommended that you pass them as environment variables to avoid having them stored in your bash history file:

```
ml run xp --secure-env password "$PASSWORD"
```
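Following that recommendation, a submit command can also be assembled programmatically so the secret is read from the environment rather than typed on the command line. A sketch (`JOB_PASSWORD` is a placeholder name; the command is built but not executed):

```python
import os

def secure_env_args(name, env_var):
    """Build --secure-env arguments, reading the value from the
    environment so it never appears in shell history."""
    value = os.environ[env_var]  # raises KeyError if unset: fail fast
    return ["--secure-env", name, value]

os.environ["JOB_PASSWORD"] = "s3cret"  # in practice, set by your shell
args = ["ml", "run", "xp"] + secure_env_args("password", "JOB_PASSWORD")
```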