Specifying Inputs and Outputs for Jobs
This section describes how to specify various inputs and outputs for running jobs.
Getting logs from the job
When a job is submitted, you can see its progress and output in real time by clicking on the job in the MissingLink dashboard or by running the job with the
--attach flag. When a job is submitted with the
--attach flag, instead of exiting after the submission, the command will wait and print job logs in the command line:
$ ml run xp --attach
[...]
2018-02-31T15:43:32+03:00: [Run Code INFO] Hello, World!
[...]
You can also view the logs of a job after it has been submitted by running the
ml run logs command.
Providing code to the job
If required, when submitting a job, you can also provide a reference to your source code.
You can either use MissingLink's source tracking, which snapshots the current state of your code and automatically provides the required parameters for the job submission, or manually provide the Git repository and (optionally) the target branch or tag.
When using source tracking MissingLink commits your current changes to a side repository in your Git hosting provider. This means that all of your uncommitted changes will be available to the job and versioned so you can go back and see the code that was provided to the job. If you are not using the source tracking feature, it is your responsibility to commit and push desired changes by adding the
--git-tag flags to the command.
- The code must be provided in
https repositories are not supported.
- If not provided,
- If the default organization credentials don't have access to the repo, pass
--git-identity PATH to specify an alternative path to Git credentials (SSH key).
As with any other sensitive data, the key is encrypted before leaving your computer and can't be decrypted without your organization key. For more information regarding MissingLink's encryption policy, see Using confidential data.
When using source tracking, you can use the
--source-dir PATH flag to provide the path to the directory that holds the code files and tracking repository configuration.
So, assuming you don't have other instructions in your recipe file,
$ ml run xp
is equivalent to:
$ ml run xp --source-dir .
$ ml run xp --git-repo [email protected]:missinglinkai/empty.git
is equivalent to:
$ ml run xp --git-repo [email protected]:missinglinkai/empty.git --git-tag master
In addition, you can specify all of the arguments explicitly:
$ ml run xp --git-repo [email protected]:missinglinkai/empty.git --git-tag my_tests_branch --git-identity ~/.ssh/id_rsa
Git LFS and submodules are supported as long as the submodules are in SSH format or hosted in GitHub or BitBucket (where MissingLink converts https formats to Git formats automatically).
Providing data to your jobs
You can use the
--data-query flags with the
ml run xp command to specify data to be provided to the job.
The data can be cloned into the
/data folder of the job host or be made available to your experiment as a Python
iterator object.
Using data volumes is not mandatory for resource management. Still, using them enables many advanced features, such as:
- Consistent Queries: Data queries include version and random seed value. As a result, the same query is guaranteed to return the same results in the same order every time it is executed.
- Caching: If a file was downloaded during a previous execution of the job (regardless of the query used), it will not be fetched again.
- Iterator: Using data iterators allows you to start training while the data is still downloading.
- Data reproducibility: Just as with source tracking, you can always see the data query used by a job or an experiment in the MissingLink dashboard, compare the data sets used in different jobs, and reproduce the data on your computer using
ml data clone.
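To illustrate why a fixed seed makes a split reproducible, here is a minimal Python sketch. It is not MissingLink's implementation; the function name `seeded_split`, the split names and the default values are assumptions chosen for illustration. The idea is only that sorting the items first and shuffling with a seeded generator yields the same partition on every run.

```python
import random
from typing import Dict, List, Sequence, Tuple


def seeded_split(
    items: Sequence[str],
    ratios: Tuple[float, float, float] = (0.6, 0.2, 0.2),  # illustrative ratios
    seed: int = 132,  # illustrative seed
) -> Dict[str, List[str]]:
    """Partition items into three groups, deterministically for a given seed."""
    # Sort first so the result does not depend on the order of the input.
    ordered = sorted(items)
    # A private generator keeps the shuffle independent of global random state.
    random.Random(seed).shuffle(ordered)
    train_end = round(len(ordered) * ratios[0])
    validation_end = train_end + round(len(ordered) * ratios[1])
    return {
        "train": ordered[:train_end],
        "validation": ordered[train_end:validation_end],
        "test": ordered[validation_end:],  # whatever remains
    }


if __name__ == "__main__":
    files = [f"file_{i}.jpg" for i in range(10)]
    # Same items and same seed give the same split, whatever the input order.
    print(seeded_split(files) == seeded_split(list(reversed(files))))
```

Changing the seed or the set of items changes the partition, which is why a reproducible query has to pin both the data version and the seed.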
In the following example, data is provided as command line arguments but all of these values can be saved to your recipe file.
$ ml run xp --data-query '@version:123 @split:0.6:0.2:0.2 @datapoint_by:bucket AND (type:Annotation OR type:Image) @seed:132' --data-volume 312
You can use the MissingLink dashboard to generate and test the queries. You can also clone the data locally to test it and use the
--data-dest flag to produce a custom folder structure based on the metadata of the query files.
- For more information on the query syntax, see Query Syntax.
- For more information on cloning data, see Cloning Data.
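Because a data query contains spaces and parentheses, it must reach the command as a single quoted argument. If you launch jobs from a script, one way to get the quoting right is to build the argument list in Python and let the standard library quote it. The sketch below only assembles and prints the command; it uses the flags and values from the example in this section and does not call MissingLink.

```python
import shlex

# The query is one argument, even though it contains spaces and parentheses.
query = (
    "@version:123 @split:0.6:0.2:0.2 @datapoint_by:bucket "
    "AND (type:Annotation OR type:Image) @seed:132"
)

# Each list element is passed to the program as exactly one argument.
argv = ["ml", "run", "xp", "--data-query", query, "--data-volume", "312"]

# shlex.join (Python 3.8+) quotes every element so the line is safe to paste into a shell.
command = shlex.join(argv)

if __name__ == "__main__":
    print(command)
```

Passing `argv` directly to `subprocess.run` avoids the shell altogether, so no quoting is needed in that case.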
Exporting data from jobs
When integrated with MissingLink's Data Management feature, you can use the
--output-paths parameter to specify a folder that will be exported at the end of the job.
$ ml run xp --command 'ls -al > /output/ls.txt'
The default output path is
/output. You can specify more than one path.
$ ml run xp --command 'ls -al > /results/ls.txt' --output-paths '/results'
Once the job is completed (or has been stopped), any files found in your
output paths will be saved to Artifact Management.
As the files in the output paths are treated as data volume files, you can also create
FILENAME.metadata.json files containing metadata for your files. For more information, see Advanced metadata editing.
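As a rough illustration, a job could write a metadata file next to each output file as follows. This is a sketch under assumptions: the helper `write_with_metadata` is hypothetical, the metadata is assumed to be a flat JSON object, and FILENAME is assumed to be the full file name including its extension. Check Advanced metadata editing for the exact format MissingLink expects.

```python
import json
from pathlib import Path
from typing import Any, Dict


def write_with_metadata(path: Path, content: str, metadata: Dict[str, Any]) -> Path:
    """Write an output file plus a sidecar metadata file, and return the sidecar path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content)
    # Assumption: the sidecar is named <full file name>.metadata.json,
    # for example ls.txt -> ls.txt.metadata.json.
    sidecar = path.with_name(path.name + ".metadata.json")
    # Assumption: the metadata is a flat JSON object of keys and values.
    sidecar.write_text(json.dumps(metadata, indent=2, sort_keys=True))
    return sidecar
```

Inside a job, a call such as `write_with_metadata(Path("/output/ls.txt"), listing, {"stage": "eval"})` would place both files in the default output path; the key `stage` is only an example.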
Working with preloaded data and persistent paths
MissingLink supports the import of preloaded data.
While the MissingLink Data Management feature is very useful for the heavy lifting of data, some users might already have the data preloaded on their resource management servers, whether by using a custom AMI in AWS, local NFS, or another method.
Use the --persistent-path SOURCE TARGET flag in such a case. A persistent path is a mount from
SOURCE path on the server that is hosting the job execution to the
TARGET path of the docker container that executes the job.
$ ml run xp --persistent-path '/mnt/external/input' '/input' --persistent-path '/mnt/external/results' '/results'
This command exposes two paths from your external storage:
- /mnt/external/input becomes available as /input inside the container.
- /mnt/external/results becomes available as /results inside the container.
This feature simply gives you access to the path. It is up to you to:
- Make sure you don't overwrite any files.
- Track changes made to the files.
- Track what data was available to every job.
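One lightweight way to handle this tracking yourself is to record a checksum manifest of the persistent path at the start and end of each job. The sketch below is an illustration only, not a MissingLink feature; `snapshot` and `changed_files` are hypothetical helper names.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def snapshot(root: Path) -> Dict[str, str]:
    """Map every file under root (by relative path) to its SHA-256 digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # read_bytes loads the whole file; hash in chunks for very large files.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def changed_files(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Return the files that were added, removed, or modified between two snapshots."""
    names = before.keys() | after.keys()
    return sorted(name for name in names if before.get(name) != after.get(name))
```

Saving the manifest taken at the start of a job (for example as JSON in an output path) records which data was available to that job, and comparing it with a manifest taken at the end shows what the job changed.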
Getting data from other sources
As you can provide multiple commands, scripts and any executable, you are free to load and persist your data to and from various sources:
- For ftp or http: You can use wget, ftp commands and others to download the data.
- For sftp or rsync: The default organization key or the certificate that is specified using
--git-identity PATH will be available as the default
- For preloaded data in the docker image: When working with custom images you can add arbitrary data to the docker image and the data will be available to all jobs started with that image.
When submitting a job you can provide environment variables in standard or encrypted mode:
- Standard: Environment variables are available to MissingLink (to your organization only!). This means we will show them as part of the job and experiment information and in comparisons, and when running in AWS the
clear environment variables will be used as instance tags. Unless the environment variable contains sensitive data such as passwords or tokens, you should use this option, as it gives you clarity and comparability. Standard environment variables can also be stored in the recipe file. You can provide such variables using the
--env KEY VALUE parameter.
- Encrypted: Encrypted environment variables are encrypted before leaving your server and are unreadable by MissingLink. Secure environment variables can't be seen in your dashboard or the dashboard of others in your organization and they are not used as instance tags. Secure environment variables are submitted using the
--secure-env KEY VALUE parameter that can be provided more than once.
For more information regarding MissingLink's encryption procedures, see Using confidential data.
Encrypted environment variables cannot be stored in the recipe file. It is recommended that you pass their values from environment variables to avoid having them stored in your bash history file, for example:
$ ml run xp --secure-env password $PASSWORD
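When jobs are submitted from a script rather than an interactive shell, the same idea can be applied by reading the secret values from the process environment. The sketch below only builds the argument list; `secure_env_args` is a hypothetical helper, and the only MissingLink flag it relies on is --secure-env KEY VALUE as described above.

```python
import os
from typing import List, Mapping


def secure_env_args(
    secrets: Mapping[str, str],
    environ: Mapping[str, str] = os.environ,
) -> List[str]:
    """Build --secure-env arguments from {job key: local environment variable name}."""
    args: List[str] = []
    for key, variable in secrets.items():
        if variable not in environ:
            # Fail early instead of submitting a job with an empty secret.
            raise KeyError(f"environment variable {variable} is not set")
        # The flag can be repeated, once per secret.
        args += ["--secure-env", key, environ[variable]]
    return args
```

The result can be appended to the rest of the command, for example `["ml", "run", "xp"] + secure_env_args({"password": "PASSWORD"})`, and passed to `subprocess.run` as a list so the secret never appears in a shell history file. Note that command-line arguments can still be visible to other users on the same machine through the process list.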