Data Methods

data.add

as_api().data.add(volume_id, files=None, commit=None, no_progressbar=False)

Add files to the staging area of the data volume and store them in the volume's storage.

Note

The method adds data to the index of the data volume and not to the metadata.

Parameters

  • volume_id: Integer. Volume ID. To get the list of volumes, including their IDs and names, call data.list.
  • files: List of strings. Names of files to add. Specify the full path for each.
  • commit: Optional. String. Indicates that after the add is complete, the new data points should be committed to a new version.
  • no_progressbar: Optional. Boolean. Set to True to hide the progress bar during the add process. Default is False (the progress bar is shown).

data.create

as_api().data.create(org, display_name=None, description=None, bucket=None, linked=False, shared_storage_volume_id=None)

Create a data volume with the specified display name. The data volume will be attached to the specified organization.

Parameters

  • org: String. Organization to use.

  • display_name: Optional. String. Name to show in the display.

  • description: Optional. String. More detailed description of the data volume.
  • bucket: Optional. String. Name of a private bucket. This parameter is optional only if you have specified a value for shared_storage_volume_id. Specify the bucket name using the following syntax:

    • For Google cloud: gs://YourBucketName
    • For Amazon S3: s3://YourBucketName
    • For Azure storage: az://{storage_account_name}.{container_name}
    • For local storage: file://path
  • linked: Optional. Boolean. Whether to create the data volume in linked mode (True) or embedded mode (False). Default is False.

    • In embedded mode (the default), MissingLink copies all the data during sync and manages the storage in the user-assigned storage bucket.
    • In linked mode, MissingLink does not duplicate the data but stores only links to the data during sync. In this mode, the user is responsible for not deleting or modifying files after they have been synced to the data volume.
  • shared_storage_volume_id: Optional. Integer. ID of an existing volume whose storage the new data volume will use.
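
The bucket syntax listed above can be sketched as a small helper that builds the string for the bucket parameter. This is only an illustration: the function name and the provider labels ("gcs", "s3", "azure", "local") are hypothetical and are not part of the MissingLink SDK.

```python
def bucket_uri(provider, name, storage_account=None):
    """Return a bucket string in the syntax listed for data.create.

    `provider` uses hypothetical labels chosen for this sketch.
    """
    if provider == "gcs":
        return "gs://{}".format(name)
    if provider == "s3":
        return "s3://{}".format(name)
    if provider == "azure":
        if not storage_account:
            raise ValueError("Azure needs a storage account name")
        # Documented form: az://{storage_account_name}.{container_name}
        return "az://{}.{}".format(storage_account, name)
    if provider == "local":
        # Documented form: file://path
        return "file://{}".format(name)
    raise ValueError("unknown provider: {!r}".format(provider))


google_bucket = bucket_uri("gcs", "YourBucketName")
azure_bucket = bucket_uri("azure", "images", storage_account="myaccount")
```

The resulting string would then be passed as the bucket argument of data.create.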

data.commit

as_api().data.commit(volume_id, message=None, isolation_token=None)

Commit files that are in the staging area to a version of the specified data volume.

Parameters

  • volume_id: Integer. Volume ID. To get the list of volumes, including their IDs and names, call data.list.
  • message: Optional. String. Message to attach to the commit.
  • isolation_token: Optional. String. Token obtained from an isolated sync.
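
A minimal sketch of staging files and then committing them, using only the signatures documented on this page. The api argument stands for the object returned by as_api(); because this reference does not show how to import it, the sketch runs against a hypothetical recorder stub. The volume ID, file path, and message are made-up values.

```python
def add_and_commit(api, volume_id, files, message):
    """Stage files with data.add, then commit the staging area."""
    # Signatures as documented here:
    #   data.add(volume_id, files=None, commit=None, no_progressbar=False)
    #   data.commit(volume_id, message=None, isolation_token=None)
    api.data.add(volume_id, files=files, no_progressbar=True)
    api.data.commit(volume_id, message=message)


class Recorder:
    """Hypothetical stand-in for the SDK object; it only records calls."""

    def __init__(self, log, prefix=""):
        self._log = log
        self._prefix = prefix

    def __getattr__(self, name):
        # Build dotted names such as "data.add" as attributes are accessed.
        full = self._prefix + "." + name if self._prefix else name
        return Recorder(self._log, full)

    def __call__(self, *args, **kwargs):
        self._log.append((self._prefix, args, kwargs))


calls = []
add_and_commit(Recorder(calls), 123, ["/data/cat.jpg"], "add cat image")
for name, args, kwargs in calls:
    print(name, args, kwargs)
```

With the real SDK, the stub would be replaced by the object that as_api() returns.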

data.metadata.add

as_api().data.metadata.add(volume_id, files=None, data=None, data_point=None, data_file=None, property=None, property_int=None, property_float=None, update=True, no_progressbar=False, data_path=None)

Attach metadata to files that are already in the data volume, or add stand-alone metadata.

Parameters

  • volume_id: Integer. Volume ID. To get the list of volumes, including their IDs and names, call data.list.
  • files: Optional. List of strings. Paths of the files to tag with the metadata.
  • data: Optional. String. Metadata that should be tagged to the files that are being added. The metadata must be passed as a JSON structure.
  • data_point: Optional. List of strings. Specific data point that the metadata should be tagged to.
  • data_file: Optional. String. Filepath of a JSON file that describes to which data points to add metadata and the metadata that you wish to add.
  • property: Optional. A list, whose members are Tuples of two values: the first is the property name (string) and the second is the property value (string).
  • property_int: Optional. A list, whose members are Tuples of two values: the first is the property name (string) and the second is the property value (integer).
  • property_float: Optional. A list, whose members are Tuples of two values: the first is the property name (string) and the second is the property value (float).
  • update: Optional. Boolean. Determines how conflicts are resolved when metadata is added to a data point that already has metadata in the staging version.

    Options are: update (True), which merges the new metadata into the existing metadata and overwrites conflicting values, or replace (False), which replaces the existing metadata. Default is True.

    Note

    The parameter applies only to uncommitted data in the staging area, because data already committed to a version is immutable.

  • no_progressbar: Optional. Boolean. Set to True to hide the progress bar during the add process. Default is False (the progress bar is shown).

  • data_path: Optional. String. Path to the data.
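
The property, property_int, and property_float parameters each expect a list of (name, value) tuples. The following sketch shows one way to build those lists from a flat dictionary, and the equivalent JSON string for the data parameter. The helper name split_properties and the sample metadata are assumptions made for illustration.

```python
import json


def split_properties(metadata):
    """Split a flat dict into (property, property_int, property_float) lists."""
    props, props_int, props_float = [], [], []
    for name, value in metadata.items():
        # bool is a subclass of int in Python, so reject it before the int check;
        # this page documents no boolean property parameter.
        if isinstance(value, bool):
            raise TypeError("boolean values are not covered: " + name)
        if isinstance(value, int):
            props_int.append((name, value))
        elif isinstance(value, float):
            props_float.append((name, value))
        elif isinstance(value, str):
            props.append((name, value))
        else:
            raise TypeError("unsupported value type for " + name)
    return props, props_int, props_float


metadata = {"breed": "poodle", "age": 3, "weight": 6.5}
prop, prop_int, prop_float = split_properties(metadata)

# Alternative: the same metadata serialized as JSON for the data parameter.
data = json.dumps(metadata, sort_keys=True)
```

The three lists could then be passed as property=prop, property_int=prop_int, and property_float=prop_float.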

data.sync

as_api().data.sync(volume_id, data_path, commit=None, no_progressbar=False, isolated=False)

Sync data to the specified data volume.

Notes

  • Ensure that you create the .metadata.json files in the same folder as your current dataset. The JSON file contains a flat dictionary of attributes that can have only basic type values (string, number, boolean).
  • MissingLink will recursively add every file that is found in the directory or subdirectories of the path provided.
  • The data.sync method syncs only the changes that are not yet in the data volume. For example, if you sync a directory once and then change one file and sync again, only the changed file will be uploaded to the data volume.
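
The flat-dictionary rule for metadata files can be checked before syncing. This is a sketch under assumptions: the helper names are hypothetical, and the file name .metadata.json is taken literally from the note above, so the exact naming that MissingLink expects per file may differ.

```python
import json
import os
import tempfile

# Basic value types allowed by the note above: string, number, boolean.
BASIC_TYPES = (str, int, float, bool)


def check_flat(attributes):
    """Raise ValueError unless every value is a string, number, or boolean."""
    for key, value in attributes.items():
        if not isinstance(value, BASIC_TYPES):
            raise ValueError(
                "{!r} has non-basic type {}".format(key, type(value).__name__)
            )
    return attributes


def write_metadata(folder, attributes, filename=".metadata.json"):
    """Write a validated flat dictionary next to the dataset files."""
    path = os.path.join(folder, filename)
    with open(path, "w") as handle:
        json.dump(check_flat(attributes), handle)
    return path


dataset_dir = tempfile.mkdtemp()  # stands in for the real dataset folder
metadata_path = write_metadata(
    dataset_dir, {"class": "cat", "width": 640, "augmented": False}
)
```

A nested value such as a list or dictionary makes check_flat raise ValueError, so the problem surfaces before data.sync is called.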

Parameters

  • volume_id: Integer. Volume ID. To get the list of volumes, including their IDs and names, call data.list.
  • data_path: String. Path to the data to add.
  • commit: Optional. String. Indicates that after the sync is complete, the new data points should be committed to a new version.

    Note

    The commit takes all uncommitted changes into the same version and not only the changes in the sync command.

  • no_progressbar: Optional. Boolean. Set to True to hide the progress bar during the sync process. Default is False (the progress bar is shown).

  • isolated: Optional. Boolean. Default is False. In an isolated sync, the folder is synced and the data committed without passing the regular staging phase. Files enter an isolated staging area that is not shown in the web console.

data.clone

as_api().data.clone(volume_id, dest_folder, dest_file="[email protected]", query=None, delete=False, batch_size=-1, no_progressbar=False, isolation_token=None)

Clone data from the specified data volume.

If you do not specify a data volume:

  • If only one data volume is found, MissingLink uses it.
  • If there is more than one data volume, a list of those found is shown and you are prompted to choose one before the command is executed.

Parameters

  • volume_id: Integer. Volume ID. To get the list of volumes, including their IDs and names, call data.list.
  • dest_folder: String. Filepath to clone the filtered data to. Can be used with special system variables.
  • dest_file: Optional. String. File to clone the filtered data to. Can be used with special system variables. Default is [email protected].
  • query: Optional. String. Query that filters the data points to clone.
  • delete: Optional. Boolean. Should the clone action delete all existing data found under the specified destination folder? Default is False.

    Warning

    Exercise caution when using this action, as it can potentially delete things you did not mean to delete. There is no way to revert this action.

  • batch_size: Optional. Integer. Default is -1, meaning the clone process is asynchronous.

  • no_progressbar: Optional. Boolean. Set to True to hide the progress bar during the clone process. Default is False (the progress bar is shown).
  • isolation_token: Optional. String. Token obtained from an isolated sync.

The data.clone method can automatically translate several special MissingLink variables. These keywords can be used in the dest_folder and dest_file parameters.

They are detailed below. An example follows.

  • [email protected]: Replaced by the phase folder that the file should be copied to, that is, the train data points will be cloned to the train folder, validation data points to the validation folder and test data points will be cloned to the test folder, along with their respective names and extensions.

    Note

    You can also use [email protected] as a shortcut for [email protected].

  • [email protected]: Replaced by the directory or directories of the file inside the data volume.

  • [email protected]: Replaced by the hash of the file.
  • [email protected]_name: Replaced by the name of the file, without its extension.
  • [email protected] or [email protected]: Replaced by the extension of the file.
  • [email protected]: Replaced by the [email protected]_name + [email protected] of the file.
  • [email protected]_field: Replaced by the value of the metadata field. For example, if the user has assigned the metadata breed:poodle to a data point, $breed translates to poodle for that data point.

    Example

    Assuming the data is tagged according to class:

    • 1.jpg [class:cat]
    • 2.jpg [class:dog]

    and you want to clone the data so that the files in the target are organized in folders named by class, so:

    • \dog\2.jpg
    • \cat\1.jpg

    write the following code:

    dest_folder = '$class'
    as_api().data.clone(<volume_id>, dest_folder)
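
To make the mapping in this example concrete, the sketch below reproduces the substitution of $class locally. It uses Python's string.Template as a stand-in and is not MissingLink's actual implementation; the helper name expand_destination is hypothetical, and forward slashes are used for the resulting paths.

```python
import posixpath
from string import Template


def expand_destination(dest_folder, file_name, metadata):
    """Replace $field placeholders in dest_folder with metadata values."""
    folder = Template(dest_folder).substitute(metadata)
    return posixpath.join(folder, file_name)


# The tagged files from the example above.
tagged = {
    "1.jpg": {"class": "cat"},
    "2.jpg": {"class": "dog"},
}

targets = sorted(
    expand_destination("$class", name, meta) for name, meta in tagged.items()
)
print(targets)
```

Each file ends up in a folder named after the value of its class field, matching the layout shown in the example.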
    

data.query

as_api().data.query(volume_id, query=None, batch_size=-1, as_dict=False, silent=False)

Retrieve the metadata of data points that meet the query criteria.

The metadata are aggregated into a single file, as a JSON structure.

Parameters

  • volume_id: Integer. Volume ID. To get the list of volumes, including their IDs and names, call data.list.
  • query: Optional. String. The query to execute.
  • batch_size: Optional. Integer. Number of data points in each batch of data that is retrieved. Default is -1, meaning the query process is asynchronous.
  • as_dict: Optional. Boolean. Present information as a dictionary or as a list? Options are as_dict (True) and as_list (False). Default is False.
  • silent: Optional. Boolean. Suppress printing of progress? Default is False.

data.list

as_api().data.list()

List the data volumes across all organizations of which the user is a member.

data.validate

as_api().data.validate(volume_id, data_path, no_progressbar=False)

Validate data without syncing it.

This action works like data.sync, except that it does not sync any files: it only goes over them and validates the metadata files.

Parameters

  • volume_id: Integer. Volume ID. To get the list of volumes, including their IDs and names, call data.list.
  • data_path: String. Path to the data.
  • no_progressbar: Optional. Boolean. Set to True to hide the progress bar during the validation process. Default is False (the progress bar is shown).