Query Syntax

This page describes the syntax required to build queries you run on data volumes in the MissingLink dashboard.

The syntax rules are a subset of the Lucene Query Syntax.

A data volume query looks like this:

foo1:bar1 AND foo2:bar2 @sample:0.1

You can query any attribute in the data volume using wildcards.

  • An asterisk may be used to specify any number of characters. For example, to return "apple", you can specify a*, ap*, or a*e.

  • A question mark may be used to represent a single character, anywhere in the word. For example, to return "apple", you can specify ?pple, a?ple, or ap?le.
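These wildcard semantics mirror Unix-style shell globbing, so you can sanity-check a pattern locally before running a query. A minimal sketch using Python's standard fnmatch module:

```python
import fnmatch

# '*' matches any run of characters, '?' matches exactly one character.
for pattern in ["a*", "ap*", "a*e", "?pple", "a?ple", "ap?le"]:
    # Every one of these patterns matches the word "apple".
    print(pattern, "->", fnmatch.fnmatch("apple", pattern))
```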

Logical query operators

A sample query for each of the logical operators is provided below.


AND operator

Used to chain two subqueries where data must satisfy both queries to be returned.

foo1:bar1 AND foo2:bar2


OR operator

Used to chain two subqueries where data must satisfy either one of the two queries to be returned.

foo1:bar1 OR foo2:bar2


NOT operator

Used to return data that does not satisfy the logical condition.

NOT foo1:bar1

Range searches

Range queries match data whose field values are between the lower and upper bounds that are specified.

Range queries can be inclusive or exclusive of the upper and lower bounds. Inclusive range queries are denoted by square brackets. Exclusive range queries are denoted by curly brackets.

An alternative method for specifying range is to use the following operators:

Operator                       Example   Equivalent
Less than (<)                  a:<10     a:{* TO 10}
Greater than (>)               a:>10     a:{10 TO *}
Less than or equal to (<=)     a:<=10    a:[* TO 10]
Greater than or equal to (>=)  a:>=10    a:[10 TO *]


  • classes:[1 TO 10]

    returns all classes from 1 to 10, inclusive

  • classes:{1 TO 10}

    returns all classes from 1 to 10, exclusive of the bounds (2 to 9 for integer classes)
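As an illustrative sketch (the helper name is hypothetical, not part of the query language), the comparison forms in the table above can be rewritten mechanically into their bracket equivalents:

```python
import re

def expand_comparison(query: str) -> str:
    """Rewrite field:<N, field:>N, field:<=N, field:>=N into TO-range form."""
    rules = [
        (r"^(\w+):<=(\S+)$", r"\1:[* TO \2]"),   # inclusive upper bound
        (r"^(\w+):>=(\S+)$", r"\1:[\2 TO *]"),   # inclusive lower bound
        (r"^(\w+):<(\S+)$",  r"\1:{* TO \2}"),   # exclusive upper bound
        (r"^(\w+):>(\S+)$",  r"\1:{\2 TO *}"),   # exclusive lower bound
    ]
    for pattern, replacement in rules:
        if re.match(pattern, query):
            return re.sub(pattern, replacement, query)
    return query  # not a comparison query; leave unchanged

print(expand_comparison("a:<10"))   # a:{* TO 10}
print(expand_comparison("a:>=10"))  # a:[10 TO *]
```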

Date range searches

For querying date ranges, for example, items earlier than 2019-05-27_18:31:11, a date string cannot be used. It must first be converted to a UNIX epoch timestamp, in this instance, 1558971071. Once it is a number, regular range queries can be performed:


  • timestamp_epoch:[1558920000 TO 1558971071]

    returns items that fall in the time range between 1558920000 and 1558971071

  • timestamp_epoch:[1558971071 TO *]

    returns items that fall in the time range starting from 1558971071 up to the most recent found time

  • timestamp_epoch:[* TO 1558971071]

    returns items that fall in the time range starting from the earliest found time up to 1558971071
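A date string can be converted to an epoch timestamp with Python's standard datetime module. Note that the example value 1558971071 above corresponds to 2019-05-27_18:31:11 in a UTC+3 time zone, so the offset below is an assumption about the original example's locale:

```python
from datetime import datetime, timezone, timedelta

def to_epoch(date_str: str, utc_offset_hours: int = 0) -> int:
    """Convert a 'YYYY-MM-DD_HH:MM:SS' string to a UNIX epoch timestamp."""
    dt = datetime.strptime(date_str, "%Y-%m-%d_%H:%M:%S")
    # Attach the assumed UTC offset so the conversion is unambiguous.
    dt = dt.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return int(dt.timestamp())

print(to_epoch("2019-05-27_18:31:11", utc_offset_hours=3))  # 1558971071
```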

List searches

You can search for any attribute that appears in a given list.


The syntax requires you to enclose the options within "(" and ")" and to use space delimiters to separate the options.


filename:(name1.png name2.png name3.png)
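As a sketch, building such a list query from a Python sequence is a simple join (the helper name is hypothetical, not part of the query language):

```python
def list_query(field, values):
    """Build a field:(v1 v2 ...) list query from a sequence of values."""
    return f"{field}:({' '.join(values)})"

print(list_query("filename", ["name1.png", "name2.png", "name3.png"]))
# filename:(name1.png name2.png name3.png)
```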

Searching inside fields that contain arrays and dictionaries

You can search more complex structures, such as arrays and dictionaries.


1) To query data points by a field of type array that contains at least the value a.

Suppose the field is named array_field and has the value [a,b,c]; run:

array_field:a

2) To query data points by a field of type dictionary that contains at least the key-value pair b:value2.

Suppose the dictionary is named json_field and contains the pairs a:value1, b:value2, c:value3; run:

json_field.b:value2

3) To query for empty array fields, or missing values in a dictionary:

  • For an empty array field, run array_field.:None.
  • For a key in a dictionary-type field without a value, run: json_field.key:None.

Special query operators

@path operator

Used to return file paths.

To query all the jpg files in the data volume located in subfolders of "\folder\", use the following query:

foo1:bar1 @path:\folder\*\*.jpg

@sample operator

When used together with a float between 0 and 1, indicates the approximate size of the sample data you wish to return.

The following sample command gives you approximately 10% of the data returned from the query.

foo1:bar1 @sample:0.1

@version operator

The commit version ID of the data volume to be queried. When you provide a commit version ID, the query runs against the data up to and including that commit, that is, it includes the data points added in every commit up to and including the specified one.

foo1:bar1 @version:yourCommitVersionID


Note that if a version ID is not specified, the query will default to querying the staging version.

@size operator

Used to return files of the size queried, in bytes.

The following sample command returns the files whose size is larger than 100000 bytes:

foo1:bar1 @size:>100000

Output format operators

@split operator

Splits the data into two or three subsets: train, test, and (optionally) validation.

There are two methods of using the @split operator:

  • Split by percentage

    Specify two or three floats to split the data returned from the query into train, test, and (optionally) validation datasets. Make sure that the provided floats sum to no more than 1.

    # @split:train:test[:validation]
    foo1:bar1 @split:0.7:0.3
    foo1:bar1 @split:0.5:0.4:0.1


    • If you don't specify a @split, the data is returned in the same structure that it was uploaded to the data volume.
    • If you require only train and validation subsets, specify 0.0 for test data, so:
     `foo1:bar1 @split:0.7:0.0:0.3`
  • Split based on a specific, queryable metadata attribute

    In some datasets, the test data is curated and there is metadata that indicates if the data point belongs to the test dataset or not. To split based on this attribute, you need to add the attribute to the queryable metadata, then provide the attribute name to the @split operator.

    foo1:bar1 @split:origSplit
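The percentage form can be validated and formatted before it is sent. A minimal sketch (the helper name is hypothetical, and the server-side splitting itself is not replicated here):

```python
def split_clause(train, test, validation=None):
    """Format a @split clause, checking the fractions sum to at most 1."""
    parts = [train, test] + ([validation] if validation is not None else [])
    # Allow a tiny tolerance for floating-point rounding.
    if any(p < 0 for p in parts) or sum(parts) > 1 + 1e-9:
        raise ValueError("split fractions must be non-negative and sum to at most 1")
    return "@split:" + ":".join(str(p) for p in parts)

print(split_clause(0.7, 0.3))       # @split:0.7:0.3
print(split_clause(0.5, 0.4, 0.1))  # @split:0.5:0.4:0.1
```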

@group_by operator

Used to group several data files into one data point, such as an image with its annotation file. The @group_by operator takes a specific metadata field to group by (usually this field should contain the entity ID).

foo1:bar1 @group_by:fooID @split:0.7:0.2:0.1


The @group_by operator helps to ensure that complementary data will always be together, regardless of whether they have been operated on by @split, @limit, or @sample.

To effectively use the @group_by query operator, you should add metadata to the complementary data that you can use to group them together. Using the above sample command as an example, you should add a fooID metadata to the data such that complementary data have the same fooID.

@datapoint_by operator

This operator is used for data iterators.

Using this operator, you can control which files are treated as a single data point. Each fetch of the iterator returns a vector representing the batch; each cell in that vector is itself a vector of the files that make up one data point. The @datapoint_by operator takes a specific metadata field that defines the data point (usually this field should contain the entity ID).

foo1:bar1 @datapoint_by:fooID 


If you are using both @group_by and @datapoint_by operators, make sure that the metadata fields used by the different operators are not in conflict. A conflict occurs when the value of the metadata field used for the @datapoint_by corresponds to at least two values of the metadata field used for @group_by.
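The conflict described above can be detected mechanically if the metadata is available locally. A sketch, assuming records are plain dicts and the field names (fooID, itemID) are hypothetical:

```python
from collections import defaultdict

def find_conflicts(records, datapoint_field, group_field):
    """Return datapoint values whose records span more than one group value."""
    groups_per_datapoint = defaultdict(set)
    for record in records:
        groups_per_datapoint[record[datapoint_field]].add(record[group_field])
    return {dp for dp, groups in groups_per_datapoint.items() if len(groups) > 1}

# 'dp2' is a conflict: its records map to two different group values.
records = [
    {"fooID": "g1", "itemID": "dp1"},
    {"fooID": "g1", "itemID": "dp2"},
    {"fooID": "g2", "itemID": "dp2"},
]
print(find_conflicts(records, "itemID", "fooID"))  # {'dp2'}
```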

Control operators

@seed operator

An integer seed value that can be specified to ensure that the random seed of the query is fixed. By default, the @seed is set to 1337.

foo1:bar1 @seed:213


  • @seed affects several other operators that have a random action such as: @limit, @split, or @sample.

  • If you want to generate a different @sample, use a different @seed value.

  • The @seed value is saved for each query in the system, thus enabling the reproduction of the exact data points whenever you run the query.
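The role of a fixed seed can be illustrated with Python's random module (this only demonstrates the reproducibility principle, not how the server actually samples):

```python
import random

items = list(range(100))

def sample(seed, fraction=0.1):
    """Draw a reproducible pseudo-random sample of the items."""
    rng = random.Random(seed)
    return rng.sample(items, int(len(items) * fraction))

print(sample(1337) == sample(1337))  # True: the same seed reproduces the sample
print(sample(213))                   # a different seed draws a different sample
```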