The MissingLink query syntax uses a subset of the Lucene Query Syntax.
A query looks like this:
You can query any attribute in the data volume using wildcards.
An asterisk may be used to specify any number of characters. For example, to return "apple", you can specify
A question mark may be used to represent a single character, anywhere in the word. For example, to return "apple", you can specify
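The two wildcards behave much like shell-style patterns, so their semantics can be tried out locally with Python's standard `fnmatch` module. This is only a sketch of how `*` and `?` match; it does not call MissingLink, and the patterns shown are hypothetical examples rather than ones taken from this documentation.

```python
from fnmatch import fnmatchcase

def matches(value: str, pattern: str) -> bool:
    """Return True if value matches a pattern built from * and ? wildcards."""
    return fnmatchcase(value, pattern)

# '*' stands for any number of characters (in fnmatch, including none).
print(matches("apple", "app*"))   # True
print(matches("apple", "a*e"))    # True

# '?' stands for exactly one character.
print(matches("apple", "appl?"))  # True
print(matches("apple", "app?"))   # False: a single '?' cannot cover "le"
```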
Logical Query Operators
Due to the complex nature of the queries, we have provided a sample query for each of the query operators below.
Used to chain two subqueries where data must satisfy both queries to be returned.
foo1:bar1 AND foo2:bar2
Used to chain two subqueries where data must satisfy either one of the two queries to be returned.
foo1:bar1 OR foo2:bar2
Used to return data that does not satisfy the logical condition.
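To make the semantics of the three logical operators concrete, the sketch below applies the equivalent Python conditions to a small in-memory list. The records are hypothetical and reuse the placeholder names foo1 and foo2 from the sample queries; the code models the selection logic only and is not how MissingLink evaluates queries.

```python
# Hypothetical metadata records for four data points.
records = [
    {"foo1": "bar1", "foo2": "bar2"},
    {"foo1": "bar1", "foo2": "other"},
    {"foo1": "other", "foo2": "bar2"},
    {"foo1": "other", "foo2": "other"},
]

def select(predicate):
    """Return the records for which predicate(record) is true."""
    return [r for r in records if predicate(r)]

# Both subqueries must hold.
both = select(lambda r: r["foo1"] == "bar1" and r["foo2"] == "bar2")
# Either subquery may hold.
either = select(lambda r: r["foo1"] == "bar1" or r["foo2"] == "bar2")
# The condition must not hold.
negated = select(lambda r: not r["foo1"] == "bar1")

print(len(both), len(either), len(negated))  # 1 3 2
```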
Special Query Operators
Used to return file paths.
To query all the jpg files in the data volume whose paths start with "\folder\", use the following query:
When used together with a float between 0 and 1, indicates the approximate fraction of the queried data that you wish to return as a sample.
The following sample command gives you approximately 10% of the data returned from the query.
The commit version ID of the data volume to be queried. When you provide a commit version ID, the query covers the data up to and including that commit; that is, it includes the data points added in the specified commit and in every commit before it.
Note that if a version ID is not specified, the query will default to querying the staging version.
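The "up to and including" behavior can be pictured as accumulating the data points of each commit in order until the requested commit is reached. The sketch below is a simplified model under that assumption: the commit IDs and file names are hypothetical, and it only tracks additions, as described above.

```python
# Hypothetical commit history, oldest first: (commit ID, data points added).
commits = [
    ("c1", ["a.jpg", "b.jpg"]),
    ("c2", ["c.jpg"]),
    ("c3", ["d.jpg"]),
]

def data_at(version_id):
    """Return the data points added up to and including version_id."""
    visible = []
    for commit_id, added in commits:
        visible.extend(added)
        if commit_id == version_id:
            return visible
    raise KeyError(f"unknown commit version ID: {version_id}")

print(data_at("c2"))  # ['a.jpg', 'b.jpg', 'c.jpg']
```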
Output Format Operators
Splits the data into train, test, and validation subsets.
There are two methods of using the @split operator:
Split by percentage
Specify a set of three floats to split the data returned from the query into train, test and validation datasets. Make sure that the provided floats sum to no more than 1.
# @split:train:test[:validation]
foo1:bar1 @split:0.5:0.4:0.1
If you don't specify a @split, the data is returned in the same structure that it was uploaded to the data volume.
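If you build queries programmatically, it can help to check the split fractions before sending the query. The helper below is a hypothetical sketch, not part of any MissingLink SDK: it formats an @split clause in the form shown above and rejects fractions that fall outside 0 to 1 or that sum to more than 1. It treats the validation fraction as optional, following the square brackets in the template.

```python
import math

def split_clause(train, test, validation=None):
    """Build an @split clause after validating the fractions."""
    parts = [train, test]
    if validation is not None:
        parts.append(validation)
    if any(not 0 <= p <= 1 for p in parts):
        raise ValueError("each fraction must be between 0 and 1")
    # Small tolerance so that floating-point rounding does not reject 1.0.
    if math.fsum(parts) > 1 + 1e-9:
        raise ValueError("fractions must sum to no more than 1")
    return "@split:" + ":".join(f"{p:g}" for p in parts)

print("foo1:bar1 " + split_clause(0.5, 0.4, 0.1))
# foo1:bar1 @split:0.5:0.4:0.1
```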
Split based on a specific, queryable metadata attribute
In some datasets, the test data is curated and there is metadata that indicates whether a data point belongs to the test dataset. To split based on this attribute, add the attribute to the queryable metadata, then provide the attribute name to the @split operator.
Used to group several data files into one data point, such as an image with its annotation file. The @group_by operator takes a specific metadata field to group by (usually this field should contain the entity ID).
foo1:bar1 @group_by:fooID @split:0.7:0.2:0.1
The @group_by operator helps to ensure that complementary data always stays together, even after it has been processed by @split, @limit or @sample.
To use the @group_by query operator effectively, add a metadata field to the complementary data that can be used to group it. Using the above sample command as an example, add a fooID metadata field to the data so that complementary files share the same fooID value.
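The effect of grouping before splitting can be sketched locally. In the example below, files that share a fooID value are first collected into groups, and the split is then applied to whole groups, so an image and its annotation file never land in different subsets. The file names and the shuffle-based split are assumptions for illustration and are not MissingLink's implementation.

```python
import random
from collections import defaultdict

# Hypothetical files: each image and its annotation share a fooID value.
files = [
    {"path": "img1.jpg", "fooID": "1"},
    {"path": "img1.xml", "fooID": "1"},
    {"path": "img2.jpg", "fooID": "2"},
    {"path": "img2.xml", "fooID": "2"},
    {"path": "img3.jpg", "fooID": "3"},
    {"path": "img3.xml", "fooID": "3"},
]

def group_files(items, field):
    """Collect file paths that share the same value of a metadata field."""
    groups = defaultdict(list)
    for item in items:
        groups[item[field]].append(item["path"])
    return dict(groups)

def split_groups(groups, train_fraction, seed=1337):
    """Split whole groups (never individual files) into two subsets."""
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    cut = round(len(keys) * train_fraction)
    train = {k: groups[k] for k in keys[:cut]}
    rest = {k: groups[k] for k in keys[cut:]}
    return train, rest

groups = group_files(files, "fooID")
train, rest = split_groups(groups, 0.7)
print(len(train), len(rest))  # 2 1
```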
This operator is used for data iterators.
Using this operator, you can control which files are considered a single data point. Each fetch from the iterator returns a batch as a vector, and each cell in that vector is itself a vector of the files that make up one data point. The @datapoint_by operator takes a specific metadata field to create the data point (usually this field should contain the entity ID).
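The nested shape of a batch can be illustrated with plain Python lists. The generator below is a hypothetical stand-in for a data iterator, not the MissingLink iterator API: it builds one list of files per value of the chosen metadata field and yields those lists in batches.

```python
from itertools import islice

def iter_batches(items, field, batch_size):
    """Yield batches; each batch is a list of data points (lists of files)."""
    points = {}
    for item in items:
        points.setdefault(item[field], []).append(item["path"])
    data_points = iter(points.values())
    while True:
        batch = list(islice(data_points, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical files keyed by a fooID metadata field.
files = [
    {"path": "img1.jpg", "fooID": "1"},
    {"path": "img1.xml", "fooID": "1"},
    {"path": "img2.jpg", "fooID": "2"},
    {"path": "img2.xml", "fooID": "2"},
    {"path": "img3.jpg", "fooID": "3"},
    {"path": "img3.xml", "fooID": "3"},
]

for batch in iter_batches(files, "fooID", batch_size=2):
    print(batch)
# [['img1.jpg', 'img1.xml'], ['img2.jpg', 'img2.xml']]
# [['img3.jpg', 'img3.xml']]
```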
If you are using both @group_by and @datapoint_by operators, make sure that the metadata fields used by the different operators are not in conflict. A conflict occurs when the value of the metadata field used for the @datapoint_by corresponds to at least two values of the metadata field used for @group_by.
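Before relying on both operators, you may want to check your metadata for this kind of conflict. The sketch below assumes the metadata is available locally as a list of dictionaries; the field names pointID and groupID are hypothetical. It reports every data-point value that maps to two or more group values.

```python
from collections import defaultdict

def find_conflicts(items, datapoint_field, group_field):
    """Map each conflicting data-point value to its group values."""
    seen = defaultdict(set)
    for item in items:
        seen[item[datapoint_field]].add(item[group_field])
    return {dp: sorted(groups) for dp, groups in seen.items() if len(groups) > 1}

# Hypothetical metadata: data point "p2" spans two different groups.
metadata = [
    {"pointID": "p1", "groupID": "g1"},
    {"pointID": "p1", "groupID": "g1"},
    {"pointID": "p2", "groupID": "g1"},
    {"pointID": "p2", "groupID": "g2"},
]

print(find_conflicts(metadata, "pointID", "groupID"))  # {'p2': ['g1', 'g2']}
```

An empty result means that every data point belongs to exactly one group, so the two fields are compatible.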
An integer value that fixes the random seed of the query. By default, the @seed is set to 1337.
@seed affects the other operators that have a random action, such as @limit, @split and @sample.
If you want to generate a different @sample, use a different @seed value.
The @seed value is saved for each query in the system, thus enabling the reproduction of the exact data points whenever you run the query.
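The reason a fixed seed makes results reproducible can be shown with Python's standard `random` module. The function below is a sketch of seeded sampling in general, not MissingLink's sampling code: the same seed always selects the same items, and changing the seed changes the selection.

```python
import random

def sample(items, fraction, seed=1337):
    """Keep each item with probability `fraction`, using a seeded generator."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < fraction]

items = list(range(1000))

first = sample(items, 0.1)
second = sample(items, 0.1)
print(first == second)  # True: same seed, same selection

# A different seed draws a different sample of roughly the same size.
other = sample(items, 0.1, seed=42)
print(len(first), len(other))
```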