
Streaming Data Directly Into the Neural Network Training Process Using Native Iterators

As an alternative to cloning data to the training machines with the MissingLink CLI, you can use native iterators. Data is streamed to the training machine during training and cached locally, so training starts immediately and there is no need to wait for the data to be copied to the local disk.

Check here for a full example of training code with native iterators.

  1. Click Copy Clone Command on the query page.


  2. Extract the volume ID and the query string from the clone command.

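If you would rather not copy the two values out of the clone command by hand, the extraction can be sketched in a few lines of Python. This is only an illustration: the command text, the flag names `--volume` and `--query`, and the helper `extract_flag_values` are all hypothetical placeholders, so substitute the flag names that appear in the command you actually copied.

```python
import shlex


def extract_flag_values(command, flags):
    """Return {flag: value} for each requested flag found in a shell command."""
    tokens = shlex.split(command)  # respects the quotes around the query string
    values = {}
    for i, token in enumerate(tokens):
        for flag in flags:
            if token == flag and i + 1 < len(tokens):
                # Form: --flag value
                values[flag] = tokens[i + 1]
            elif token.startswith(flag + "="):
                # Form: --flag=value
                values[flag] = token[len(flag) + 1:]
    return values


# Hypothetical command: the executable, flag names and values are placeholders.
example_command = "some-cli clone --volume 1234567890 --query 'example query text'"
found = extract_flag_values(example_command, ["--volume", "--query"])
volume_id = found["--volume"]
query = found["--query"]
```

`shlex.split` is used instead of `str.split` so that a quoted query containing spaces stays in one piece.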

  3. Pass the volume ID and query string to the training process as command-line arguments:

    import argparse

    # Read the data volume ID and the query string from the command line.
    parser = argparse.ArgumentParser(description='Data Iterator Sample')
    parser.add_argument('--datavolume', required=True)
    parser.add_argument('--query', required=True)
    args = parser.parse_args()
    volume_id = args.datavolume
    query = args.query

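
If it is more convenient to supply the two values through environment variables, one possible arrangement is to use them as defaults for the same command-line arguments. The sketch below assumes the variable names `DATA_VOLUME` and `DATA_QUERY` and the helper name `parse_settings`; all three are hypothetical and not defined by MissingLink.

```python
import argparse
import os


def parse_settings(argv=None, environ=None):
    """Read the volume ID and query from the command line,
    falling back to environment variables when a flag is omitted."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(description='Data Iterator Sample')
    # DATA_VOLUME and DATA_QUERY are hypothetical variable names.
    parser.add_argument('--datavolume', default=environ.get('DATA_VOLUME'))
    parser.add_argument('--query', default=environ.get('DATA_QUERY'))
    args = parser.parse_args(argv)
    if args.datavolume is None or args.query is None:
        # Exits with a usage message, like a missing required argument.
        parser.error('the data volume and the query must both be provided')
    return args.datavolume, args.query
```

In a training script this would be called as `volume_id, query = parse_settings()`. A value given on the command line takes precedence over the environment.
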
  4. Create a data generator:

    def deserialization_callback(filenames, metadatas):
        # if random.randint(0, 100) % 10:
        #     print('filenames %s' % (filenames,))
        #     print('metadatas %s' % (metadatas,))
        # Each data point here has exactly one file and one metadata entry.
        filename, = filenames
        metadata, = metadatas
        # Load the image and apply random augmentation.
        # read_image, datagen and one_hot are assumed to be defined
        # elsewhere in the training code.
        x = read_image(filename)
        x = datagen.random_transform(x)
        # Convert the class number to a one-hot vector.
        y = one_hot(int(metadata['label_index']))
        return x, y

    data_generator = missinglink_project.bind_data_generator(
        volume_id, query, deserialization_callback, batch_size=10
    )


    The deserialization_callback function is called once for each data point. In this example the data comes from CIFAR10, so each call receives a single file in filenames and the corresponding entry in metadatas, which includes a property called label_index.


    • filenames: Tuple with the filenames for the data point.
    • metadatas: Tuple with the metadata for each file in filenames.
    • rtype: Tuple with the inputs of the model. Here, a NumPy array with the image and a one-hot vector for the class.


    • The deserialization_callback is a function that you implement. It is responsible for converting the streamed files into the inputs that the model's input layer expects. Data augmentation should be implemented as part of this function as well.

    • There is no limit to the number of data generators that a process can use. For example, you can use one query for the neural network training and validation data and another query from a different dataset for the test data.
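
As a rough illustration of the callback contract described in this step, here is a self-contained sketch. The names `make_deserialization_callback`, `load_image` and `augment` are hypothetical, and the image loader is passed in as a parameter so the example runs without an imaging library. The one-hot vector is a plain Python list here; real training code would more likely return NumPy arrays.

```python
def one_hot(index, num_classes=10):
    """Return a list of length num_classes with 1.0 at position index."""
    if not 0 <= index < num_classes:
        raise ValueError('class index %d is out of range' % index)
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector


def make_deserialization_callback(load_image, augment=None, num_classes=10):
    """Build a callback for data points that consist of exactly one file."""
    def deserialization_callback(filenames, metadatas):
        # Tuple unpacking raises ValueError unless there is exactly one item.
        filename, = filenames
        metadata, = metadatas
        x = load_image(filename)
        if augment is not None:
            x = augment(x)  # data augmentation happens inside the callback
        y = one_hot(int(metadata['label_index']), num_classes)
        return x, y
    return deserialization_callback
```

Building the callback from a factory keeps the loader and the augmentation step swappable, which makes the callback easy to exercise with stand-in functions before connecting it to real data.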

  5. Create the data iterators from the generator:

    train_generator, test_generator, validation_generator = data_generator.flow()


    • The number of iterators created is determined by the @split operator in the query. For more information about @split, consult the Query Syntax page.

    • Every object yielded by an iterator is a matrix representing a batch of data points, each already processed by the deserialization_callback.

    • If your data point consists of more than one file, you must use the @datapoint_by operator. For more information, consult the Query Syntax page.
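
To picture what one object from an iterator looks like, the following stand-in groups the (x, y) pairs returned by a callback into batches. It is not MissingLink's implementation: the function name `batches` is hypothetical, and details such as whether the real iterators emit a smaller final batch or repeat indefinitely are not specified here.

```python
def batches(data_points, batch_size):
    """Group (x, y) pairs into batches of at most batch_size items."""
    xs, ys = [], []
    for x, y in data_points:
        xs.append(x)
        ys.append(y)
        if len(xs) == batch_size:
            yield xs, ys
            xs, ys = [], []
    if xs:
        # Emit the remaining items as a final, smaller batch.
        yield xs, ys


# Three stand-in data points grouped two at a time.
for batch_x, batch_y in batches([(1, 'a'), (2, 'b'), (3, 'c')], batch_size=2):
    print(len(batch_x), batch_y)
```

With batch_size=10, as in the generator created in step 4, each batch would hold the inputs and labels of up to ten data points.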