Streaming Data Directly Into the Neural Network Training Process Using Native Iterators
Besides cloning data to the training machines using the MissingLink CLI, you can use native iterators. Data is streamed to the training machine during training and is cached locally. Using this method, the training starts immediately and there is no need to wait for data to be copied to the local disk.
See the full example of training code with native iterators.
Click Copy Clone Command in the query page.
Extract the volume ID and the query string from the clone command.
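As a rough sketch of this step, the snippet below pulls `--flag value` pairs out of a copied command string with the standard-library `shlex` module. The sample command, its flag names, and its values are made up for illustration and are not taken from the MissingLink CLI; use whatever the command you actually copied contains.

```python
import shlex


def extract_flags(command):
    """Return a dict of `--flag value` pairs found in a shell command string."""
    # shlex.split honours shell quoting, so a quoted query stays one token.
    tokens = shlex.split(command)
    flags = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith('--')
        if token.startswith('--') and has_value:
            flags[token[2:]] = tokens[i + 1]
            i += 2
        else:
            i += 1
    return flags


# Hypothetical clone command: flag names and values are assumptions.
sample_command = "ml data clone --volume 1234 --query '@split:0.6:0.2:0.2'"
flags = extract_flags(sample_command)
print(flags)
```

The helper only handles the space-separated `--flag value` form; a command that uses `--flag=value` would need an extra split on `=`.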
Pass the volume ID and query string to the training process as command-line arguments:
    import argparse

    parser = argparse.ArgumentParser(description='Data Iterator Sample')
    parser.add_argument('--datavolume', required=True)
    parser.add_argument('--query', required=True)
    args = parser.parse_args()

    volume_id = args.datavolume
    query = args.query
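To try the argument parsing without launching a full training run, the same parser can be given an explicit argument list. This is a minimal sketch: the wrapper function `parse_training_args` is a name introduced here, and the volume ID and query are placeholder values, not real ones.

```python
import argparse


def parse_training_args(argv=None):
    """Parse the data volume ID and query; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description='Data Iterator Sample')
    parser.add_argument('--datavolume', required=True)
    parser.add_argument('--query', required=True)
    return parser.parse_args(argv)


# Placeholder values, not a real volume ID or query.
args = parse_training_args(['--datavolume', '1234', '--query', '@split:0.6:0.2:0.2'])
volume_id = args.datavolume
query = args.query
print(volume_id, query)
```

Note that `argparse` returns both values as strings. On the command line, the query should be quoted so that the shell passes it as a single argument.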
Create a data generator:
    def deserialization_callback(filenames, metadatas):
        # Each data point here has exactly one file and one metadata record.
        filename, = filenames
        metadata, = metadatas

        # Optional debugging output:
        # if random.randint(0, 100) % 10:
        #     print('filename %s' % filename)
        #     print('metadata %s' % metadata)

        # Load the image and apply random augmentation.
        # read_image, datagen and one_hot are defined elsewhere in the training code.
        x = read_image(filename)
        x = datagen.random_transform(x)

        # Convert the class number to a one-hot vector.
        y = one_hot(int(metadata['label_index']))
        return x, y


    data_generator = missinglink_project.bind_data_generator(
        volume_id, query, deserialization_callback, batch_size=10
    )
This function is called for each data point. In this example the data comes from CIFAR10, so each data point has one file in filenames and the corresponding metadata, which includes the label_index property that the code above reads.
rtype: Tuple with the inputs of the model: a NumPy array with the image and a one-hot vector for the class.
filenames: Tuple with the filenames for the data point.
metadatas: Tuple with the metadatas for the filenames.
deserialization_callback is a function that you implement. It is responsible for transferring the data from the stream of files to the input layer. Data augmentation should be implemented as part of this function as well.
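The shape of such a callback can be tried out in isolation with stand-in helpers. The sketch below is an illustration only: `read_image` and `random_transform` are dummy placeholders written for this example (they do not read files or augment anything), and plain lists are used instead of NumPy arrays to keep it dependency-free.

```python
NUM_CLASSES = 10  # CIFAR10 has ten classes


def one_hot(label_index, num_classes=NUM_CLASSES):
    """Return a list with 1.0 at label_index and 0.0 elsewhere."""
    vector = [0.0] * num_classes
    vector[label_index] = 1.0
    return vector


def read_image(filename):
    """Stand-in loader: returns a tiny fake 'image' instead of reading the file."""
    return [0.0, 0.5, 1.0]


def random_transform(x):
    """Stand-in augmentation: returns the input unchanged."""
    return x


def deserialization_callback(filenames, metadatas):
    # One file and one metadata record per data point.
    filename, = filenames
    metadata, = metadatas
    x = random_transform(read_image(filename))
    y = one_hot(int(metadata['label_index']))
    return x, y


# Call the callback by hand with one made-up data point.
x, y = deserialization_callback(('cat.png',), ({'label_index': '3'},))
print(x, y)
```

Calling the callback by hand like this is a cheap way to check the unpacking and label conversion before binding it to a data generator.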
There is no limit to the number of data generators that a process can use. For example, you can use one query for the neural network training and validation data and another query from a different dataset for the test data.
Create the data iterators from the generator:
train_generator, test_generator, validation_generator = data_generator.flow()
The number of iterators created is determined by the @split operator in the query. For more information about @split, consult the Query Syntax page.
Every object in the iterator is a matrix representing a batch of data after deserialization.
If your data point consists of more than one file, you must use the @datapoint_by operator. For more information, consult the Query Syntax page.
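To make the idea of "a batch of deserialized data points" concrete, here is a minimal sketch of a batching iterator. It is not the MissingLink implementation: `batch_iterator`, `toy_callback`, and the sample data points are all hypothetical, and the callback returns simple numbers instead of images so the example runs anywhere.

```python
from itertools import islice


def batch_iterator(data_points, callback, batch_size):
    """Yield (inputs, targets) batches built by applying callback to each data point."""
    it = iter(data_points)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        pairs = [callback(filenames, metadatas) for filenames, metadatas in chunk]
        xs, ys = zip(*pairs)
        yield list(xs), list(ys)


def toy_callback(filenames, metadatas):
    # Same signature as a deserialization callback, with trivial outputs.
    filename, = filenames
    metadata, = metadatas
    return len(filename), int(metadata['label_index'])


# Five made-up data points, each with one file and one metadata record.
data_points = [((f'img_{i}.png',), ({'label_index': i % 3},)) for i in range(5)]
batches = list(batch_iterator(data_points, toy_callback, batch_size=2))
print(batches)
```

In this sketch the last batch is smaller than `batch_size` when the data does not divide evenly; whether a real iterator pads, drops, or keeps such a batch depends on the library.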