When training RNNs on sequences, we need to group sequences of different lengths into batches. This allows the RNN trainer to execute one batch at a time with the same computation graph, constructed for one specific sequence length.
The mechanics of batching sequences, padding them, training, and then testing are a bit confusing at first. This notebook illustrates how to perform this mini-batching process in DyNet.
The task we illustrate is a sequence classifier: we map sequences to a label.
The input consists of pairs $(s_j, label_j)$ where $s_j$ are sequences of words in a vocabulary of size VOCAB_SIZE and $label_j$ are labels in a label set of size NUM_OF_CLASSES. Each sequence $s_j$ contains words $w_{ij}$, $i \in [1..l_j]$, where $l_j$ is the length of $s_j$.
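For example, with a vocabulary of size 2 and two classes (the settings used below), one such pair could be $s_j = [0, 1, 1, 0]$ with $label_j = 1$; this is a made-up pair shown only to illustrate the format.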
The architecture consists of the following steps: embed each word of the sequence, run an LSTM over the embedded words, take the last output of the LSTM, and classify it with a softmax layer over the labels.
import dynet as dn
from time import time
from numpy import mean, argmax
VOCAB_SIZE = 2
EMBEDDINGS_SIZE = 10
LSTM_NUM_OF_LAYERS = 1
STATE_SIZE = 10
NUM_OF_CLASSES = 2
REPEATS = 1000
BATCH_SIZE = 2
For this example, we create a synthetic random dataset, mapping random sequences to random labels. The sequence lengths are drawn uniformly from [1..max_seq-1].
from random import randint
def gen_random_dataset(num_of_examples, max_seq, vocab_size, num_of_classes):
    X = []
    Y = []
    for _ in range(num_of_examples):
        seq = []
        # sequence length drawn uniformly from [1, max_seq-1]
        for _ in range(randint(1, max_seq-1)):
            seq.append(randint(0, vocab_size-1))
        X.append(seq)
        Y.append(randint(0, num_of_classes-1))
    return X, Y
def print_dataset(X, Y):
    for seq, label in zip(X, Y):
        print('label:', label, 'seq:', seq)
X, Y = gen_random_dataset(10, 10, VOCAB_SIZE, NUM_OF_CLASSES)
print_dataset(X, Y)
This part is the key technical aspect of this notebook.
We split our dataset into batches so that each batch can be trained with a computation graph built for a single sequence length.
First, we sort the dataset by length so that each batch groups together sequences of similar lengths. This minimizes the length variance within each batch, and hence the amount of padding needed.
from numpy import ceil
def to_batch(X, Y, batch_size):
    # sort the dataset by sequence length
    data = list(zip(*sorted(zip(X, Y), key=lambda x: len(x[0]))))
    batched_X = []
    batched_Y = []
    for i in range(int(ceil(len(X) / batch_size))):
        batched_X.append(data[0][i*batch_size:(i+1)*batch_size])
        batched_Y.append(data[1][i*batch_size:(i+1)*batch_size])
    return batched_X, batched_Y
batched_X, batched_Y = to_batch(X, Y, BATCH_SIZE)
print("Dataset in batches of size ", BATCH_SIZE)
print_dataset(batched_X, batched_Y)
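As a concrete illustration of the grouping (on made-up data, independent of the random dataset above):
# Hypothetical illustration of to_batch with batch_size = 2:
# X = [[0, 1, 1], [1], [0, 1]], Y = [0, 1, 0]
# After sorting by length, the sequences are [1], [0, 1], [0, 1, 1].
# to_batch(X, Y, 2) -> batched_X = [([1], [0, 1]), ([0, 1, 1],)]
#                      batched_Y = [(1, 0), (0,)]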
We must now ensure that all sequences within a batch have the same length. We achieve this by padding the beginning of each sequence with a new vocabulary token that serves as a special "padding" code.
#add new padding token to the vocabulary
VOCAB_SIZE += 1
def pad_batch(batch):
    # the batch is sorted by length, so the last sequence is the longest
    max_len = len(batch[-1])
    padded_batch = []
    for x in batch:
        x = [VOCAB_SIZE-1]*(max_len-len(x)) + x
        padded_batch.append(x)
    return padded_batch
#pad batches
batched_X_padded = list(map(pad_batch, batched_X))
print_dataset(batched_X_padded, batched_Y)
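To make the padding concrete, here is a tiny hand-worked example (the input batch is made up; the padding token is index VOCAB_SIZE-1, i.e. 2, since the original vocabulary had 2 words):
# Hypothetical illustration: a batch of two sequences of lengths 1 and 2.
# The shorter one is prefixed with the padding token (index 2) until both
# sequences reach the length of the longest sequence in the batch.
toy_batch = [[0], [1, 0]]
print(pad_batch(toy_batch))  # -> [[2, 0], [1, 0]]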
Our model is simple: embed each item, run an RNN over the embedded items, and use the last output to classify the sequence with a softmax layer.
model = dn.Model()
input_lookup = model.add_lookup_parameters((VOCAB_SIZE, EMBEDDINGS_SIZE))
lstm = dn.LSTMBuilder(LSTM_NUM_OF_LAYERS, EMBEDDINGS_SIZE, STATE_SIZE, model)
output_w = model.add_parameters((NUM_OF_CLASSES, STATE_SIZE))
output_b = model.add_parameters((NUM_OF_CLASSES))
# Return the model prediction over a batch - each prediction is a vector (p_i) of dim NUM_OF_CLASSES
# As usual we do not apply the softmax here - but in the loss function.
def get_probs(batch):
    dn.renew_cg()
    # the i-th step embeds the i-th item of every sequence in the batch at once
    embedded = [dn.lookup_batch(input_lookup, chars) for chars in zip(*batch)]
    state = lstm.initial_state()
    output_vec = state.transduce(embedded)[-1]
    w = dn.parameter(output_w)
    b = dn.parameter(output_b)
    return w*output_vec + b
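To see what zip(*batch) does, consider a made-up padded batch of two sequences: zip transposes it from batch-major to time-major order, so each call to lookup_batch embeds one time step across the whole batch.
# Hypothetical illustration of the transposition inside get_probs:
# padded batch (batch-major):  [[2, 0], [1, 0]]
# zip(*batch) (time-major):    [(2, 1), (0, 0)]
# lookup_batch(input_lookup, (2, 1)) embeds the first token of both sequences
# as a single batched expression of dim (EMBEDDINGS_SIZE,) with batch size 2.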
Since our function returns scores over a batch, we use 'pickneglogsoftmax_batch' to compute the negative log-probability of the correct label for each example in the batch, and 'sum_batches' to sum these losses over the batch.
def train(trainX, trainY):
    print('starting train')
    trainer = dn.AdamTrainer(model)
    for _ in range(REPEATS):
        for X, Y in zip(trainX, trainY):
            probs = get_probs(X)
            loss = dn.sum_batches(dn.pickneglogsoftmax_batch(probs, Y))
            loss_value = loss.value()  # runs the forward pass and returns the scalar loss
            loss.backward()
            trainer.update()
    print('done training!')
To predict over a batch, we take the numpy value of the scores (using 'npvalue'), which has shape (NUM_OF_CLASSES, batch_size), and apply argmax to each column to get the index of the most probable label.
def validate(testX, testY):
    print('starting validation')
    acc = []
    for X, Y in zip(testX, testY):
        # probs has shape (NUM_OF_CLASSES, batch_size): column i holds the scores of example i
        probs = get_probs(X).npvalue()
        for i in range(len(probs[0])):
            pred = argmax(probs[:, i])
            label = Y[i]
            if pred == label:
                acc.append(1)
            else:
                acc.append(0)
    print('accuracy: ', mean(acc))
validate(batched_X_padded, batched_Y)
train(batched_X_padded, batched_Y)
validate(batched_X_padded, batched_Y)
That's it: our model has learned to classify the sequences.
This method of batching and padding applies generally whenever we train RNNs on variable-length sequences.