Listen, Attend, and Spell

Listen, Attend, and Spell is an incremental paper that shows reasonable performance on a large-vocabulary speech recognition task with an attention-based encoder-decoder.

Pyramidal RNNs appear never to have caught on, even though it seems they could be implemented with the existing cuDNN RNN interface. Instead, from what I have observed, people have preferred to stride over the inputs directly.
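
Here is a minimal sketch of that input-striding alternative (frame stacking plus subsampling); the function name and shapes are mine, not anything from the paper:

```python
import torch

def stack_and_subsample(x: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Concatenate `factor` adjacent frames, reducing the time axis:
    (batch, time, feat) -> (batch, time // factor, feat * factor)."""
    b, t, f = x.shape
    t = (t // factor) * factor            # drop any trailing remainder frames
    return x[:, :t, :].reshape(b, t // factor, f * factor)

# Example: 40-dim log-mel frames, time axis halved before the first RNN layer.
feats = torch.randn(4, 101, 40)
print(stack_and_subsample(feats).shape)   # torch.Size([4, 50, 80])
```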

This is the first paper to use attentional models in the large-vocabulary (“LVCSR”) setting. The previous model of its type output only phoneme sequences on the fairly small TIMIT dataset (noteworthy for being the only speech dataset I know of with time-aligned phoneme labels).

It is surprising that the paper describes only one very particular setup: the decoder RNN, if I recall correctly, uses two layers, while the encoder uses three pyramidal RNN layers with a hidden state size of 512. (512 is small enough to use persistent RNNs with a batch size of ~4.)
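
For concreteness, a rough PyTorch sketch of how I picture that setup; the exact layer arrangement and the speller's input size are guesses on my part, not the paper's code:

```python
import torch
from torch import nn

def halve_time(x: torch.Tensor) -> torch.Tensor:
    # Concatenate adjacent frames, as in stack_and_subsample above.
    b, t, f = x.shape
    t = (t // 2) * 2
    return x[:, :t, :].reshape(b, t // 2, f * 2)

class Listener(nn.Module):
    """Sketch of the encoder following the numbers quoted above: three
    bidirectional LSTM layers of 512 units, each halving the time axis."""

    def __init__(self, feat_dim: int = 40, hidden: int = 512):
        super().__init__()
        in_dims = [feat_dim * 2, hidden * 4, hidden * 4]  # widths after halving
        self.layers = nn.ModuleList(
            nn.LSTM(d, hidden, batch_first=True, bidirectional=True)
            for d in in_dims
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for lstm in self.layers:
            x = halve_time(x)   # time / 2, features * 2
            x, _ = lstm(x)
        return x                # (batch, time / 8, 1024)

# The speller side, per the note above, is a two-layer LSTM; the input size
# (character embedding + attention context) is a guess on my part.
speller_rnn = nn.LSTM(input_size=256 + 1024, hidden_size=512,
                      num_layers=2, batch_first=True)
```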

They do not constrain the decoder with a dictionary of words. Wow! “[W]e found that this was not necessary since the model learns to spell real words almost all the time.” This is surprising, since the language model rescoring step would presumably do quite badly if a word were misspelled even slightly. I am also surprised that they did not list word error rate results (Table 1) for decoding with a dictionary, since that would almost certainly help: even a single-letter typo can increase word error rate dramatically.
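
To make that last point concrete, here is a small word-level edit-distance calculation (my own illustration, not from the paper) showing that a one-letter misspelling is scored as a full word substitution:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# One dropped letter in one word counts as a full substitution.
print(wer("call my aunt sally", "call my ant sally"))  # 0.25
```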

The model was not robust when they teacher-forced 100% of the time during training. Instead, an output is sampled from the decoder at each time step 10% of the time. This presumably makes parallelization on GPUs harder, since only some samples in a batch will need the sampling step. An obvious way around this is to use the same Bernoulli random variable for the entire batch at each step.
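
A sketch of what I mean, assuming a PyTorch-style decoder loop (not their training code): a single Bernoulli draw per step switches the whole batch between teacher forcing and sampling, so every example stays on the same code path.

```python
import torch

def decoder_step_inputs(prev_targets: torch.Tensor,
                        prev_logits: torch.Tensor,
                        sample_prob: float = 0.1) -> torch.Tensor:
    """Pick the next-step decoder inputs for a whole batch at once.

    With probability `sample_prob` the *entire batch* feeds back characters
    sampled from the decoder's distribution instead of the ground truth, so
    the step remains a single batched op with no per-example branching.
    """
    if torch.rand(()) < sample_prob:
        probs = prev_logits.softmax(dim=-1)               # (batch, vocab)
        return torch.multinomial(probs, num_samples=1).squeeze(-1)
    return prev_targets                                   # teacher forcing

# Usage with made-up shapes: 8 utterances, a 30-character vocabulary.
logits = torch.randn(8, 30)         # decoder outputs from the previous step
gold = torch.randint(0, 30, (8,))   # ground-truth previous characters
next_inputs = decoder_step_inputs(gold, logits)
```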

This is 2015, so they use asynchronous distributed training. Alas.

The paper uses bucketing, a simple but annoying-to-implement technique that batches utterances of similar length together; it has been carried into later work such as Lingvo.
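
A minimal sketch of bucketing as I understand it (the data layout here is my own assumption, not Lingvo's implementation):

```python
from collections import defaultdict

def bucket_batches(utterances, bucket_width=100, batch_size=32):
    """Group utterances into buckets of similar frame length, then batch
    within each bucket so padding waste stays small."""
    buckets = defaultdict(list)
    for utt in utterances:
        # `utt["frames"]` is assumed to be the acoustic feature sequence.
        buckets[len(utt["frames"]) // bucket_width].append(utt)
    for bucket in buckets.values():
        for i in range(0, len(bucket), batch_size):
            yield bucket[i:i + batch_size]
```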

Attention-based models do not have the monotonicity assumption that hybrid HMM-DNN models and CTC-trained models have. (Monotonicity is the property that if word A is output by the decoder at time t and word B at time t + dt, then the audio corresponding to word A comes from earlier in the utterance than the audio corresponding to word B.) Even so, the learned attention looks monotonic, at least in the single example in Figure 2, which makes sense for speech.
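
Nothing in the training objective enforces this, but it is easy to check on a decoded alignment matrix like the one in Figure 2; here is a small check of my own devising:

```python
import torch

def attention_is_monotonic(align: torch.Tensor) -> bool:
    """align: (decoder_steps, encoder_steps) attention weights for one
    utterance. True if the most-attended encoder frame never moves
    backwards in time as decoding proceeds."""
    peaks = align.argmax(dim=-1)
    return bool((peaks[1:] >= peaks[:-1]).all())
```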

This paper does not reach state-of-the-art, shows results on only a single dataset (Google's voice search task, if I recall correctly, rather than a public benchmark like Librispeech), and does not decode online, but that just gives me an appreciation for how hard it can be to do good work.