Investigating wav2letter

A bit of jargon

I will refer to “lattice-free maximum mutual information” (LF-MMI), a popular objective function, interchangeably with “chain”, a popular speech recognition architecture used in Kaldi. This is a little bit wrong, because LF-MMI is an objective function, while chain is a hybrid DNN-HMM with a specific HMM architecture which happens to be trained with LF-MMI. You can learn more about LF-MMI in its original paper, and about chain in the Kaldi documentation.

wav2letter’s AutoSeg Criterion as Lattice-Free MMI

Facebook AI Research recently open-sourced an end-to-end DNN acoustic model they call wav2letter. I took note because of their interesting objective function, which they call the AutoSeg Criterion (ASG):
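Roughly, in the notation of the wav2letter paper (I’m transcribing from memory, so treat the exact form as approximate), with \( f_{\pi_t}(x) \) the network’s score for letter \( \pi_t \) at frame \( t \) and \( g_{\pi_{t-1}, \pi_t} \) a learned letter-to-letter transition score:

\[
ASG(\theta, T) = - \operatorname{logadd}_{\pi \in G_{asg}(\theta, T)} \sum_{t=1}^{T} \left( f_{\pi_t}(x) + g_{\pi_{t-1}, \pi_t} \right) + \operatorname{logadd}_{\pi \in G_{full}(\theta, T)} \sum_{t=1}^{T} \left( f_{\pi_t}(x) + g_{\pi_{t-1}, \pi_t} \right)
\]

Here \( G_{asg}(\theta, T) \) ranges over the paths consistent with the correct transcription, and \( G_{full}(\theta, T) \) over all paths.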

Once you note that minimizing \( ASG(\theta, T) \) is equivalent to maximizing \( -ASG(\theta, T) \), and that maximizing \( -ASG(\theta, T) \) is in turn equivalent to maximizing \( \exp(-ASG(\theta, T)) \) (because \( \exp \) is monotonic), a familiar equivalent equation shows up:
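Concretely (again paraphrasing the notation, so take the exact form with a grain of salt), the quantity being maximized is a ratio of path scores, a numerator over the correct paths divided by a denominator over all paths, just like MMI:

\[
\exp(- ASG(\theta, T)) = \frac{\displaystyle \sum_{\pi \in G_{asg}(\theta, T)} \exp\left( \sum_{t=1}^{T} f_{\pi_t}(x) + g_{\pi_{t-1}, \pi_t} \right)}{\displaystyle \sum_{\pi \in G_{full}(\theta, T)} \exp\left( \sum_{t=1}^{T} f_{\pi_t}(x) + g_{\pi_{t-1}, \pi_t} \right)}
\]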

That is to say, this is a globally normalized model: it normalizes over all possible paths through the decoding graph (like LF-MMI), not just over the outgoing arcs of each state of the decoding graph (like CTC). The only real difference I see is that ASG uses letters, while LF-MMI uses context-dependent phone substates (the hold-over from GMM-HMMs which prevents LF-MMI from being an “end-to-end” model, if that’s something you really want). Both ASG and LF-MMI training can be cast as running the forward-backward algorithm on a finite state transducer.

Why does this matter?

A couple of reasons:

1) We can leverage the finite state transducer framework to compute the gradients of both LF-MMI and ASG via forward-backward using the same primitive: a sparse matrix-vector multiply in the Log semiring (a NumPy sketch of this appears after this list). There is a provably efficient implementation of this primitive on GPU already (efficient in the sense that there is an upper bound on how much work each individual thread has to do, so that no processors are left waiting idly), by the almighty Duane Merrill: https://github.com/dumerrill/merge-spmv (Note that NVIDIA’s nvgraph library’s nvgraphSrSpmv() is based upon this, but unfortunately it does not expose the Log semiring as part of its interface.)

2) We can trivially incorporate language model information into acoustic model training this way. Simply do compose(S, G), where S is your spelling finite state transducer and G is your n-gram language model, to get ASG’s \( G_{full} \) (a composition sketch also appears after the list). Edges between words will carry the corresponding weights of the n-gram language model. The only previous work I know of that explores this is Baidu’s Cold Fusion. However, Cold Fusion does not take advantage of a globally normalized objective function. It also does not lend itself to a trivial implementation via existing algorithms, because it focuses on neural net language models rather than n-gram language models. (This is an opinion, but I think neural net language models are better suited to rescoring decoded output than to online decoding.) Why haven’t we done the same thing with LF-MMI yet? Concerns about over-training, and the assumption that acoustic models need to be trained only on short stretches of time: typically 450 10 ms frames, or 4.5 seconds, which is not quite enough for most sentences in read speech (e.g., Librispeech). Since end-to-end models don’t have the luxury of a GMM-HMM to give precise time alignments for segments shorter than single sentences, perhaps a language model can help them figure out the alignment.

3) This might be an opportunity to shoehorn speech decoders, which traditionally sit on top of a neural network, into the neural network itself. Engineering-wise, this tremendously lowers the barrier to entry for using automatic speech recognition in an application. Science-wise, this would hopefully reduce the amount of redundant work done by multiple groups.
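To make the forward-backward-as-SpMV point in (1) concrete, here is a minimal sketch in plain NumPy. It is not wav2letter’s or Kaldi’s actual implementation: the matrices are dense, the function names are mine, and a real implementation would store the decoding graph as a sparse arc list and use something like Merrill’s merge-based SpMV on GPU.

```python
import numpy as np

# In the Log semiring, "plus" is logaddexp and "times" is ordinary addition,
# so one step of the forward recursion is a matrix-vector multiply in that semiring.

def log_semiring_spmv(log_A, log_alpha):
    """One forward step: new_alpha[j] = logsumexp_i(log_alpha[i] + log_A[i, j])."""
    return np.logaddexp.reduce(log_alpha[:, None] + log_A, axis=0)

def forward_log_total(per_frame_log_A, log_init):
    """Total log-score of all paths through the graph (e.g., the denominator term)."""
    log_alpha = log_init
    for log_A in per_frame_log_A:  # each log_A combines transition + acoustic scores
        log_alpha = log_semiring_spmv(log_A, log_alpha)
    return np.logaddexp.reduce(log_alpha)

# Tiny usage example: 3 states, 2 frames of made-up scores, starting in state 0.
rng = np.random.default_rng(0)
frames = [rng.normal(size=(3, 3)) for _ in range(2)]
init = np.array([0.0, -np.inf, -np.inf])
print(forward_log_total(frames, init))
```

Run the same recursion on the numerator graph and on the full graph, backward as well as forward, and the difference of the two totals gives the objective, while the occupancies from forward-backward give the gradient.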
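And the composition in (2), sketched with the pynini wrapper around OpenFst purely for illustration; the file names are placeholders, and you would need S’s output symbol table to match G’s input symbol table.

```python
import pynini  # Python wrapper around OpenFst

# Hypothetical inputs (placeholder file names):
#   spelling.fst - S: maps letter sequences to words
#   ngram_lm.fst - G: an n-gram word language model compiled as a weighted acceptor
S = pynini.Fst.read("spelling.fst")
G = pynini.Fst.read("ngram_lm.fst")

# compose(S, G) is a letters-to-words transducer whose arcs between words carry
# the n-gram LM weights, i.e. a version of G_full with LM information baked in.
SG = pynini.compose(S, G).optimize()
SG.write("letters_to_words_with_lm.fst")
```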