Improving the Efficiency of Forward-Backward Algorithm Using Batched Computation in TensorFlow

This paper is a follow-up to An Efficient Phone N-gram Forward-backward Computation Using Dense Matrix Multiplication, which describes implementing the forward-backward algorithm, used in essentially all sequence-based, discriminative objective functions, as repeatedly multiplying a block-sparse matrix by a vector. Essentially, the forward pass of the algorithm is an RNN with an identity nonlinearity, and the backward pass is another such RNN that processes the data from the end to the beginning.
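
To make the "RNN with an identity nonlinearity" picture concrete, here is a minimal probability-domain sketch of the two recursions as repeated matrix-vector products. The variable names and shapes are mine, and the real implementation works in the log domain with a block-sparse transition matrix, but the structure is the same.

```python
import tensorflow as tf

def forward_backward(transition, emission_probs, initial, final):
    """Probability-domain forward-backward as repeated matrix-vector products.

    transition:      [S, S], transition[i, j] = P(state j | state i)
    emission_probs:  [T, S], per-frame likelihood of each state
    initial:         [S], initial state distribution
    final:           [S], final state weights
    (Names and shapes are my own illustration, not the paper's code.)
    """
    T = emission_probs.shape[0]

    # Forward pass: an "RNN" whose recurrence is a plain matrix-vector
    # product (identity nonlinearity), scanned left to right.
    alpha = initial * emission_probs[0]
    alphas = [alpha]
    for t in range(1, T):
        alpha = tf.linalg.matvec(transition, alpha, transpose_a=True) * emission_probs[t]
        alphas.append(alpha)

    # Backward pass: the same kind of recurrence, scanned right to left.
    beta = final
    betas = [beta]
    for t in range(T - 2, -1, -1):
        beta = tf.linalg.matvec(transition, beta * emission_probs[t + 1])
        betas.append(beta)
    betas.reverse()

    # Per-frame state occupancies (posteriors).
    gammas = tf.stack(alphas) * tf.stack(betas)
    return gammas / tf.reduce_sum(gammas, axis=1, keepdims=True)
```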

In this paper, they extend that work to batch the matrix multiplies, going from GEMV to GEMM, and to deal with the padding issues that inevitably show up when batching sequence data together.
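
Here is my guess at what the batched, GEMM-per-frame forward pass looks like, including one way to neutralize padded frames. The masking scheme and all names below are assumptions on my part, not necessarily the paper's.

```python
import tensorflow as tf

def batched_forward(transition, emission_probs, initial, seq_lens):
    """Batched forward pass: each step is one GEMM instead of B separate GEMVs.

    transition:     [S, S] shared transition matrix
    emission_probs: [B, T, S] padded per-frame likelihoods for a batch
    initial:        [B, S] initial distributions
    seq_lens:       [B] true lengths; frames past the length are padding
    (A sketch of the idea, not the paper's actual kernel.)
    """
    T = emission_probs.shape[1]
    alpha = initial * emission_probs[:, 0, :]
    for t in range(1, T):
        # One dense [B, S] x [S, S] GEMM updates every utterance at once.
        new_alpha = tf.matmul(alpha, transition) * emission_probs[:, t, :]
        # For padded frames, carry the old alpha through unchanged so that
        # short utterances are unaffected by their padding.
        valid = tf.cast(t < seq_lens, alpha.dtype)[:, None]
        alpha = valid * new_alpha + (1.0 - valid) * alpha
    return alpha
```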

Apparently, according to Streaming End-to-End Speech Recognition for Mobile Devices, this work can also be extended to train RNN-Transducers, but I am not familiar with RNN-T models, unfortunately. I don't know of any public source code extending this work to RNN-Transducers at the moment.

Why it matters

TPUs are not good at anything other than matrix-matrix multiplications. This is a trend common to all accelerators, described here: http://www.cudahandbook.com/2017/10/dont-move-the-data/ The short of it is that you need to reuse each piece of data you pull from off-chip memory many times in order to reach the peak FLOP rate of an accelerator, and GEMM, the jargon for multiplying two dense matrices together, is one of the few algorithms capable of doing this.

By implementing speech recognition objective functions on accelerators, there is no costly transfer of the logits to the CPU to compute the objective function. In fact, the computation of the MMI objective function on CPUs is one of the bottlenecks in Pykaldi2: https://github.com/jzlianglu/pykaldi2#training-speed This paper is important because it shows how to use modern deep learning accelerators effectively in speech recognition.
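
As a back-of-the-envelope illustration of the data-reuse point (my numbers, not the paper's or the linked post's): a GEMV performs roughly two FLOPs per matrix element it loads no matter how big the matrix is, while a GEMM reuses each loaded element on the order of N times.

```python
# Rough arithmetic-intensity comparison for square N x N operands (float32).
N = 4096
gemv_flops = 2 * N * N                 # y = A x
gemv_bytes = 4 * (N * N + 2 * N)       # load A and x, store y
gemm_flops = 2 * N ** 3                # C = A B
gemm_bytes = 4 * (3 * N * N)           # load A and B, store C

print(gemv_flops / gemv_bytes)  # ~0.5 FLOPs per byte: memory bound
print(gemm_flops / gemm_bytes)  # ~N/6 FLOPs per byte: enough reuse to hit peak
```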

Implementation Details

Batching the numerator matrices is tricky, because each utterance has a separate transition matrix. The authors use something called "state-padding" to get around this. A worked example of this would be very useful; my rough guess at what it looks like is sketched below.
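
The following is only my guess at what "state-padding" could mean: zero-pad each utterance's numerator transition matrix out to a common maximum number of states, so the whole batch can be stacked and updated with one batched multiply. The helper name and shapes are mine, and this may well differ from the paper's actual scheme.

```python
import tensorflow as tf

def pad_transitions(transition_list, max_states):
    """Zero-pad each per-utterance transition matrix to [max_states, max_states]
    so the batch can be stacked into one [B, max_states, max_states] tensor."""
    padded = []
    for A in transition_list:          # A: [S_b, S_b], S_b varies per utterance
        s = A.shape[0]
        padded.append(tf.pad(A, [[0, max_states - s], [0, max_states - s]]))
    return tf.stack(padded)

A0 = tf.constant([[0., 1.], [0., 1.]])                          # 2-state graph
A1 = tf.constant([[0., 1., 0.], [0., 0., 1.], [0., 0., 1.]])    # 3-state graph
transitions = pad_transitions([A0, A1], max_states=3)           # [2, 3, 3]

# Forward update for the whole batch in one batched multiply; the padded
# states simply carry zero probability and never contribute.
alpha = tf.constant([[1., 0., 0.], [1., 0., 0.]])               # [B, max_states]
alpha = tf.einsum('bi,bij->bj', alpha, transitions)
```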

A matrix is used to map states to output labels. It is not clear to me right now how states and output labels differ; my current reading is sketched below.
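
My current mental model, which may well be wrong: graph states are positions in the numerator or denominator graph, output labels are the network's output units, and several states may share one label. Under that reading, the mapping matrix would be a 0/1 matrix used to scatter per-label network scores onto graph states. Everything below is a hypothetical illustration, not the paper's code.

```python
import tensorflow as tf

# Hypothetical 0/1 mapping: row s is one-hot on the label that state s emits.
state_to_label = tf.constant([
    # labels:  a    b    c
    [1., 0., 0.],   # state 0 emits label "a"
    [1., 0., 0.],   # state 1 also emits label "a"
    [0., 1., 0.],   # state 2 emits label "b"
    [0., 0., 1.],   # state 3 emits label "c"
])                  # [num_states, num_labels]

log_probs = tf.random.normal([5, 3])    # [T, num_labels] dummy network outputs

# Gather each state's label score per frame so the forward-backward
# recursion can consume per-state scores; states sharing a label get the
# same value.
state_scores = tf.matmul(log_probs, state_to_label, transpose_b=True)  # [T, num_states]
```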

Other Avenues

I am a little bit skeptical about the authors' claims that they see an improvement from using block-sparse multiplication. It seems to me that it would be better to implement these RNNs with true sparse matrix multiplication, using merge-path-based sparse-matrix-by-dense-vector (SpMV, when there is no batching) or sparse-matrix-by-dense-matrix (SpMM, for batching) implementations. This avoids some of the bizarre restrictions on the transition matrix described in An Efficient Phone N-gram Forward-backward Computation Using Dense Matrix Multiplication. Note that the former is a template library that can be specialized for log-domain calculations (see Sec. 4.1, Coping with Numerical Stability), while the latter library is unfortunately incomplete in that regard. This approach works only on GPUs, not TPUs, but my impression is that that is fine.
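
I am not reproducing a merge-path kernel here; just to make the shape of the alternative concrete, here is what the forward recursion looks like with a stock sparse-by-dense multiply (TensorFlow's built-in op, probability domain only). A log-domain version would need a custom log-sum-exp-semiring kernel, which is exactly the gap mentioned above. The graph below is a toy example of my own.

```python
import tensorflow as tf

S = 4
# Toy transition matrix kept explicitly sparse: arcs 0->1, 1->2, 2->3, 3->3.
transition = tf.sparse.SparseTensor(
    indices=[[0, 1], [1, 2], [2, 3], [3, 3]],
    values=[1.0, 1.0, 1.0, 1.0],
    dense_shape=[S, S],
)

alpha = tf.constant([[1.0], [0.0], [0.0], [0.0]])   # [S, 1] initial distribution
for frame_probs in tf.ones([3, S, 1]):              # dummy per-frame likelihoods
    # alpha_t = (A^T alpha_{t-1}) * p_t  -- one sparse GEMV per frame.
    alpha = tf.sparse.sparse_dense_matmul(transition, alpha, adjoint_a=True) * frame_probs
```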

Note: It would be good to set up a simple bibliography for these kinds of blog posts.