Creating a Server-Side Streaming Automatic Speech Recognition (ASR) System

There is surprisingly little documented system design knowledge on the public web about designing a server-based speech recognition system.

As the MLCommons Inference Reference Benchmark owner for Automatic Speech Recognition (ASR), I would like to write something about this that others could refer to. The current MLCommons ASR inference benchmark is not reflective of industry use cases, for multiple reasons that I won’t get into here.

It is straightforward to use a scale-out, shared-nothing CPU approach. Essentially, a reverse proxy like NGINX would assign a sticky connection, based on a cookie (or some sort of user ID), to a particular single-core CPU host. This CPU host would run the speech recognition model (i.e., the acoustic and language models). There would be no batching or multi-tenancy whatsoever.
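The sticky assignment can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical list of backend hostnames; in a real deployment the hashing would live in the reverse proxy itself (NGINX's `hash ... consistent` upstream directive does the equivalent).

```python
# Sketch of sticky routing: hash a stable user ID to one backend host so
# the same stream always lands on the same single-core CPU worker.
# BACKENDS is a made-up host list for illustration.
import hashlib

BACKENDS = ["asr-cpu-01", "asr-cpu-02", "asr-cpu-03", "asr-cpu-04"]

def pick_backend(user_id: str) -> str:
    """Deterministically assign a user's stream to one backend."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(BACKENDS)
    return BACKENDS[index]

# The same user always maps to the same host, which is what makes the
# connection "sticky" without any shared state between backends:
assert pick_backend("user-42") == pick_backend("user-42")
```

Note that a plain modulo hash reshuffles most users when the backend list changes; that is why NGINX offers a `consistent` variant.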

It is worth noting that you may want a log-based queueing system like Kafka between the client and the ASR servers. For example, if you want to do speaker diarization, to detect who is speaking when, it is typical for a totally different model to do that, and it can consume the same audio log. In addition, a durable log is a safeguard against downstream failures: if an ASR server crashes, the audio is not lost. Finally, by saving the input audio, you can replay particularly challenging inputs against your system. Be aware of GDPR complications regarding saving this user data, though.
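The key property being relied on here is that the log is append-only and independently consumable at any offset. Below is a toy in-memory stand-in for that property (not the actual Kafka API); the decoder, a diarization model, and an offline replay job would each read the same stream at their own pace.

```python
# Toy stand-in for a log-based queue like Kafka: an append-only log of
# audio chunks keyed by stream ID. Multiple consumers can each read
# from their own offset without affecting each other.
from collections import defaultdict

class AudioLog:
    def __init__(self):
        self._log = defaultdict(list)  # stream_id -> list of audio chunks

    def append(self, stream_id: str, chunk: bytes) -> int:
        """Durably append a chunk; return its offset in the stream."""
        self._log[stream_id].append(chunk)
        return len(self._log[stream_id]) - 1

    def read_from(self, stream_id: str, offset: int = 0):
        """Replay all chunks from `offset` onward. A second consumer
        (e.g., a diarization model, or an offline re-decode after a
        crash) calls this independently of the online decoder."""
        return self._log[stream_id][offset:]

log = AudioLog()
log.append("meeting-1", b"\x00\x01")
log.append("meeting-1", b"\x02\x03")
assert log.read_from("meeting-1") == [b"\x00\x01", b"\x02\x03"]
```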

You should also consider whether your model should output the most likely transcript, or something lower level, like a “lattice” (a compact representation of the n-best transcripts). Consider that your product is a “meeting intelligence service” that does live transcription (e.g., for hard-of-hearing people, or those who can read better than they can hear), followed by creating a searchable index of what was said in the meeting that participants can refer back to. The searchable index requires an n-best list so that you can match words that do not appear in the best hypothesis, but do appear in the second- or third-best one. You can refer to this paper to learn more. In addition, it’s common to do an offline “rescoring” of the first online output lattice to create a “better” one. I can’t find a concise definition of this, though.
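The n-best-indexing argument can be made concrete with a small sketch. The data shapes here are assumptions for illustration (a real system would index lattice arcs with timestamps and scores, not flat strings), but it shows how a word appearing only in the second-best hypothesis still becomes searchable.

```python
# Sketch of indexing n-best hypotheses for meeting search. A word that
# the 1-best transcript got wrong can still be found if an alternative
# hypothesis contains it.
from collections import defaultdict

def build_index(nbest_per_utterance):
    """nbest_per_utterance: list of (utterance_id, [hypothesis strings]),
    hypotheses ordered best-first."""
    index = defaultdict(set)  # word -> set of utterance IDs
    for utt_id, hypotheses in nbest_per_utterance:
        for hyp in hypotheses:
            for word in hyp.lower().split():
                index[word].add(utt_id)
    return index

nbest = [
    ("utt-1", ["the budget meeting", "the budget heating"]),
    ("utt-2", ["ship it friday", "sip it friday"]),
]
index = build_index(nbest)
# "heating" appears only in the 2nd-best hypothesis of utt-1, yet it is
# still retrievable -- a 1-best-only index would miss it:
assert index["heating"] == {"utt-1"}
```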

Finally, the low-latency requirement of streaming speech recognition is at odds with accurate speech recognition in a very particular way. It is well known in the literature that increasing the “right context” of your acoustic model will improve its accuracy. Right context is the future audio that an acoustic model can see before it predicts what is being said right now. Concretely, a model with 100 milliseconds of right context will be able to see 100 milliseconds of audio into the future when it is predicting the grapheme, phoneme, or whatever-label-you-are-using.
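The latency cost of right context is worth stating as arithmetic: the label for time t cannot be emitted until audio up to t plus the right context has arrived. A trivial helper makes the trade-off explicit (the extra `chunk_ms` buffering term is my own addition, modeling network or framing delay):

```python
# Back-of-envelope latency model: right context is a hard lower bound on
# emission latency, before any compute time is even counted.

def emission_latency_ms(right_context_ms: float, chunk_ms: float = 0.0) -> float:
    """Minimum delay before the label for "now" can be emitted.
    `chunk_ms` models any additional buffering (e.g., audio is sent to
    the server in fixed-size chunks)."""
    return right_context_ms + chunk_ms

# A model with 100 ms of right context cannot emit the label for the
# current frame until at least 100 ms of future audio has been received:
assert emission_latency_ms(100.0) == 100.0
assert emission_latency_ms(100.0, chunk_ms=20.0) == 120.0
```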

By having saved the input audio via a message broker like Kafka, you can run an offline decoding process after the audio stream is closed (or after the stream has advanced more than “right context” milliseconds). Note that this basically doubles your compute requirements. It also requires you to train two models: an online model and an offline model. Your offline model had better be meaningfully more accurate than the online one to make this worth it!
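To check whether the offline model actually earns its keep, you would compare word error rate (WER) on the replayed audio. This is the standard word-level edit distance, sketched here in plain Python:

```python
# Word error rate: Levenshtein distance over words, divided by the
# number of reference words. Used to compare the online and offline
# decoders on the same replayed audio.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

assert wer("the cat sat", "the cat sat") == 0.0
assert abs(wer("the cat sat", "the bat sat") - 1 / 3) < 1e-12
```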

Note that taking this observation to the furthest extreme for offline-only models by using bidirectional recurrent neural networks is a pitfall. Each bidirectional RNN must process its entire input from left-to-right and right-to-left. If you have an hour of audio, the entire input must be held in memory before it can be processed. This can easily lead to out-of-memory errors, implying that malicious clients can mount a denial-of-service attack by repeatedly sending too-large payloads. There is also next-to-no memory reuse. Finally, even if you got around all of those issues, you would have a training-inference mismatch: it is uncommon for speech recognition training samples to be longer than one minute. A “Latency-controlled Bidirectional RNN” model is designed to solve some of the problems described in this paragraph.
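The core trick of the latency-controlled approach is to bound how far the backward direction looks ahead: split the input into fixed-size chunks and give each chunk a limited amount of look-ahead context, so memory is bounded by chunk size rather than utterance length. A minimal sketch of that chunking (frame counts are made up):

```python
# Sketch of latency-controlled chunking: each chunk gets only a bounded
# look-ahead instead of the full future, so an hour-long stream never
# needs to be wholly resident in memory.

def lc_chunks(num_frames: int, chunk: int, right_context: int):
    """Return (start, end, context_end) frame spans for each chunk.
    Frames [start, end) are decoded; frames [end, context_end) are the
    bounded look-ahead the backward pass may read."""
    spans = []
    for start in range(0, num_frames, chunk):
        end = min(start + chunk, num_frames)
        context_end = min(end + right_context, num_frames)
        spans.append((start, end, context_end))
    return spans

# 10 frames, chunks of 4, 2 frames of look-ahead. Peak working set is
# chunk + right_context = 6 frames, regardless of total stream length:
assert lc_chunks(10, 4, 2) == [(0, 4, 6), (4, 8, 10), (8, 10, 10)]
```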


This is already quite long, so I’ll stop here. In the future I would like to write about:

  • Client-side considerations (hot word spotting, voice activity detection, endpointing, audio encoding codec) and implications of these decisions for the server-side.
  • The problems with the one-CPU-per-audio-stream approach. This is probably the most important one to understand for a system builder today. It motivates the usage of custom accelerators, which have their own problems. It also motivates a service-oriented architecture, which has been briefly touched upon here via the mention of Kafka and using separate offline and online decoders. For example, it is unlikely that your accelerators have implementations of decoders for the common audio codecs used by clients.

Random notes I may move to a different post:

Model considerations

It is reasonable to expect that you would use a neural network at this point for automatic speech recognition, at least for the acoustic model.

It is important to add the featurization of the audio waveform to your neural network model’s graph, to minimize the chances of a training-inference mismatch. In fact, it’s reasonable that you may want to convert the FFT used in featurization to a DFT, since you can reuse matrix-vector kernels this way (to reduce the code size of your binary). If you work at Google and are unfortunately required to use a TPU, you basically need to use a DFT instead of an FFT, since the TPU’s VLIW processor is not very fast relative to the systolic array processor.
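The “DFT as a matrix-vector product” point is easy to demonstrate: build the N×N DFT matrix once, then the transform is a single matrix-vector multiply, i.e., the same kernel shape as a dense layer. A pure-Python sketch (O(N²) versus the FFT’s O(N log N), which is exactly the trade being made):

```python
# DFT expressed as a plain matrix-vector multiply, so a matmul kernel
# (e.g., a systolic array) can do featurization with no FFT code at all.
import cmath

def dft_matrix(n: int):
    w = cmath.exp(-2j * cmath.pi / n)  # primitive n-th root of unity
    return [[w ** (k * t) for t in range(n)] for k in range(n)]

def dft(signal):
    n = len(signal)
    m = dft_matrix(n)
    # One matrix-vector product -- identical in shape to a dense layer.
    return [sum(m[k][t] * signal[t] for t in range(n)) for k in range(n)]

# Sanity check: a constant signal has all its energy in bin 0.
spectrum = dft([1.0, 1.0, 1.0, 1.0])
assert abs(spectrum[0] - 4.0) < 1e-9
assert all(abs(x) < 1e-9 for x in spectrum[1:])
```

In practice you would precompute (and possibly fold the mel filterbank into) this matrix at model-export time, so featurization ships inside the graph.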

For the MLCommons inference benchmark, we found that the hidden state of the LSTMs in RNN-T seemed to increase in magnitude as the sequence length grew. This didn’t matter for the v0.7 inference benchmark, where each piece of audio was at most 15 seconds long. However, it could cause a training-inference mismatch, since in practice your audio sequences may be much longer. Reference this issue. This is one argument in favor of convolution-based speech recognition architectures: because they have a fixed context window, they can’t have exploding hidden states the way an LSTM can.
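A toy illustration of the failure mode, with entirely made-up numbers (the 1.01 gain is not from RNN-T; real LSTM dynamics are input-dependent): a recurrence whose effective state gain sits above 1 grows with sequence length, while a fixed-window operation’s output is bounded no matter how long the stream runs.

```python
# Toy contrast: unbounded recurrent state vs. a fixed context window.
# Gains, window sizes, and frame counts are arbitrary illustrations.

def recurrent_state(num_steps: int, gain: float = 1.01, x: float = 1.0) -> float:
    """State carries over the entire sequence; with gain > 1 its
    magnitude grows with sequence length."""
    h = 0.0
    for _ in range(num_steps):
        h = gain * h + x
    return h

def conv_output(num_steps: int, window: int = 8, x: float = 1.0) -> float:
    """A fixed-length convolution only ever sees `window` inputs, so its
    output is independent of total sequence length."""
    return min(num_steps, window) * x

# 15 s vs. 60 s of frames (100 frames/s assumed): the recurrent state
# keeps growing, while the windowed output is identical.
assert recurrent_state(6_000) > recurrent_state(1_500) * 10
assert conv_output(6_000) == conv_output(1_500)
```

Tracking the max-abs hidden-state value on held-out long audio is a cheap regression check for this in a real system.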