How to Get Up to Speed in Speech Recognition Fast
I’ve noticed that there is no standard resource on modern speech recognition that gives people in the field a shared foundation. At the very least, there is nothing like CS231n for speech. This is my small effort to fix that.
In my opinion, the best books for learning speech recognition are:
HTK Book - The first few chapters of this book describe common feature representations and how speech recognition has been done for many, many years, including today. Just google to find it.
Statistical Methods for Speech Recognition - Classic book, very well written despite its age. Motivates a lot of the common graph search algorithms and so on still used in speech recognition today.
Automatic Speech Recognition: A Deep Learning Approach - Best book describing “DNN-HMM” systems, which were the first deep learning systems used for speech (and are still used by multiple industry players today). Their drawback is that they require “GMM pretraining”, which prevents them from being fully “end-to-end” trainable.
Sadly, I don’t know a good resource for end-to-end ASR, a la Deep Speech 2. The Deep Speech 2 paper is very cool, but it lacks important details like how to implement a decoder for a trained model. I may try to write something up myself at some point.
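To make “a decoder for a trained model” a bit more concrete, here is a minimal sketch of the simplest decoder there is: greedy (best-path) CTC decoding. Deep Speech 2 is trained with CTC, but the function name, the blank index, and the toy per-frame scores below are assumptions made for illustration and are not taken from the paper. A practical decoder would run a beam search and fold in a language model.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores, alphabet):
    """Greedy CTC decoding.

    frame_scores: one list of per-symbol scores for each audio frame.
    alphabet: maps a symbol index to its character (index BLANK is the blank).
    """
    # 1. Pick the best-scoring symbol in every frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__)
                 for frame in frame_scores]
    # 2. Collapse runs of the same symbol into a single symbol.
    collapsed = [symbol for symbol, _run in groupby(best_path)]
    # 3. Drop the blanks that separated the runs.
    return "".join(alphabet[s] for s in collapsed if s != BLANK)


# Toy network output: 6 frames over the symbols <blank>, c, a, t.
alphabet = ["<blank>", "c", "a", "t"]
frames = [
    [0.10, 0.70, 0.10, 0.10],  # c
    [0.10, 0.60, 0.20, 0.10],  # c (repeat, collapsed away)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.70, 0.10],  # a
    [0.10, 0.10, 0.20, 0.60],  # t
    [0.10, 0.10, 0.20, 0.60],  # t (repeat, collapsed away)
]
print(ctc_greedy_decode(frames, alphabet))  # cat
```

The order of steps 2 and 3 matters: collapsing before removing blanks is what lets a blank separate two genuine repeats of the same letter.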
- A Bit of Progress in Language Modeling. Summarizes the problems with estimating N-gram language models. pocolm is the best work based on this that has a license suitable for commercial use. Unfortunately, it is a little buggy when handling Unicode characters in Python 3. If you encounter this issue, contact me, and I can prioritize fixing it.
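The core estimation problem is easy to show on a toy corpus: a maximum-likelihood bigram model assigns zero probability to any word pair it never saw. The sketch below contrasts that with interpolated absolute discounting, a simpler relative of the Kneser-Ney family of smoothing methods. The corpus, the function names, and the discount value 0.5 are all assumptions chosen for illustration, and this is not how pocolm estimates its models.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
history_counts = Counter(corpus[:-1])  # how often each word starts a bigram
total = len(corpus)


def p_mle(word, prev):
    """Maximum-likelihood bigram estimate: count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / history_counts[prev]


def p_abs_discount(word, prev, d=0.5):
    """Interpolated absolute discounting with a unigram fallback.

    Subtract d from every seen bigram count and hand the freed-up
    probability mass to the unigram distribution.
    """
    h = history_counts[prev]
    distinct_followers = sum(1 for (p, _w) in bigrams if p == prev)
    discounted = max(bigrams[(prev, word)] - d, 0.0) / h
    fallback_weight = d * distinct_followers / h
    return discounted + fallback_weight * unigrams[word] / total


print(p_mle("sat", "the"))           # 0.0: "the sat" never occurs
print(p_abs_discount("sat", "the"))  # small but non-zero
print(sum(p_abs_discount(w, "the") for w in unigrams))  # sums to ~1.0
```

Histories that never occur in the training data (here, "ate") would still divide by zero in both functions; real toolkits fall back to lower-order models for those as well.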
Finite State Transducers
Finite State Transducers (FSTs) are magical data structures which can represent HMMs, N-gram language models, pronunciations, spellings, and all kinds of things. The “composition” operator on FSTs allows you to build up a decoding graph as described in Statistical Methods for Speech Recognition in a principled way. Just read the legendary HBKA to understand everything: https://cs.nyu.edu/~mohri/pub/hbka.pdf
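To give a feel for what composition does, here is a minimal sketch using plain Python dictionaries rather than a real toolkit such as OpenFst. It assumes epsilon-free transducers with non-negative weights in the tropical semiring (costs add along a path, and the cheapest path wins); handling epsilons properly needs the composition filter described in HBKA. The dictionary layout, the toy homophone transducer, and the toy grammar are all hypothetical.

```python
import heapq
from collections import deque
from itertools import count


def compose(a, b):
    """Compose two epsilon-free weighted FSTs (tropical semiring).

    An FST is {"start": s, "finals": {state: cost},
               "arcs": {state: [(ilabel, olabel, cost, next_state), ...]}}.
    """
    start = (a["start"], b["start"])
    arcs, finals = {}, {}
    queue, seen = deque([start]), {start}
    while queue:
        qa, qb = state = queue.popleft()
        arcs[state] = []
        if qa in a["finals"] and qb in b["finals"]:
            finals[state] = a["finals"][qa] + b["finals"][qb]
        for ilabel, mid_a, wa, na in a["arcs"].get(qa, []):
            for mid_b, olabel, wb, nb in b["arcs"].get(qb, []):
                if mid_a != mid_b:  # a's output must match b's input
                    continue
                nxt = (na, nb)
                arcs[state].append((ilabel, olabel, wa + wb, nxt))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return {"start": start, "arcs": arcs, "finals": finals}


def shortest_path(fst):
    """Dijkstra search; returns (cost, input labels, output labels) or None."""
    tie = count()  # tie-breaker so the heap never compares states
    heap = [(0.0, next(tie), fst["start"], (), ())]
    done = set()
    while heap:
        cost, _, state, ins, outs = heapq.heappop(heap)
        if state is None:  # popped the virtual super-final state
            return cost, list(ins), list(outs)
        if state in done:
            continue
        done.add(state)
        if state in fst["finals"]:
            final_cost = cost + fst["finals"][state]
            heapq.heappush(heap, (final_cost, next(tie), None, ins, outs))
        for i, o, w, nxt in fst["arcs"].get(state, []):
            heapq.heappush(heap, (cost + w, next(tie), nxt, ins + (i,), outs + (o,)))
    return None


# Toy "pronunciation" transducer: the sound "tu" could be any of three words.
homophones = {"start": 0, "finals": {2: 0.0}, "arcs": {
    0: [("tu", "two", 1.0, 1), ("tu", "to", 1.0, 1), ("tu", "too", 1.0, 1)],
    1: [("go", "go", 0.0, 2)],
}}
# Toy grammar acceptor: "to go" is much cheaper than "two go" or "too go".
grammar = {"start": 0, "finals": {3: 0.0}, "arcs": {
    0: [("to", "to", 1.0, 1), ("two", "two", 1.5, 2), ("too", "too", 2.0, 2)],
    1: [("go", "go", 0.5, 3)],
    2: [("go", "go", 3.0, 3)],
}}

decoding_graph = compose(homophones, grammar)
print(shortest_path(decoding_graph))  # (2.5, ['tu', 'go'], ['to', 'go'])
```

Neither transducer can pick the right word alone: the homophone transducer treats all three spellings as equally likely, and the grammar knows nothing about sounds. Composition produces a single graph in which the two sources of knowledge are scored together.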
Decoding, the process of turning a stream of audio into a stream of words, is a black art. The best-documented open-source decoders out there are in the Kaldi speech recognition toolkit. At the risk of throwing people into the deep end, if you put in the 5+ hours to read and fully understand http://kaldi-asr.org/doc/decoders.html followed by http://kaldi-asr.org/doc/lattices.html, you will be good.
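Before diving into those pages, it may help to see the basic idea in a few lines. The sketch below is a time-synchronous Viterbi search with beam pruning over a tiny hand-built graph. It is written in the spirit of token-passing decoders, but it is not a reproduction of any Kaldi decoder: the graph format, the function name, and the toy costs are all assumptions made for illustration, and it returns only the single best word sequence rather than a lattice.

```python
def viterbi_beam(graph, start, finals, frame_costs, beam=10.0):
    """Best word sequence through `graph` for the given frames.

    graph: state -> list of (unit, word_or_None, graph_cost, next_state).
    frame_costs: one dict per frame, unit -> acoustic cost (lower is better).
    beam: tokens costlier than (best cost in the frame + beam) are dropped.
    """
    tokens = {start: (0.0, ())}  # state -> (cost so far, words so far)
    for costs in frame_costs:
        new_tokens = {}
        for state, (cost, words) in tokens.items():
            for unit, word, graph_cost, nxt in graph.get(state, []):
                new_cost = cost + graph_cost + costs[unit]
                # Viterbi: keep only the cheapest token arriving at each state.
                if nxt not in new_tokens or new_cost < new_tokens[nxt][0]:
                    new_words = words + (word,) if word is not None else words
                    new_tokens[nxt] = (new_cost, new_words)
        if not new_tokens:
            return None
        best = min(c for c, _ in new_tokens.values())
        tokens = {s: t for s, t in new_tokens.items() if t[0] <= best + beam}
    final_tokens = [t for s, t in tokens.items() if s in finals]
    if not final_tokens:
        return None
    cost, words = min(final_tokens)
    return cost, list(words)


# Two competing words, "hi" (units h, i) and "ho" (units h, o), with self-loops.
# The word label and its language-model cost sit on the first arc of each word.
graph = {
    0: [("h", "hi", 1.0, 1), ("h", "ho", 0.5, 2)],
    1: [("h", None, 0.0, 1), ("i", None, 0.0, 3)],
    2: [("h", None, 0.0, 2), ("o", None, 0.0, 4)],
    3: [("i", None, 0.0, 3)],
    4: [("o", None, 0.0, 4)],
}
frames = [
    {"h": 0.1, "i": 3.0, "o": 3.0},
    {"h": 0.5, "i": 0.4, "o": 2.0},
    {"h": 3.0, "i": 0.2, "o": 1.5},
]
cost, words = viterbi_beam(graph, start=0, finals={3, 4}, frame_costs=frames)
print(words)  # ['hi']
```

With the default beam the acoustics outweigh the language-model preference for "ho". Rerunning with `beam=0.1` prunes the "hi" token after the first frame and returns `['ho']`, which is a small demonstration of why beam width is a trade-off between speed and search errors.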