Forced Alignment 101
For those who aren’t aware, I’ve been advising a project over the past year to create a sizable dataset of permissively licensed audio data with transcripts. It is intended to be a fairly challenging baseline for speech recognition research, since the fairly clean speech of Librispeech is now reaching ~2% word error rate (WER), which is ostensibly lower than a human’s WER, and therefore perhaps not all that meaningful.
The primary mechanism by which audio data with transcripts is turned into labeled training data is “forced alignment”.
There is no go-to resource on forced alignment in the same way that you could watch a video from Andrew Ng to get the gist of, e.g., logistic regression.
This post aims to give a better idea of what problems forced aligners solve and some of the approaches to creating a forced aligner.
The output of forced alignment in a machine learning pipeline:
Gives timestamps to particular words, graphemes, phonemes, or individual states in your probabilistic graphical model (most likely a Hidden Markov Model).
Detects “ums” and interjections (insertions in the audio that don’t appear in your transcript).
Detects problems in the transcript (“insertions” in the transcript). For example, your transcript, if it comes from a WebVTT file, may contain pseudo-CSS classes, which are metadata that don’t correspond to anything spoken in the audio. Or perhaps your source transcript comes from an HTML webpage, and each paragraph in the transcript is surrounded by `<p>` tags.
This allows you to chunk up your source audio into reasonably sized chunks (probably up to 1 minute or so) while removing (1) silence and (2) spoken segments that don’t match your transcript at all.
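As a sketch of that chunking step, suppose the aligner hands back a list of `(word, start_sec, end_sec)` records, with `None` in the word slot marking regions it could not match to the transcript (silence, “ums”, transcript errors). The record format and the `chunk_alignment` helper below are hypothetical, not the output of any particular aligner, but they show the idea: greedily group matched words into chunks of bounded length, and cut at every unmatched region so the junk gets dropped.

```python
def chunk_alignment(words, max_chunk_sec=60.0):
    """Group aligned-word records (word, start_sec, end_sec) into chunks.

    A chunk never spans more than max_chunk_sec seconds, and a new chunk
    is started at every unmatched region (word is None), so silence and
    audio that doesn't match the transcript fall between chunks.
    """
    chunks, current = [], []
    for rec in words:
        if rec[0] is None:  # unmatched region: close the current chunk
            if current:
                chunks.append(current)
                current = []
            continue
        # If adding this word would stretch the chunk past the limit
        # (measured from the chunk's first word), start a new chunk.
        if current and rec[2] - current[0][1] > max_chunk_sec:
            chunks.append(current)
            current = []
        current.append(rec)
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk, together with the matching slice of transcript, becomes one training sample.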
Why is forced alignment necessary for end-to-end deep learning models like CTC?
CTC claims to remove the requirement for forced alignment. Quoted from the abstract of the original paper: “This paper presents a novel method for training RNNs to label unsegmented sequences directly.”
Note that “unsegmented” here means “not force aligned”. I use the term forced alignment because it is less ambiguous than segmentation. The techniques in this paper allow you to train on chunks of audio with accompanying labels, without needing a frame-level alignment between the audio and the labels.
What CTC does not allow you to do is train on, e.g., a one-hour video with an accompanying transcript. The memory usage of the CTC loss function is O(T*L), where T is the audio length and L is the length of the transcript. Naturally, as the audio increases in length, so does the transcript, so this is essentially quadratic. In addition, to do backpropagation through time, you need to store a number of cached activations linear in the length of your audio. For training, the sweet spot is about 15 seconds to 1 minute of audio per sample. This way, your speech-to-text model has a good amount of surrounding context, but training won’t run out of memory.
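To make the O(T*L) cost concrete, here is a simplified, unoptimized NumPy sketch of the CTC forward algorithm (the likelihood computation underlying the loss). The `alpha` lattice it fills has shape (T, 2L+1): one row per audio frame, one column per position in the blank-extended label sequence. That lattice is exactly the T-by-L memory term above, before you even count cached activations.

```python
import numpy as np

def ctc_forward_loglik(log_probs, targets, blank=0):
    """Log-likelihood of `targets` under CTC, via the forward algorithm.

    log_probs: (T, V) per-frame log-probabilities over the vocabulary.
    targets:   label ids (no blanks).
    """
    T, V = log_probs.shape
    # Extended sequence: a blank between (and around) every label.
    ext = [blank]
    for t in targets:
        ext += [t, blank]
    S = len(ext)  # 2*L + 1

    def logsumexp(xs):
        m = max(xs)
        if m == -np.inf:
            return -np.inf
        return m + np.log(sum(np.exp(x - m) for x in xs))

    # The O(T*L) lattice: this is the memory cost discussed above.
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            candidates = [alpha[t - 1, s]]
            if s > 0:
                candidates.append(alpha[t - 1, s - 1])
            # Skipping a blank is allowed only between distinct labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[t - 1, s - 2])
            alpha[t, s] = logsumexp(candidates) + log_probs[t, ext[s]]

    # A valid path ends on the last label or the final blank.
    return logsumexp([alpha[T - 1, S - 1], alpha[T - 1, S - 2]])
```

Production implementations keep only what the gradient needs and run in log space on the GPU, but the (T, 2L+1) lattice shape, and hence the scaling, is the same.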
In this sense, end-to-end speech recognition today is a lie. You still need to do the forced alignment stage when you get a new dataset, which today is a fairly involved process. Most researchers, however, use existing datasets like Librispeech so that they can compare their work with others.
My hope is that, with The People’s Speech and the accompanying open-source software to do forced alignment (Librispeech’s forced alignment software was a fork of Kaldi which, as far as I know, was never put anywhere on the Internet), researchers can consider applying ideas like reversible layers and differentiation through beam search to allow models to train on larger chunks of data than a single minute. This would reduce the training-inference mismatch, which hopefully is beneficial for deployed models. They can create those larger chunks of data by tuning the parameters of the forced aligner software to generate larger chunks of audio. As long as the dev and test sets are stable, people should still be able to make reasonable comparisons of approaches.