Benchmarking CTC Implementations in CUDA

Connectionist Temporal Classification (CTC) was the first objective function used for fully neural end-to-end speech recognition.

This blog post will be helpful to those seeking to understand how to squeeze extra performance out of a CUDA kernel.

I based this work on the DeepBench repo, creating my own fork here. I also forked Warp CTC, making small modifications for the sake of fair comparison in the branch DeepBench-baseline.

To explore the CUDNN CTC implementation, I also had to update maxas, Scott Gray’s Maxwell assembler, to support some oddities in NVIDIA’s cuobjdump output format relating to dual-issued instructions in versions after CUDA 6.5. I needed a newer cuobjdump than CUDA 6.5’s because I was investigating Pascal SASS (Maxwell and Pascal share an instruction set, but cuobjdump 6.5 will still complain about reading a cubin file with a Pascal header). As a result, maxas should be easier to install, since it works out of the box with any recent CUDA install and now supports Pascal. I hope to publicize this at some point, since it took me some time to get these issues sorted out.

CTC Benchmarks

These benchmarks were run on a Quadro P4000 (simply because that’s what I had handy!).

In the table below, T is the number of time steps over which posterior probabilities exist, N is the batch size, and A is the alphabet size. Since speech is typically split into frames strided by 10 milliseconds each, T = 150 corresponds to 1.5 seconds of audio, which is a fairly common utterance length to train on in ASR.

Perceptive readers will note that the length of the output label sequence (usually denoted L) is omitted. In order to simulate real-world conditions where a given input sequence corresponds to a non-fixed output length, I sampled L uniformly between 1 and T, inclusive.
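For concreteness, here is a sketch of how that sampling could be done in the benchmark harness (hypothetical helper, not DeepBench’s actual code):

#include <random>

// Draw a label length L uniformly from [1, T] for each example in the batch,
// to simulate variable-length transcripts for fixed-length inputs.
int sample_label_length(std::mt19937& rng, int T) {
    std::uniform_int_distribution<int> dist(1, T);
    return dist(rng);
}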

The libraries receive activations, not probabilities, as input. This means both libraries have to do essentially the same amount of work, since each has to compute its own softmax.

T    N    A     CuDNN 7 time (usec)   Warp CTC time (usec)   Warp CTC Opt. 1 time (usec)
150  1    28    275                   1596                   1438
150  2    28    395                   4405                   3840
150  4    28    520                   3958                   3766
150  8    28    719                   4096                   3866
150  16   28    1178                  4094                   3890
150  32   28    1815                  4164                   3966
150  64   28    3189                  4310                   4120
150  128  28    6096                  3818                   3742
150  256  28    12142                 6623                   6526
150  1    5000  321                   6890                   6860
150  2    5000  556                   11382                  11289
150  4    5000  862                   11713                  11639
150  8    5000  1418                  12221                  12157
150  16   5000  2576                  13343                  13272
150  32   5000  4771                  15500                  15458
150  64   5000  9516                  15498                  15489
150  128  5000  18954                 24524                  24515
150  256  5000  37732                 48168                  48178

Optimization 1: Use __restrict__ pointers!

The __restrict__ keyword tells the compiler that the memory region pointed to by a given pointer will not be shared (‘aliased’) with the memory pointed to by any other pointer argument to the function. This means that a write to memory region A is guaranteed not to change memory region B. The upshot is that the compiler can load memory into registers far in advance of having to use it. Normally, the compiler would have to load memory right before use because of the possibility of aliasing. This hides memory read latency through instruction-level parallelism: the scalar processors of the GPU can do other work while the loads are in flight.
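As a concrete illustration, here is a minimal sketch of the idea (a made-up kernel, not Warp CTC’s actual signature): all pointer parameters are qualified with __restrict__, so the compiler knows the store through out cannot alias the loads from probs or labels.

template <typename ProbT>
__global__ void gather_label_probs(const ProbT* __restrict__ probs,   // [T x alphabet_size] activations
                                   const int*   __restrict__ labels,  // [L] ground-truth labels
                                   ProbT*       __restrict__ out,     // [L] gathered values
                                   int alphabet_size, int t, int L) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < L) {
        // Because out cannot alias probs or labels, the compiler is free to
        // issue these two loads long before the store that consumes them.
        out[idx] = probs[t * alphabet_size + labels[idx]];
    }
}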

Notably, results improve only when A=28, not when A=5000. This is because the number of thread blocks grows with the alphabet size. With more thread blocks, thread-level parallelism can be used instead to hide memory read latencies.

This simple change is demonstrated in this commit.

Things which didn’t work

The __ldg intrinsic

CUDA provides an intrinsic function named __ldg() to cache global memory reads in the fused L1/Texture Cache. I tried minimizing Warp CTC’s shared memory usage for ground-truth output labels (this shared memory was used only as a cache, not for cooperation among threads), and wrapped memory reads to the global labels array with __ldg().

I also called cudaFuncSetCacheConfig(compute_alpha_kernel<ProbT, NT, VT>, cudaFuncCachePreferL1) and cudaFuncSetCacheConfig(compute_beta_and_grads_kernel<ProbT, NT, VT>, cudaFuncCachePreferL1), to make sure that the L1 cache was larger (and thus hit more often), given that shared memory was no longer used in compute_alpha_kernel and was reduced in compute_beta_and_grads_kernel.
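For concreteness, here is a sketch of what this experiment roughly looked like (hypothetical kernel and variable names, not Warp CTC’s actual code): the label reads go through __ldg(), and the cache-config call asks the runtime to prefer a larger L1 over shared memory.

__global__ void read_labels_kernel(const int* __restrict__ labels, int* out, int num_labels) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < num_labels) {
        // __ldg() loads through the read-only (fused L1/texture) cache path
        // instead of staging the labels in shared memory first.
        out[idx] = __ldg(&labels[idx]);
    }
}

void configure_caches() {
    // With shared memory no longer used as a cache, ask for the larger L1
    // carve-out so the __ldg() reads hit in cache more often.
    cudaFuncSetCacheConfig(read_labels_kernel, cudaFuncCachePreferL1);
}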

The kernel to compute alpha values is notably wasteful because it computes the entire alpha matrix, when it really only needs two columns of alpha (the current time step and the previous time step). Depending on your alphabet size, it is possible you could fit these two columns into shared memory, probably giving a large speedup. Warp CTC already does this with the beta matrix.

Based on this experiment, I’d recommend manually managing shared memory as cache instead of trying to tune CUDA’s caching behaviors. It is much easier to reason about.
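Here is a rough sketch of the two-column idea mentioned above (placeholder recursion, not the actual CTC update): only the columns for time t-1 and t live in shared memory, and the kernel ping-pongs between them instead of materializing the full alpha matrix in global memory.

// Assumes S <= MAX_S and the kernel is launched with at least S threads per block.
template <typename ProbT, int MAX_S>
__global__ void alpha_two_columns_sketch(const ProbT* __restrict__ log_probs,  // [T x S]
                                         ProbT* __restrict__ alpha_last,       // [S] alpha at t = T-1
                                         int S, int T) {
    // Two columns of alpha, double-buffered in shared memory.
    __shared__ ProbT alpha[2][MAX_S];
    int s = threadIdx.x;

    // Initialization at t = 0 (placeholder; the real kernel only seeds the
    // first entries of the extended label sequence).
    if (s < S) alpha[0][s] = log_probs[s];
    __syncthreads();

    for (int t = 1; t < T; ++t) {
        int prev = (t - 1) & 1;
        int cur  = t & 1;
        if (s < S) {
            // Placeholder update: the real CTC recursion combines
            // alpha[prev][s], alpha[prev][s-1] (and sometimes alpha[prev][s-2]).
            alpha[cur][s] = alpha[prev][s] + log_probs[t * S + s];
        }
        __syncthreads();
    }
    if (s < S) alpha_last[s] = alpha[(T - 1) & 1][s];
}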

Tune Register Blocking with Only Odd Grain Sizes

Warp CTC employs a rather obscure CUDA technique called “register blocking.” This technique allows you to increase or decrease the amount of work a given thread does via a template parameter, which makes it easier to search the space of shared memory, residency, and register file constraints in CUDA programming. The technique is documented and liberally employed in the ModernGPU library. What we care about here is that its performance section notes that the “grain size” (which controls the size of the block) should always be odd:

The omnipresent grain size parameter VT [which controls register block size] is almost always an odd integer in MGPU code. This choice allows us to step over bank-conflict issues that would otherwise need to be resolved with padding and complicated index calculations. Kernels execute correctly on even grain sizes, but expect diminished performance on these. Omit even grain sizes when searching the tuning space.

Warp CTC actually violates this! It uses even grain sizes like 2, 4, 6, and so on. Why? I’m not sure! But it seemed worthwhile to correct for this and see if there was an improvement.

In my experiments, there was no improvement. It might be worthwhile to change the code so that output label sizes can be specified, as this controls which template instantiation gets called. But at the least, it suggests that tuning the register blocking grain size won’t be a magic bullet.

Update: It makes sense that using only odd grain sizes would not help. These prevent memory bank conflicts only when threads access contiguous blocks of memory in sequence. However, in the warp-ctc kernel, all memory accesses are strided by the block size.
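To make the technique concrete, here is a toy register-blocked kernel (not Warp CTC’s), in the style of ModernGPU’s NT/VT parameters: each thread keeps VT values in registers, and all global accesses are strided by the block size NT, which is the access pattern the update above refers to.

template <typename T, int NT, int VT>
__global__ void scale_register_blocked(const T* __restrict__ in,
                                       T* __restrict__ out,
                                       T factor, int n) {
    // Assumes the kernel is launched with NT threads per block; each block
    // handles a tile of NT * VT elements.
    const int tile = NT * VT;
    int base = blockIdx.x * tile + threadIdx.x;

    T vals[VT];  // VT is a compile-time constant, so this array stays in registers.
    #pragma unroll
    for (int i = 0; i < VT; ++i) {
        int idx = base + i * NT;       // global accesses strided by the block size NT
        if (idx < n) vals[i] = in[idx];
    }
    #pragma unroll
    for (int i = 0; i < VT; ++i) {
        int idx = base + i * NT;
        if (idx < n) out[idx] = vals[i] * factor;
    }
}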

Why is CuDNN CTC so much faster?

Given that the venerable Warp CTC appears hard to improve without architectural changes, the next obvious question is why CuDNN CTC is about an order of magnitude faster.

I intend to look into this, but for now, here is the nvprof output for each implementation:

CuDNN CTC: nvprof bin/ctc-bench cudnn-ctc

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   71.96%  7.55546s      1616  4.6754ms  113.06us  27.655ms  void compute_grads_deterministic<float, int=32, int=16>(float const *, int const *, int const *, int const *, int const *, int const *, int*, float*, float*, float*, float*, int, int, int, int)
                   18.42%  1.93407s      1818  1.0638ms  8.9910us  8.2578ms  void cudnn::detail::softmax_fw_kernel_resident<int=2, float, float, int=256, int=1, int=0, int=0, int=32, int=0>(cudnnTensorStruct, float const *, cudnn::detail::softmax_fw_kernel_resident<int=2, float, float, int=256, int=1, int=0, int=0, int=32, int=0>, float*, int, float, float*, int, int)
                    8.85%  928.70ms      1818  510.83us  192.99us  2.0357ms  void compute_alphas_and_betas<float>(float const *, int const *, int const *, int const *, int const *, int const *, int*, float*, float*, float*, float*, int, int, int, int)
                    0.52%  54.474ms        18  3.0263ms  2.6860ms  3.8294ms  generate_seed_pseudo(unsigned __int64, unsigned __int64, unsigned __int64, curandOrdering, curandStateXORWOW*, unsigned int*)
                    0.09%  9.6613ms      9495  1.0170us     576ns  13.056us  [CUDA memcpy HtoD]
                    0.09%  8.9256ms       202  44.186us  14.623us  74.719us  void compute_grads_deterministic<float, int=32, int=1>(float const *, int const *, int const *, int const *, int const *, int const *, int*, float*, float*, float*, float*, int, int, int, int)
                    0.08%  8.0846ms        18  449.15us  5.4720us  3.7402ms  void gen_sequenced<curandStateXORWOW, float, int, __operator_&__(float curand_uniform_noargs<curandStateXORWOW>(curandStateXORWOW*, int))>(curandStateXORWOW*, float*, unsigned long, unsigned long, int)

Warp CTC: nvprof bin/ctc-bench warp-ctc

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   33.92%  6.68129s      1111  6.0138ms  2.6615ms  10.267ms  void compute_betas_and_grad_kernel<float, int=32, int=9>(float const *, int const *, int const *, int const *, int const *, float*, float const *, float*, float*, int, int, int, int, int)
                   20.11%  3.96064s       505  7.8429ms  2.7122ms  16.368ms  void compute_betas_and_grad_kernel<float, int=64, int=6>(float const *, int const *, int const *, int const *, int const *, float*, float const *, float*, float*, int, int, int, int, int)
                    8.30%  1.63482s      1818  899.24us  1.4080us  8.1429ms  void compute_probs_kernel<float, int=1, ctc_helper::exponential<float, float>>(float, float*, float const *, int, int)
                    8.10%  1.59451s      1818  877.07us  1.0880us  7.9533ms  [CUDA memcpy DtoD]
                     7.93%  1.56164s      1818  858.99us  1.2800us  7.7846ms  void prepare_stable_SM_kernel<float, int=1, ctc_helper::identity<float, float>>(float, float*, float const *, int, int)
                     6.81%  1.34087s      1111  1.2069ms  1.0385ms  1.4840ms  void compute_alpha_kernel<float, int=32, int=9>(float const *, int const *, int const *, int const *, int const *, int const *, int*, float*, float*, int, int, int, int, int)
                     3.93%  774.89ms       202  3.8361ms  1.2265ms  6.4689ms  void compute_betas_and_grad_kernel<float, int=32, int=1>(float const *, int const *, int const *, int const *, int const *, float*, float const *, float*, float*, int, int, int, int, int)
                     3.84%  756.44ms      1818  416.08us  2.4960us  3.5915ms  void reduce_rows<int=128, ctc_helper::exponential<float, float>, ctc_helper::add<float, float>, float>(float, float, ctc_helper::exponential<float, float> const *, ctc_helper::exponential<float, float>*, int, int)
                     3.82%  751.52ms      1818  413.38us  2.2400us  3.5674ms  void reduce_rows<int=128, ctc_helper::identity<float, float>, ctc_helper::maximum<float, float>, float>(float, float, ctc_helper::identity<float, float> const *, ctc_helper::identity<float, float>*, int, int)
                     2.62%  516.30ms       505  1.0224ms  729.11us  1.5952ms  void compute_alpha_kernel<float, int=64, int=6>(float const *, int const *, int const *, int const *, int const *, int const *, int*, float*, float*, int, int, int, int, int)
                     0.27%  54.086ms       202  267.75us  242.11us  305.41us  void compute_alpha_kernel<float, int=32, int=1>(float const *, int const *, int const *, int const *, int const *, int const *, int*, float*, float*, int, int, int, int, int)
                     0.26%  51.140ms        18  2.8411ms  2.6791ms  3.8951ms  generate_seed_pseudo(unsigned __int64, unsigned __int64, unsigned __int64, curandOrdering, curandStateXORWOW*, unsigned int*)
                     0.05%  9.8540ms     10707     920ns     607ns  8.4160us  [CUDA memcpy HtoD]
                     0.04%  8.1747ms        18  454.15us  4.8640us  4.0550ms  void gen_sequenced<curandStateXORWOW, float, int, __operator_&__(float curand_uniform_noargs<curandStateXORWOW>(curandStateXORWOW*, int))>(curandStateXORWOW*, float*, unsigned long, unsigned long, int)
                    0.00%  1.0880us         1  1.0880us  1.0880us  1.0880us  [CUDA memset]

Clearly, CuDNN’s runtime is dominated by the gradient computation, something that isn’t visible in Warp CTC’s profile because Warp CTC fuses the beta and gradient computations into a single kernel.

I intend to dissect how CuDNN gets a 10x speedup, and whether this is the limit, at some point, but not today.

Dissecting CUDNN7’s CTC

Here is how to get a disassembled alpha-beta kernel. You need to get libcudnn 7.0.5.15 to work on the same code as mine. You can get it quickly (no login wall) by looking at these Dockerfiles for Ubuntu 16.04 (https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/9.0/devel/cudnn7/Dockerfile) and CentOS 7 (https://gitlab.com/nvidia/cuda/blob/centos7/9.0/devel/cudnn7/Dockerfile). The CentOS 7 Dockerfile downloads a generic .tar.gz which you could use on any operating system, in case you’re using something weird like SuSE Linux or Arch.

You also need to install my fork of maxas (https://github.com/galv/maxas). I used CUDA 9.0, but any recent version should work fine.

# Copy and unpack cuDNN's static library into its constituent object files.
cp /usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a .
ar x libcudnn_static_v7.a
# Pull the embedded cubins out of the CTC object file with cuobjdump
# (I'm assuming the extracted object is named ctc.o; adjust to whatever
# ar produced). This yields files named like ctc.<n>.sm_<arch>.cubin.
cuobjdump -xelf all ctc.o
# Disassemble the alpha/beta kernel with maxas.
maxas.pl --extract --kernel _Z24compute_alphas_and_betasIfEvPKT_PKiS4_S4_S4_S4_PiPS0_S6_S6_S6_iiii ctc.5.sm_61.cubin > ctc.5.sm_61.compute_alphas_and_betas_fix.msass

I am still studying the disassembled output. But it looks like CuDNN may owe its speedup to working in probability space, not log-probability space.