Passing the NVIDIA Software Engineering Interview
Around this time last year, I went through the NVIDIA software engineering interview process. Surprisingly, there is not a lot of information out there on how to pass this interview, which is not as cut-and-dried as at companies following the Cracking the Coding Interview industry standard. While I did not accept the job offer, I came away with a great impression of the company. It is one of the few publicly traded semiconductor companies that maintains an incredibly collaborative culture and leads with brains. As an anecdote, ex-Intel employees recruited to NVIDIA are called “Intel survivors”.
NVIDIA may be less sexy after the stock fell recently, but it is still a company where a lot of smart people work, and a great career move for ambitious people looking to grow at the intersection of computer architecture and machine learning. I hope this information is helpful to others.
I got in touch with NVIDIA recruiters at one of the big machine learning conferences at which I was presenting a paper, where they wined and dined us at a fancy steak house in the penthouse of a beach-side building. Jensen Huang also made an appearance and talked about how when he founded NVIDIA, his team created an “autonomous” machine, a vector processor on desktops with no connection to the internet. Naturally, he handed out a Titan X to one lucky attendee. A month later in January, a recruiter reached out and I started the pipeline. I did not get an offer until around April that year.
How NVIDIA interviews
Each department at NVIDIA decides how to interview individually.
In general, you can expect the following format, though:
- Phone screen with recruiter
- Two phone interviews, one of which may be with the hiring manager
- One on-site, typically all day or half a day
- (Sometimes) A final interview with the VP of the division you are applying to, for culture fit and a final okay.
I interviewed for two separate positions, Computer Vision Engineer and AI Developer Technology (DevTech) Engineer. The former is a “library” position, developing the C, C++, and CUDA libraries that NVIDIA distributes to developers for its hardware. The latter is NVIDIA’s somewhat skunkworks group of engineers who interact directly with customers with unmet needs. I would highly recommend working as a DevTech. It looks like a great opportunity for building your brand in the community, and you get to be exposed to a lot of parts of the business. Many of the NVIDIA-employed open source contributors were DevTechs, with the exception of the recently formed “frameworks team”, which is composed of engineers supporting NVIDIA’s software stack in the major deep learning frameworks.
NVIDIA has the hardest interview in terms of sheer knowledge required of any tech company I have interviewed at. At a minimum, you must understand:
- Memory access latencies for the L1 and L2 caches on a fairly modern CPU.
- You can also impress interviewers with an in-depth understanding of the microarchitecture of NVIDIA’s GPUs. The best resources for this are the two reverse engineering efforts by Citadel Research, for Volta and Turing. After the talk on the Turing paper at GTC this year, the NVIDIA employee hosting the talk offered the presenting author a fancy lunch at a nearby hotel in San Jose.
- How to implement GEMM in CUDA.
- This example demonstrates “blocking”, which you absolutely need to know.
- I studied Maxas by Scott Gray, an improvement upon the above example, which has all sorts of microarchitectural gems like the concept of register bank conflicts. However, this took me days to understand.
- At this point, it is probably better to study CUTLASS, a template library home-grown by NVIDIA and used within their other libraries like CUDNN. Its documentation is better written than Scott’s and it is newer. Plus, it is written by people internal to NVIDIA.
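To make the blocking idea concrete, here is a minimal pure-Python sketch (the function name and tile size are my own; a real CUDA kernel would stage these tiles in shared memory and registers, but the data-reuse pattern is the same):

```python
def tiled_gemm(A, B, tile=2):
    """Multiply A (n x k) by B (k x m) with blocking.

    Each (i0, j0) output tile stays "resident" while tiles of A and B
    sweep across the k dimension -- the reuse that a CUDA kernel gets
    by staging tiles in shared memory.
    """
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # output tile row
        for j0 in range(0, m, tile):      # output tile column
            for k0 in range(0, k, tile):  # sweep the shared dimension
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        for kk in range(k0, min(k0 + tile, k)):
                            C[i][j] += A[i][kk] * B[kk][j]
    return C
```

The point of the exercise is the loop structure, not the Python: each element of A and B is touched once per output tile rather than once per output element.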
- Machine Learning 101 and Deep Learning 101.
- I found this to be easier than most things. Basically, one of the engineers who worked on deep learning frameworks at NVIDIA asked me about things like how SVMs work, how backprop works, and so on. It was fairly casual. At one point, I was asked about regularization, and I mentioned that people tend to use dropout and batch normalization, but that I tend to use dropout. I was then asked why I did not use batch norm, and I said “just cuz”. Really, I had no good reason. Good thing that deep learning is still an art.
- Why are GPUs effective at deep learning workloads?
- This is a “spot check” I was asked in a phone screen.
- It is surprising how few people know why, in spite of all the hype.
Not only can GPUs do parallel operations; they have far more compute throughput than memory bandwidth. As an extreme case, the V100 delivers 125 TFLOPS through its tensor cores, but only 900 GB/second of theoretical memory bandwidth (the achieved bandwidth is lower), using state-of-the-art HBM2 memory, which is designed by only two companies (SK Hynix and another one I’ve forgotten). This means that you need to do, at minimum, 278 (= 125e12 / 450e9) floating point operations (or 139 FMAs) on every 2-byte FP16 element you load to fully utilize the tensor cores. This makes GPUs well suited to operations like matrix multiplies and convolutions, which do many floating point operations on each element of memory brought in. If you’ve ever wondered why sparse training of neural networks is hard, this has something to do with it: multiplying by a sparse matrix is a memory-bandwidth-bound operation.
You can read more about this here: http://www.cudahandbook.com/2017/10/dont-move-the-data/ If this concerns you deeply (and it probably should), it is worthwhile to consider the much higher-bandwidth chips offered by upstarts like Graphcore and Cerebras; they are hiring, and an NVIDIA offer will make you attractive in their eyes. I suspect off-chip communication bandwidth concerns are a big reason why NVIDIA purchased Mellanox recently. Integrated solutions are becoming more important for utilizing accelerators fully.
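The back-of-envelope arithmetic above can be written as a quick helper (the function name is mine; the numbers are the V100 figures quoted in the text):

```python
def flops_per_element(peak_flops, mem_bw_bytes, elem_bytes):
    """Minimum FLOPs per loaded element needed to keep the compute
    units busy -- the break-even arithmetic intensity."""
    elements_per_sec = mem_bw_bytes / elem_bytes
    return peak_flops / elements_per_sec

# V100: 125 TFLOPS tensor cores, 900 GB/s HBM2, 2-byte FP16 elements.
ops = flops_per_element(125e12, 900e9, 2)  # ~278 FLOPs, i.e. ~139 FMAs
```

Any kernel doing fewer operations than this per element loaded is memory-bandwidth bound, no matter how fast the tensor cores are.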
- How to interact with an asynchronous computing device. Suppose you have an accelerator to which you submit tasks 1, 2, and 3. However, the device may finish these tasks in any order. How do you make sure that the user of the device receives the completed tasks’ outputs in the order in which they were submitted?
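One standard answer is a reorder buffer keyed by a monotonically increasing sequence number. A toy sketch (the class and method names are my own invention, not any real driver API):

```python
class ReorderBuffer:
    """Deliver task results in submission order even when the
    device completes them out of order."""
    def __init__(self):
        self.next_id = 0    # next sequence number to hand out
        self.next_out = 0   # next sequence number to deliver
        self.pending = {}   # finished but not-yet-deliverable results

    def submit(self):
        """Assign a sequence number at submission time."""
        tid = self.next_id
        self.next_id += 1
        return tid

    def complete(self, tid, result):
        """Called on device completion; returns every result that is
        now deliverable, in submission order (possibly none)."""
        self.pending[tid] = result
        out = []
        while self.next_out in self.pending:
            out.append(self.pending.pop(self.next_out))
            self.next_out += 1
        return out
```

A result is held back until all earlier-numbered tasks have completed, so the user always sees outputs in submission order.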
- General CUDA programming competency is useful. The best book on this (even though it is dated) is Nicholas Wilt’s The CUDA Handbook. He was one of the three original creators of CUDA, and his book and blog provide historical perspective on what has led NVIDIA to its current position.
- If you want an architectural understanding of GPUs, you can read the GPU section in Computer Architecture: A Quantitative Approach. In my fifth edition of the book, it is section 4.4. It was written in collaboration with some of the lead GPU architects at NVIDIA.
- What is memory oversubscription in operating systems, and how do you know if your application is being adversely affected by it? How do GPUs provide the ability to oversubscribe memory?
- How do page tables work? Modern GPUs (I believe anything Maxwell generation or later, so not K80s) use page tables, which allow them to do cool things like accessing host memory and memory on other GPUs via NVLink. This is done by generating page faults.
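As a toy illustration of the translation step (a single-level table in Python, with all names mine; real page tables are multi-level and the fault would be serviced by the driver, e.g. by migrating the page):

```python
PAGE_SIZE = 4096

class PageFault(Exception):
    """Raised when a virtual page has no physical mapping."""

def translate(page_table, vaddr):
    """Translate a virtual address to a physical one via a toy
    single-level page table {virtual page number: physical frame}."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in page_table:
        raise PageFault(vpn)  # fault handler would map the page here
    return page_table[vpn] * PAGE_SIZE + offset
```

The key idea is that the fault is not an error: it is the hook that lets the system map host memory, remote GPU memory, or oversubscribed memory on demand.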
- The most practical OS resource that I found was Robert Love’s Linux Kernel Development. It fills in a lot of the practical knowledge that my Operating Systems class pretended did not matter.
- What is all-reduce? How would you implement it on the NVIDIA DGX-1 topology? This question probably trips up most people, since MPI was a dying art until data parallelism recently brought it back.
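For reference, here is a simulated ring all-reduce in plain Python (the DGX-1’s hybrid cube-mesh topology allows fancier schedules, but the ring is the classic building block; all names here are mine):

```python
def ring_allreduce(worker_data):
    """Simulated ring all-reduce over n workers.

    Phase 1 (reduce-scatter): after n-1 steps, worker w holds the
    fully reduced chunk (w + 1) % n.  Phase 2 (allgather): the
    reduced chunks circulate until every worker has all of them.
    Assumes each vector's length is divisible by n.
    """
    n = len(worker_data)
    chunk = len(worker_data[0]) // n
    bufs = [list(v) for v in worker_data]

    def step(chunk_of, reduce):
        # Snapshot all outgoing chunks first, then deliver, so the
        # "simultaneous" sends of one step don't interfere.
        msgs = []
        for w in range(n):
            c = chunk_of(w)
            msgs.append(((w + 1) % n, c, bufs[w][c * chunk:(c + 1) * chunk]))
        for dst, c, data in msgs:
            for i, v in enumerate(data):
                if reduce:
                    bufs[dst][c * chunk + i] += v
                else:
                    bufs[dst][c * chunk + i] = v

    for s in range(n - 1):                        # reduce-scatter
        step(lambda w: (w - s) % n, reduce=True)
    for s in range(n - 1):                        # allgather
        step(lambda w: (w + 1 - s) % n, reduce=False)
    return bufs
```

Each worker sends and receives only 2(n-1) chunks regardless of n, which is why the ring is bandwidth-optimal for large messages.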
- N-body problem. Like GEMM and convolution, N-body is well suited to GPUs. Make sure that you know how to write an efficient implementation. The CUDA Handbook has a chapter on this.
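A minimal O(N²) all-pairs sketch in Python, to show the access pattern (2-D, unit G, softened to avoid division by zero; a GPU version would tile bodies through shared memory, and all names here are mine):

```python
def nbody_accels(pos, mass, G=1.0, eps=1e-9):
    """Naive all-pairs gravitational accelerations.

    Each body's acceleration is independent of the others' results,
    which is exactly why this maps so well onto GPU thread blocks.
    """
    n = len(pos)
    acc = []
    for i in range(n):
        ax = ay = 0.0
        for j in range(n):
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r2 = dx * dx + dy * dy + eps  # softening term
            inv_r3 = r2 ** -1.5
            ax += G * mass[j] * dx * inv_r3
            ay += G * mass[j] * dy * inv_r3
        acc.append((ax, ay))
    return acc
```

Note the arithmetic intensity: roughly 20 FLOPs per body pair against a handful of loads, which is why N-body saturates compute rather than memory bandwidth.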
- General algo knowledge. If you’ve interviewed at other tech companies, you should know what I’m talking about. NVIDIA won’t ask you for another silly variant on depth-first search. Instead, I was asked how to implement an efficient breadth-first search on GPUs, which involved deriving it via inducing a semi-ring on a sparse matrix-dense vector multiply, and then discussing how to deal with dynamic sparsity in my graph, represented as a CSR adjacency matrix, via the MergePath algorithm. One of the craziest interviews I’ve ever done!
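The core of the linear-algebra view is that one BFS level is a sparse matrix-vector product over the boolean (OR, AND) semiring. A small Python sketch over a CSR adjacency matrix (the function signature is my own, and this omits the MergePath load-balancing part entirely):

```python
def bfs_levels(indptr, indices, src, n):
    """BFS by repeated semiring SpMV steps.

    (indptr, indices) is a CSR adjacency matrix over n vertices;
    returns each vertex's BFS level from src, or -1 if unreachable.
    Expanding the frontier's rows is one SpMV over (OR, AND);
    skipping visited vertices masks the product with the complement
    of the visited set.
    """
    level = [-1] * n
    level[src] = 0
    frontier = [src]
    depth = 0
    while frontier:
        depth += 1
        nxt = []
        for u in frontier:
            for v in indices[indptr[u]:indptr[u + 1]]:
                if level[v] == -1:
                    level[v] = depth
                    nxt.append(v)
        frontier = nxt
    return level
```

On a GPU the inner loops become the SpMV kernel, and the load-balancing problem (rows of wildly different lengths) is where something like MergePath comes in.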
- Cache-aware algorithms. Try finding all the connected components (areas occupied by the same number) in a 2D matrix of integers, assigning each region a unique integer. Answer: DFS. Now try finding the connected components when you can load only two rows of that row-major matrix into your L1 cache at a time. Not so easy.
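One way to do it under that constraint is classic two-pass connected-component labeling with a union-find, which only ever needs the current and previous rows resident during a pass. A sketch (names are mine; 4-connectivity assumed):

```python
def label_components(grid):
    """Label equal-valued connected regions (4-connectivity) while
    streaming the matrix two rows at a time: two-pass labeling with
    a union-find over provisional labels."""
    parent = []

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    h, w = len(grid), len(grid[0])
    labels = [[0] * w for _ in range(h)]
    for i in range(h):  # pass 1: only rows i-1 and i are ever touched
        for j in range(w):
            up = labels[i - 1][j] if i and grid[i - 1][j] == grid[i][j] else None
            left = labels[i][j - 1] if j and grid[i][j - 1] == grid[i][j] else None
            if up is None and left is None:
                parent.append(len(parent))       # brand-new region
                labels[i][j] = len(parent) - 1
            elif up is None:
                labels[i][j] = left
            elif left is None:
                labels[i][j] = up
            else:
                labels[i][j] = up
                union(up, left)                  # regions merge here
    # pass 2: collapse provisional labels to roots, renumber 0..k-1
    remap = {}
    for i in range(h):
        for j in range(w):
            labels[i][j] = remap.setdefault(find(labels[i][j]), len(remap))
    return labels
```

The union-find is tiny (one entry per provisional label), so it stays cache-resident even though the matrix does not.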
- Culture fit: NVIDIA is a company of doers with relatively little hierarchy. They like people who are open-minded and willing to do things they haven’t done before. They like people who can learn on the job (basically a necessity). People there are very pedantic, and leaders tend to have a technical perspective on things. If this sounds like you, you will pass the cultural part of the interview just fine.
- Bonus: Understand how to implement fast convolution via Winograd and implicit GEMM, and fast RNNs via persistent kernels. I was really concerned that an interviewer would ask me how to implement convolution on a GPU, but it was never asked. Probably, most interviewers don’t know how to implement it efficiently because it is so hard! If you know this information anyway, it will help you understand the cutting edge of high performance computing for deep learning, so it is still probably a good idea.
The Offer and Beyond
NVIDIA will give you a verbal offer and expect you to commit to it on the spot, or otherwise will give you only a few days to decide. I was explicitly told that I would get a written offer only if I said yes, and it was assumed that saying yes meant I would 100% take the offer. I actually extended my offer deadline by two weeks after a lot of back and forth with the recruiter (I really wasn’t ready to decide at all!). The initial offer was also quite poor, which I was able to bring up quite a bit with the tactics in the book Never Split the Difference. Even then, though, there were weird tactics, like prorating the signing bonus over two years instead of one so that part of it could be clawed back if I left before two years were up. Be careful.
I actually think that this was a poor end to an otherwise great candidate experience. I considered it pretty rude to be expected to decide with almost no time. All the other established tech companies I’ve worked with give a written offer with a reasonable deadline. But just remember: if you say no politely, you can always revisit the offer at a future time!