Glossary

  • gpu register - pretty much the same as a CPU register; each holds a 32-bit value (e.g. a floating point number) and is where the operands and results of loads, stores, and arithmetic instructions live
    • they hold the local data for each thread, things like temporary variables, program counters, etc
  • gpu thread - the smallest unit of execution; it operates on its own small piece of data and has its own registers
  • warp - group of (usually 32) threads that execute in lockstep
    • All threads in a warp execute the same instruction simultaneously on different data
    • Each thread within a warp has its own set of registers
  • block (threadBlock) - a 3d collection of threads
  • grid - a 3d collection of blocks
  • streaming multiprocessor (SM) - in short, the thing doing the computations on the GPU. There’s a double digit number of these on a given GPU, each of which has its own set of registers and can operate in parallel with the rest of the SMs.
  • occupancy - the ratio of active warps on an SM to the maximum number of warps the SM can support (loosely, how many threads are actually resident and runnable per SM)
  • CUDA kernel - a function that runs on the GPU; launching one hands the GPU pointers to the data to compute on plus the instructions for the computation.
    • A CUDA kernel is composed of threads, but these threads are organized hierarchically into blocks and warps.
      • By organizing it this way, we can distribute the workload from a kernel across multiple SMs
  • tensorflow operation - basically an IOU to create and execute a certain CUDA kernel

TODO motivate connection between a tf op, xla operation and cluster and cuda kernel

tl;dr: a “kernel” is some C/C++ code called on the CPU that runs on the GPU to do some operation like add, multiply, etc.
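To make the glossary concrete, here is a minimal, made-up example of a CUDA kernel: an element-wise add where each GPU thread handles exactly one element. The names (addKernel, d_a, d_b, d_out) are hypothetical; the indexing shows how threadIdx, blockIdx, and blockDim locate a thread within its block and grid.

// a minimal, hypothetical kernel: one thread per output element
__global__ void addKernel(const float* a, const float* b, float* out, int n) {
  // global index = which block we're in * threads per block + position within the block
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {                 // guard: the grid may be slightly bigger than the data
    out[i] = a[i] + b[i];      // each thread computes one tiny piece of the op
  }
}

// called from the CPU with a chosen block/grid layout, e.g.:
//   addKernel<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_out, n);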

Nvidia Terminology and GPU Mental Model

The GPU has gigabytes of shared RAM in a pool. Sitting around it are a number of multiprocessors, each awaiting instructions to run some parallel computation.

Kernel Launch to SM

SM to result

To execute a kernel within each SM, the following steps need to happen:

  • Resource Allocation: The SM allocates its resources (registers, shared memory) to the blocks it is executing.
  • Instruction Fetch and Decode: The SM fetches and decodes instructions from the kernel code.
  • Warp Scheduling: The SM’s warp scheduler selects warps to execute based on their readiness and available resources.
  • Execution: The SM’s execution units carry out the instructions for the active warps. This involves arithmetic operations, memory accesses, and synchronization instructions.
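To connect the warp-scheduling step above to actual thread indices: for a 1D block, consecutive threads are grouped into warps of 32, so a thread can compute which warp and lane it occupies directly from threadIdx. A small illustrative sketch (the kernel itself is made up):

#include <cstdio>

__global__ void whoAmI() {
  // threads 0-31 of a 1D block form warp 0, threads 32-63 form warp 1, and so on
  int warpInBlock = threadIdx.x / 32;   // which warp of this block the thread belongs to
  int lane        = threadIdx.x % 32;   // position within the warp (0..31)
  // all 32 threads sharing warpInBlock are issued each instruction together
  if (lane == 0) {
    printf("block %d, warp %d is active\n", blockIdx.x, warpInBlock);
  }
}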

To optimize our network (read network as “op graph”), we want to find the most problematic kernels, look at their specific profile (memory bandwidth, occupancy, time, etc), and figure out how to nudge XLA into compiling a more efficient kernel.

Anatomy of an A10

Let’s get more specific. An Nvidia A10 GPU (compute capability 8.6) has 24 GB of shared GPU memory, surrounded by 72 SMs.

When TensorFlow launches a CUDA kernel from the CPU, the GPU hardware scheduler doles out the kernel’s blocks to the SMs. Each SM can handle multiple blocks concurrently.

There are threads, arrays of threads called blocks, and arrays of blocks called grids.

A single kernel launch spawns a whole grid of threads, and every one of those threads runs the same kernel function on its own slice of the data. Parallelizing an op on the GPU means launching enough threads (organized into blocks) to cover all of the data.

Blocks can be made of a max of 1024 threads. You can synchronize the threads within a block, and the threads in a block share a chunk of fast on-chip memory (shared memory, carved out of the same physical storage as the L1 cache).

A streaming multiprocessor (SM) runs the CUDA kernels and handles one or several blocks. Each SM has its own register file, partitioned into registers that are private to each thread, as well as the aforementioned shared memory used by the threads of a block.

“Given that a Turing SM has 65536 registers (it really does, and there is no other number), this allows for 64 registers for each of 1024 threads. You could have two threadblocks of 512 threads each (64 registers per thread), but you could not have two threadblocks of 1024 threads each (regardless of register usage).

And if your GPU thread code used more than 64 registers per thread (quite possible) then you would not even be able to launch one 1024-thread threadblock per SM (you would get a runtime error at the kernel launch). You would have to reduce the total threads, in a single threadblock.” “a single threadblock (on Turing) can consume all the resources.” -https://stackoverflow.com/questions/75828037/connection-between-number-of-registers-in-thread-block-and-in-streaming-multipro
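You can ask the CUDA runtime how many registers per thread a compiled kernel actually uses, which tells you whether a given block size can even be resident. A rough sketch using a trivial stand-in kernel (cudaFuncGetAttributes is the real API; the kernel and numbers are just for illustration):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float* x) { x[threadIdx.x] *= 2.0f; }  // stand-in kernel

int main() {
  cudaFuncAttributes attr;
  cudaFuncGetAttributes(&attr, myKernel);   // ask the runtime about the compiled kernel
  printf("registers per thread: %d\n", attr.numRegs);
  printf("max threads per block for this kernel: %d\n", attr.maxThreadsPerBlock);
  // e.g. with 65,536 registers per SM, a kernel using 128 registers per thread
  // can keep at most 512 threads resident on one SM
  return 0;
}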

Executing a CUDA kernel

We want to get the most out of our CUDA kernel launches because they are expensive! Here are all the steps that happen every time we launch a CUDA kernel (see the host-side sketch below the list):

  • The host (CPU) copies data to GPU global memory
    • Involves allocating GPU memory and using CUDA functions like cudaMemcpy to transfer the data
  • TensorFlow creates the parameters for the CUDA kernel
    • Including pointers to the data in gpu memory, and the thread and block layout of the operation (more on this later)
  • TensorFlow launches the cuda kernel to the gpu driver
  • The GPU’s hardware scheduler decides how to distribute the blocks of threads across the available SMs
    • blocks can be executed on any SM, and multiple blocks can be assigned to an SM concurrently depending on workload size and SM availability

“To ensure that all GPU operations are complete before accessing the results, TensorFlow might use synchronization primitives like cudaDeviceSynchronize to wait for the kernel execution and memory transfers to finish.”
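Stripped of the TensorFlow layer, those steps look roughly like this in raw CUDA. This is a minimal sketch with a made-up kernel and sizes, not what TensorFlow literally emits:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scaleKernel(float* data, float factor, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

int main() {
  const int n = 1 << 20;
  const size_t bytes = n * sizeof(float);
  float* h = (float*)malloc(bytes);
  for (int i = 0; i < n; ++i) h[i] = 1.0f;

  // 1. allocate GPU global memory and copy the input over (host -> device)
  float* d;
  cudaMalloc(&d, bytes);
  cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

  // 2. choose the thread/block layout and launch the kernel
  const int threadsPerBlock = 256;
  const int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
  scaleKernel<<<numBlocks, threadsPerBlock>>>(d, 2.0f, n);

  // 3. the hardware scheduler distributes the blocks across SMs;
  //    wait for everything to finish, then copy the result back (device -> host)
  cudaDeviceSynchronize();
  cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

  printf("h[0] = %f\n", h[0]);   // expect 2.0
  cudaFree(d);
  free(h);
  return 0;
}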

GPU Architecture Tuning

Optimizing a Model

The name of the game here is to reduce kernel launch time as much as you can by fusing operations.

All else equal, we want minimal kernel launches that achieve maximum occupancy. We minimize kernel launches by fusing, and we achieve high occupancy by being able to run a bunch of threads in parallel.

The number of registers used per thread can affect how many threads can be active, which is what the occupancy represents! If each thread needs a ton of registers and the number of registers is limited, we won’t be able to run very many threads. Conversely, lower register usage can increase occupancy.

“balance register allocation per thread to maximize occupancy and performance while ensuring that each thread has sufficient resources to perform its computations efficiently.”
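The CUDA runtime can report this tradeoff directly: given a compiled kernel and a block size, cudaOccupancyMaxActiveBlocksPerMultiprocessor says how many blocks (and therefore warps) can be resident on one SM. A sketch with a stand-in kernel; the block size is an arbitrary choice for illustration:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void squareKernel(float* x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = x[i] * x[i];
}

int main() {
  const int blockSize = 256;
  int blocksPerSM = 0;
  // how many blocks of this size fit on one SM, given the kernel's register/shared-memory needs
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, squareKernel, blockSize, 0);

  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  int activeWarps = blocksPerSM * blockSize / prop.warpSize;
  int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
  printf("theoretical occupancy: %d of %d warps per SM\n", activeWarps, maxWarps);
  return 0;
}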

Benefits of Fusion

Intermediate results typically need to be written to and read from global memory (GPU DRAM), which is slower than on-chip memory. Fusing operations allows these intermediate results to stay in faster memory, such as registers or shared memory.

Each kernel launch has an associated overhead, including scheduling, dispatching, and synchronization. By fusing operations, this overhead is incurred only once instead of multiple times.
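A toy illustration of what fusion buys (made-up kernels, not what XLA actually generates): computing y = relu(a*x + b) as two kernels forces the intermediate a*x + b out to global memory and back, while the fused version keeps it in a register the whole time and pays the launch overhead once.

// unfused: two launches, the intermediate result round-trips through global memory (GPU DRAM)
__global__ void axpb(const float* x, float* tmp, float a, float b, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) tmp[i] = a * x[i] + b;      // write intermediate out to DRAM
}
__global__ void relu(const float* tmp, float* y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = fmaxf(tmp[i], 0.0f); // read it back from DRAM
}

// fused: one launch, the intermediate lives in a register the whole time
__global__ void axpb_relu(const float* x, float* y, float a, float b, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float t = a * x[i] + b;   // stays in a register
    y[i] = fmaxf(t, 0.0f);
  }
}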

If you overfuse, register pressure grows and registers can spill into (slow) local memory. If you underfuse, kernel launch overhead starts to overtake the actual on-device compute time.

Kernel launch time to compute ratio

Kernel occupancy

To parallelize a matmul, consider which steps can be done independently. Each output element is a dot product: element-wise multiplication of two vectors followed by a sum of the products. We need all the products in one location to sum them, but computing those products can be split up, since each element-wise multiplication is independent.
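A sketch of that split for the dot-product piece (illustrative only, assumes a 256-thread block): each thread computes one product independently, the threads of a block combine their products through shared memory, and each block emits one partial sum that still needs a final combine.

__global__ void dotKernel(const float* a, const float* b, float* partial, int n) {
  __shared__ float cache[256];                    // one slot per thread in the block
  int i   = blockIdx.x * blockDim.x + threadIdx.x;
  int tid = threadIdx.x;

  // independent part: each thread computes its own product
  cache[tid] = (i < n) ? a[i] * b[i] : 0.0f;
  __syncthreads();                                // wait until every product is in shared memory

  // dependent part: tree reduction within the block
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (tid < stride) cache[tid] += cache[tid + stride];
    __syncthreads();
  }
  if (tid == 0) partial[blockIdx.x] = cache[0];   // one partial sum per block
}

// launched with blockDim.x == 256; the per-block partial sums are combined on the host
// (or by a second, tiny kernel) to get the final dot product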


TODO: relationship between a TensorFlow kernel, a BLAS kernel, and a CUDA kernel.

“How do I choose grid and block dimensions for CUDA kernels?” -https://stackoverflow.com/questions/9985912/how-do-i-choose-grid-and-block-dimensions-for-cuda-kernels

https://stackoverflow.com/questions/75828037/connection-between-number-of-registers-in-thread-block-and-in-streaming-multipro

Hardware Constraints Nvidia A10

TODO: get a good picture showing a thread with 255 registers, each holding a tiny bit of data. Get another picture showing how threads are collected into blocks. TODO: make sure the explanation below uses the correct numbers.

Constraints related to occupancy (https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html#occupancy):

  • 100 KB of shared memory per SM
  • 99 KB max shared memory per thread block
  • max 16 thread blocks per SM
  • max 255 registers per thread
  • “The register file size is 64K 32-bit registers per SM” (i.e. 65,536 registers)

Let’s simplify a bit. For the A10 and other similar Nvidia Ampere GPUs:

  • there are 72 streaming multiprocessors (SMs), positioned around a shared L2 cache on the GPU. Each SM can schedule, dispatch, and execute operations in parallel with the others. All else equal, the more SMs, the more parallelizability.
  • each SM has a fixed number of registers - 65,536 to be exact, each holding a 32-bit value. An A10 SM can have up to 1536 threads resident at once (at most 1024 of them from any single block), and each thread gets its own slice of the register file. A full 1024-thread block therefore only fits if each thread uses 64 or fewer registers (65,536 / 1024 = 64); the more registers each thread needs, the fewer threads can be resident.
  • collections of threads are called threadBlocks (blocks)
  • block - max 1024 threads
    • threads can be laid out in 1D, 2D, or 3D within a block (the layout is chosen at kernel launch time, usually to mirror the shape of the data)
    • max dimension of each block is (1024, 1024, 64). Note that the product of the dimensions still must be less than or equal to 1024. For example, a thread block could be (8, 8, 16) or (4, 4, 64) or (1, 1024, 1) etc.
    • blocksize refers to the number of threads per block. The optimal number can be found using the hardware constraints + trial and error (affectionately called performance tuning); see the sketch after this list.
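In practice you pick the block size and derive the grid size from the data so every element gets a thread. A sketch for a hypothetical 2D op of size rows × cols (the kernel and its work are placeholders):

__global__ void myKernel2D(float* out, int rows, int cols) {
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  if (row < rows && col < cols) out[row * cols + col] = 0.0f;  // placeholder work
}

void launch(float* d_out, int rows, int cols) {
  dim3 threadsPerBlock(16, 16);   // 256 threads per block, comfortably under the 1024 cap
  // round up so the grid covers every element even when rows/cols aren't multiples of 16
  dim3 numBlocks((cols + threadsPerBlock.x - 1) / threadsPerBlock.x,
                 (rows + threadsPerBlock.y - 1) / threadsPerBlock.y);
  myKernel2D<<<numBlocks, threadsPerBlock>>>(d_out, rows, cols);
}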

cuda best practices - https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html#cuda-best-practices

Factors influencing CUDA kernel launches

TensorCore and CUDA Cores

“Tensor Cores are activated when certain parameters of a layer are divisible by 8 (for FP16 data) or 16 (for INT8 data).” 1 32-bit CUDA cores are the fallback if a layer is not eligible to be run on Tensor Cores.

“GPUs perform many computations concurrently; we refer to these parallel computations as threads.” 1 “Now consider a fully-connected layer. During training, forward propagation, activation gradient calculation, and weight gradient calculation are each represented as a matrix multiply. The GPU divides the output matrix into uniformly-sized, rectangular tiles. Each tile is computed by a thread block…assume one thread block per tile”
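In kernel terms, that tiling is just the block index: each thread block owns one rectangular tile of the output, and each thread owns one element of that tile. A hand-wavy sketch of the mapping (real GEMM kernels are far more elaborate; the tile size here is arbitrary):

#define TILE 16   // hypothetical tile edge; real libraries pick this per architecture

__global__ void tiledOutputSketch(float* C, int M, int N) {
  // blockIdx picks the tile of the output matrix this block is responsible for
  int tileRow = blockIdx.y * TILE;
  int tileCol = blockIdx.x * TILE;
  // threadIdx picks the element of that tile this thread is responsible for
  int row = tileRow + threadIdx.y;
  int col = tileCol + threadIdx.x;
  if (row < M && col < N) {
    C[row * N + col] = 0.0f;  // placeholder: a real kernel accumulates A-row · B-col here
  }
}
// launched with dim3 block(TILE, TILE) and dim3 grid(ceil(N/TILE), ceil(M/TILE))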

Misc

We can also parallelize within any given multiprocessor by splitting matrix multiplications up into independent pieces.

-https://www.sciencedirect.com/topics/computer-science/kernel-execution

“Once a grid is launched, its blocks are assigned to SMs in arbitrary order”

“threads in different blocks cannot synchronize with each other. To allow a kernel to maintain transparent scalability, the simple way for threads in different blocks to synchronize with each other is to terminate the kernel and start a new kernel for the activities after the synchronization point.”
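In code, that “terminate and relaunch” pattern is just two back-to-back launches on the same stream: launches on a stream execute in order, so the kernel boundary acts as the grid-wide barrier. A sketch with made-up phase kernels:

__global__ void phaseOneKernel(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.0f;            // each block updates its own chunk
}
__global__ void phaseTwoKernel(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;            // safely sees every write from phase 1
}

void run(float* d_data, int n) {
  int threadsPerBlock = 256;
  int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
  phaseOneKernel<<<numBlocks, threadsPerBlock>>>(d_data, n);
  // the kernel boundary is the grid-wide sync point: launches on the same (default)
  // stream execute in order, so phase 2 starts only after every block of phase 1 finishes
  phaseTwoKernel<<<numBlocks, threadsPerBlock>>>(d_data, n);
}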

https://stackoverflow.com/questions/71507046/what-is-the-difference-between-maximum-number-of-threads-per-multiprocessor-an?rq=1

“A SM (multiprocessor) is a hardware entity. A threadblock is a software entity, basically a collection of threads.

A SM or multiprocessor can have more than 1 block resident. To get full occupancy of an SM that had 1536 max threads, you would need to have something like three 512-thread blocks resident.”

“And an additional question is the relationship between thread and core, is it correct to match thread = core?

A thread represents a sequence of instructions. A “core” in GPU speak is a functional unit in the SM which processes certain instruction types, namely 32-bit floating point add, multiply, and multiply-add instructions. Other instruction types are handled by other (kinds of) functional units in the SM.

A thread will require a core when it has one of those 32-bit floating point instruction types to process. If it happens to have a different instruction to process, say a LD (load) instruction, it will require a different functional unit, specifically, a LD/ST (load/store) unit in that case/example.”

https://stackoverflow.com/questions/12078080/max-number-of-threads-which-can-be-initiated-in-a-single-cuda-kernel?rq=1

“You can launch a grid of up to 65535 x 65535 x 65535 blocks, and each block has a maximum of 1024 threads per block, although per thread resource limitation might restrict the total number of threads per block to less than this maximum.”

“If you choose an illegal execution configuration (so incorrect block size or grid size) the kernel will not launch and the runtime will issue a cudaErrorInvalidConfiguration error message. You can use the standard cudaPeekAtLastError() and cudaGetLastError() to check the status of any kernel launch.”
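A sketch of that error check, deliberately using an illegal block size so the launch fails (names are arbitrary):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
  noop<<<1, 2048>>>();                       // 2048 > 1024 threads per block: illegal config
  cudaError_t err = cudaGetLastError();      // the launch itself fails, no kernel ever runs
  if (err != cudaSuccess) {
    printf("launch failed: %s\n", cudaGetErrorString(err));  // typically "invalid configuration argument"
  }
  return 0;
}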

https://stackoverflow.com/questions/2392250/understanding-cuda-grid-dimensions-block-dimensions-and-threads-organization-s

Want as many threads per block as possible

“blocks are logical, the limit of 768 threads is for each physical processing unit” “blocks are logical, but each block is assigned to a core. if there are more blocks than cores, the blocks are queued until cores become free.”

“The kernel is launched like this:

myKernel <<<numBlocks,threadsPerBlock>>>( /* params for the kernel function */ );

__syncthreads is a block-wide operation and it does not synchronize all threads.”

Questions

  • what exactly does the cuda kernel signature in the TB trace viewer show?
  • what is the relationship between the number of threads in a block and the dimension? is it as simple as (num_threads, num_threads, ?)
    • Answer: the dimensions constrain how the threads are laid out, but no matter how they are laid out there can never be more than 1024 threads in a block.
  • “If threads in a warp diverge (e.g., due to branching), the execution is serialized, which can impact performance” - what does this mean for focalnets that have branching? does xla handle this or does the branching get pushed down to the thread level? what are the main sources of branching in the network and can i eliminate them?

  • what is the 1kb discrepancy between shared memory of SM and shared memory of thread block?

  • what does fusing ops do to the kernel that ends up running? what exactly are we fusing? are we fusing the kernel launches so that we don’t need to write intermediate results back to GPU DRAM (or is it the L2 cache, or the L1/shared memory that the threads in a block share)?

  • What is warp occupancy - https://docs.nvidia.com/nsight-compute/NsightCompute/index.html#occupancy-calculator

References

Resources

https://stackoverflow.com/questions/43906131/what-are-the-factors-that-affect-cuda-kernels-launch-time

https://tigress-web.princeton.edu/~jdh4/TensorflowPerformanceOptimization_GTC2021.pdf

https://parcorelab.ku.edu.tr/wp-content/uploads/2020/09/MasterTez.pdf

https://arxiv.org/pdf/1811.05213

“Hands on NSight” https://www.cisl.ucar.edu/sites/default/files/2022-06/10_HandsOnNsight_ncu.pdf

“Using Nsight Compute to Inspect your Kernels” https://developer.nvidia.com/blog/using-nsight-compute-to-inspect-your-kernels/

“Matrix Multiplication Background User’s Guide” https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html

“GPU Architecture Fundamentals” https://docs.nvidia.com/deeplearning/performance/dl-performance-gpu-background/index.html#gpu-arch

“Achieved Occupancy” https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm

Roofline Analysis: https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9624-performance-analysis-of-gpu-accelerated-applications-using-the-roofline-model.pdf

  1. https://developer.nvidia.com/blog/optimizing-gpu-performance-tensor-cores/  2