ncu

Here is an example call to NVIDIA's Nsight Compute profiler (ncu): /usr/local/cuda-11.2/nsight-compute-2020.3.0/target/linux-desktop-glibc_2_11_3-x64/ncu --export "/home/timothy_holdsworth/Documents/NVIDIA Nsight Compute/dummy-xla" --force-overwrite --target-processes application-only --replay-mode kernel --kernel-regex-base function --launch-skip-before-match 0 --section LaunchStats --sampling-interval auto --sampling-max-passes 1 --sampling-buffer-size 33554432 --profile-from-start 1 --cache-control all --clock-control base --apply-rules no --import-source no --launch-skip 1500 --check-exit-code yes le_cml/bin/python3.8 mains/train_endtoend.py --execution_mode "local" --config "rapid_endtoend_test"

ncu args

To control which data you are looking for in your profile:

  • --section LaunchStats - shows how kernels are launched
  • --section Occupancy - shows how effectively kernels are using the GPU
  • --section SpeedOfLight - shows how close you are to hitting the GPU’s theoretical limits, which helps identify whether a kernel is memory bound or compute bound
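
For example, a pared-down call collecting just these three sections might look like the following sketch (the python binary is a placeholder for whatever your environment provides; --section can be repeated to collect several sections in one run):

    ncu --section LaunchStats --section Occupancy --section SpeedOfLight python3 mains/train_endtoend.py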

To control which data you profile:

  • --launch-skip - how many kernel launches to skip before profiling begins
  • --sampling-max-passes - how many times you profile the same kernel; the default is 8
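
In the example call above, --launch-skip 1500 jumps past the first 1500 launches (presumably graph building and warm-up) so profiling starts on steady-state training steps. A sketch of just these flags, with the same placeholder script:

    ncu --launch-skip 1500 --sampling-max-passes 1 python3 mains/train_endtoend.py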

To control where and how the profile data is saved:

  • --export - path to write the profiling report to
  • --force-overwrite - overwrite the report if one already exists at the export path
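
Putting the output flags together (the report name is a placeholder; recent ncu versions append the .ncu-rep extension to the name you give):

    ncu --export ./dummy-xla --force-overwrite --section LaunchStats python3 mains/train_endtoend.py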

Get familiar with NVIDIA terminology

Mainly threads and blocks: https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/#:~:text=Figure%201%20shows%20that%20the,function%20executed%20on%20the%20GPU.
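
As a quick refresher, here is a minimal CUDA sketch of the model those terms describe (the kernel and the 256-thread block size are made up for illustration): a launch spawns a grid of blocks, each block contains threads, and each thread derives its own global index.

    // Each thread handles one array element.
    __global__ void scale(float* data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) data[i] *= factor;
    }

    // Launch enough 256-thread blocks to cover n elements:
    // scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);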

NVIDIA kernel names

  • FillPhiloxRandomKernelLaunch - random-number generation for nodes like dropout, or in our case for initializing weights
  • Sub_GPU_DT_FLOAT_DT_FLOAT_ker - element-wise subtraction on floating-point tensors
  • Mul_GPU_DT_FLOAT_DT_FLOAT_ker - element-wise multiplication on floating-point tensors. Think applying the weights on the forward pass and propagating gradients on the backward pass. This is also the kernel that applies dropout after the FillPhiloxRandomKernelLaunch kernel determines which units to drop
  • AddV2_GPU_DT_FLOAT_DT_FLOAT_k - element-wise addition on floating-point tensors, e.g. adding the bias to the result of a matmul
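
For intuition, an element-wise kernel like the Mul one above is conceptually just the following (a hand-written sketch, not TensorFlow's actual implementation):

    // One thread per element: out[i] = a[i] * b[i].
    __global__ void mul_f32(const float* a, const float* b, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = a[i] * b[i];
    }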

Achieved Occupancy

https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm

Basically, the number of warps actually being run divided by the maximum number of warps that could be run in theory. A warp is a group of 32 threads that get executed on the same SM together. “Each block of a kernel launch gets distributed to one of the SMs for execution…the number of blocks which can execute concurrently on an SM is limited”, and “the upper limit for active warps can be raised by increasing the number of warps per block (defined by block dimensions), or by changing the factors limiting how many blocks can fit on an SM to allow more active blocks.”

In other words, occupancy tells you how much of the GPU’s warp capacity the current configuration of threads and blocks is using. Crucially, the actual metric we care about is the batch execution time! Low occupancy can help us notice that the current configuration is not ideal, but high occupancy does not necessarily mean the configuration is good either.

“An early step of kernel performance analysis should be to check occupancy and observe the effects on kernel execution time when running at different occupancy levels.”
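
One way to do that check from code is the CUDA occupancy API; a minimal sketch (the saxpy kernel and the 128-thread block size are made up for illustration):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Stand-in kernel: any kernel under analysis works here.
    __global__ void saxpy(float a, const float* x, float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int blockSize = 128;  // 128 threads = 4 warps per block
        int blocksPerSM = 0;
        // How many blocks of this kernel fit on one SM, given its
        // register/shared-memory usage and the chosen block size?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, saxpy, blockSize, 0);

        int warpsPerBlock = blockSize / prop.warpSize;
        int activeWarps = blocksPerSM * warpsPerBlock;
        int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
        printf("theoretical occupancy: %d/%d warps = %.0f%%\n",
               activeWarps, maxWarps, 100.0 * activeWarps / maxWarps);
        return 0;
    }

Nsight Compute’s Occupancy section reports this theoretical ceiling alongside the achieved occupancy measured while the kernel runs.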

Questions

  • What is warp occupancy - https://docs.nvidia.com/nsight-compute/NsightCompute/index.html#occupancy-calculator

Resources

https://stackoverflow.com/questions/43906131/what-are-the-factors-that-affect-cuda-kernels-launch-time

https://tigress-web.princeton.edu/~jdh4/TensorflowPerformanceOptimization_GTC2021.pdf

https://parcorelab.ku.edu.tr/wp-content/uploads/2020/09/MasterTez.pdf

https://arxiv.org/pdf/1811.05213

“Hands on Nsight” https://www.cisl.ucar.edu/sites/default/files/2022-06/10_HandsOnNsight_ncu.pdf

“Using Nsight Compute to Inspect your Kernels” https://developer.nvidia.com/blog/using-nsight-compute-to-inspect-your-kernels/

“Matrix Multiplication Background User’s Guide” https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html

“GPU Architecture Fundamentals” https://docs.nvidia.com/deeplearning/performance/dl-performance-gpu-background/index.html#gpu-arch

“Achieved Occupancy” https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm

Roofline Analysis: https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9624-performance-analysis-of-gpu-accelerated-applications-using-the-roofline-model.pdf