This is a glossary I made for myself while getting a lay of the land. Hopefully it helps you too!

Python & General Definitions

There are a few different compilation paths happening when you run a TensorFlow model, including paths for the Python code itself and for the TensorFlow code. More on that in another post, XLA for Dummies. For now, we will stick to definitions.

Host: The CPU. With TensorFlow, the host is responsible for transferring data between the CPU and GPU and for launching CUDA kernels on the GPU.

Device: The GPU (if available). In TensorFlow, the device executes the CUDA kernels and performs the actual computations.

Intermediate Representation (IR): A data structure that represents the computational graph and can then be optimized for specific hardware. In other words, the compiler needs some data structure to optimize, and this is it!

Backend: Takes optimized IR and produces actual machine code or API calls for the target hardware. For example, XLA has different backends - the GPU backend generates CUDA/ROCm code and the CPU backend generates vectorized x86/ARM instructions. Same input graph, different backends = different machine code.

CPython: The most popular implementation of Python, written in C. It acts as both interpreter and compiler, compiling Python into bytecode and then interpreting that bytecode with the Python virtual machine.

Python Virtual Machine (PVM): The runtime environment inside CPython, which interprets the bytecode generated by the CPython compiler and executes it.

Bytecode: An intermediate representation of normal Python code that is platform (read: hardware) independent.
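
If you want to see bytecode for yourself, Python's built-in dis module will print it for any function:

```python
import dis

def scale_and_add(x, y):
    return 2 * x + y

# Prints the platform-independent bytecode instructions that the PVM interprets
# (LOAD_FAST, BINARY_OP, RETURN_VALUE, ... - exact names vary by Python version).
dis.dis(scale_and_add)
```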

Assembly Code: A human-readable, text-based representation of machine instructions, aka the last stop before binary-ville. The TensorFlow runtime and libraries (written in C++ and CUDA for GPUs) convert TensorFlow ops into assembly code, which is then assembled into the machine code the hardware executes directly.

Instruction Set Architecture (ISA): The set of instructions that a CPU or GPU can execute. It defines the machine code instructions, registers, and data types supported by the hardware. Different hardware architectures have different ISAs, such as x86 for Intel CPUs and PTX for NVIDIA GPUs.

Opcodes: These are the actual machine-level instructions that the CPU or GPU executes. They are derived from the assembly code and represent the fundamental operations supported by the hardware, such as arithmetic operations, memory accesses, and control flow instructions.

Machine Code: Instructions that the CPU can process directly. It's written in binary, so it's not human readable.

Machine Learning Definitions

Compute Graph (Computational Graph): A directed graph that represents your ML model as nodes (operations like matmul, add, relu) connected by edges (tensors flowing between ops). When you write y = tf.nn.relu(tf.matmul(x, W) + b), it builds a graph with nodes for matmul, add, and relu.

TensorFlow: A framework for defining compute graphs (amongst other things)

Frozen Computation Graph: A compute graph where all the weights are baked in as constants rather than variables. This is what you get when you export a trained model for inference; the graph contains both the operations and the actual weight values frozen together. The compiler can go wild with optimizations here since it knows these values will never change.
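
Here's a rough sketch of freezing using TensorFlow's convert_variables_to_constants_v2 helper (it lives in an internal module whose path has moved between TF versions, so treat this as illustrative):

```python
import tensorflow as tf
from tensorflow.python.framework.convert_to_constants import (
    convert_variables_to_constants_v2,
)

class TinyModel(tf.Module):
    def __init__(self):
        super().__init__()
        self.w = tf.Variable(tf.random.normal([8, 4]))

    @tf.function(input_signature=[tf.TensorSpec([None, 8], tf.float32)])
    def __call__(self, x):
        return tf.matmul(x, self.w)

model = TinyModel()
concrete = model.__call__.get_concrete_function()

# Freezing: variable reads become Const nodes baked into the GraphDef.
frozen = convert_variables_to_constants_v2(concrete)
print(set(node.op for node in frozen.graph.as_graph_def().node))
```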

Protobuf: Serialization format (binary) that holds both the compute graph of the network and the weights themselves in files using the .pb extension in TensorFlow. More technically, it’s a way to serialize structured data like XML, but way less (human) readable.

TensorFlow Backend: Converts TensorFlow operations (TF ops) into lower-level representations and handles the execution of these operations on the target hardware.

TensorFlow Operations (TF ops): Basically an IOU to create and execute a certain CUDA kernel if using the GPU. Instead of being cashed in by the normal Python interpreter, the IOU is handled by TensorFlow’s C++ runtime, which bypasses the interpreter and executes pre-compiled CUDA kernels on the GPU for the actual computations.

TensorFlow Kernel: A TensorFlow kernel implements the core computation for a TensorFlow operation (op), usually in C++. Each operation in TensorFlow is associated with one or more kernels that handle the actual computation on different hardware types (CPU, GPU, etc.).
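
One way to peek at which device (and therefore which registered kernel) gets picked for an op is TensorFlow's device placement logging:

```python
import tensorflow as tf

# Log which device executes each op, i.e. which kernel implementation runs.
tf.debugging.set_log_device_placement(True)

a = tf.random.normal([256, 256])
b = tf.random.normal([256, 256])
c = tf.matmul(a, b)  # logs something like "MatMul ... device:GPU:0" if a GPU kernel is chosen
```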

Eager Execution: Eager execution is an imperative programming environment in TensorFlow that evaluates operations immediately, without building graphs. This mode is intuitive and easy to debug, but there isn’t any graph optimization happening.
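
For example, in eager mode the result is just there, no graph or session required:

```python
import tensorflow as tf

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
y = tf.matmul(x, x)

# The op ran immediately; y already holds concrete values.
print(y.numpy())
```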

Graph Mode: Graph mode in TensorFlow involves building a computational graph of operations before executing them. This mode allows for optimizations and efficient execution, as the entire graph can be analyzed and optimized before running. It is the default mode for TensorFlow 1.x and can be used in TensorFlow 2.x via tf.function.

tf.function: A TensorFlow decorator that converts a Python function into a TensorFlow graph. This allows for optimizations and efficient execution of the function, enabling it to run faster and on different hardware backends. It also supports automatic differentiation for training models.
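
A minimal example, reusing the running dense + relu computation from above:

```python
import tensorflow as tf

@tf.function
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([32, 128])
w = tf.random.normal([128, 64])
b = tf.zeros([64])

# The first call traces the Python function into a graph;
# later calls with compatible inputs reuse that graph.
y = dense_relu(x, w, b)
```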

Concrete Function: A compiled version of a tf.function optimized for specific input shapes and types. When you first call a tf.function with inputs (e.g., tensors of shape [32, 128]), TensorFlow “traces” your Python code and creates a concrete function just for that input signature. If you later call it with different shapes, TensorFlow creates another concrete function. This lets tf.function handle different inputs efficiently while still running fast optimized graphs.
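
You can also ask for concrete functions explicitly and see that different input signatures produce different ones:

```python
import tensorflow as tf

@tf.function
def square(x):
    return x * x

# One concrete function gets traced per input signature.
small = square.get_concrete_function(tf.TensorSpec([8, 8], tf.float32))
large = square.get_concrete_function(tf.TensorSpec([32, 128], tf.float32))

print(small is large)  # False: different shapes, different traced graphs
```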

GraphDef: A protobuf message containing the serialized representation of a TensorFlow computational graph. It stores the complete graph structure including all nodes (operations), their connections (edges/tensors), and attributes like shapes and data types. GraphDef is TensorFlow’s way of persisting and transferring computation graphs - when you save a model or convert between eager and graph modes, you’re often working with GraphDef. Each node in a GraphDef contains: a unique name, the operation type, input tensor references, and operation-specific attributes. This serialized format allows graphs to be saved, loaded, optimized, and executed across different environments.
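
For example, dumping the GraphDef behind a concrete function shows exactly those nodes:

```python
import tensorflow as tf

@tf.function
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

concrete = dense_relu.get_concrete_function(
    tf.TensorSpec([1, 4], tf.float32),
    tf.TensorSpec([4, 2], tf.float32),
    tf.TensorSpec([2], tf.float32),
)

# Each node carries an op type, a unique name, and input references.
for node in concrete.graph.as_graph_def().node:
    print(node.op, node.name)
```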

Compilation Definitions

Compilation: The process of transforming high-level code into a lower-level representation that can be executed by hardware. Models can be compiled to run on different devices like CPUs and GPUs, and the exact same model code will be compiled differently for different hardware.

Compilation Backend: The underlying system or framework that performs the compilation and optimization of the computational graph. This can include XLA, MLIR, or other compiler frameworks that target specific hardware architectures.

LLVM (Low Level Virtual Machine): A general-purpose compiler infrastructure that ML frameworks like TensorFlow use to generate optimized machine code. LLVM acts as a middle layer - it takes intermediate representations (IR) from various sources and optimizes them for different hardware.

XLA (Accelerated Linear Algebra): An ML-specific compiler that turns TensorFlow computations into optimized machine code targeting various hardware backends.
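
In TF 2.x you can opt a function into XLA with the jit_compile flag (called experimental_compile in some older releases), roughly like this:

```python
import tensorflow as tf

@tf.function(jit_compile=True)  # ask XLA to compile this graph
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([32, 128])
w = tf.random.normal([128, 64])
b = tf.zeros([64])

# The first call triggers XLA compilation; subsequent calls run the compiled code.
y = dense_relu(x, w, b)
```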

Static Shape: Refers to the fixed dimensions of tensors in a computation graph. Static shapes allow for more efficient memory allocation and optimization, as the compiler can make assumptions about the size and layout of data.

Dynamic Shape: Refers to tensors whose dimensions can change at runtime. Dynamic shapes introduce additional complexity in optimization and execution, as the compiler has to account for varying sizes and layouts of data.

Platform (In)dependent: Whether or not the execution of code depends on the specific hardware. Platform-independent code is higher level and, while more generalizable and abstract, does not use the constraints of the hardware to make decisions and thus does not optimize for it. Platform-dependent code is lower level and optimized for the specific hardware, but is less generalizable and portable.

AOT (Ahead-of-Time) Compilation: compiles code into machine code before runtime, baking the ops and weights into an executable for quick inference and reducing runtime overhead.

JIT (Just-in-Time) Compilation: compiles code into machine code at runtime for faster training. JIT compilation optimizes and executes operations on the fly, balancing the need for optimization with the flexibility to adapt to different runtime conditions and input data.

Compilation Target: The specific hardware or software environment for which the computational graph is being compiled. This can include CPUs, GPUs, TPUs, or specialized accelerators.

Compilation Pass: A single transformation or optimization step applied to the computational graph during the compilation process. Each pass can improve performance, reduce memory usage, or enhance parallelism.

Lowering: The process of translating high-level operations into progressively more hardware-specific instructions. Each step gets closer to what the hardware actually understands. The tricky part is that different hardware needs different lowering - a matmul on a CPU might become SIMD instructions while on a GPU it becomes a cuBLAS call or custom kernel. This is why the same model code can run on different devices: the compiler handles all the device-specific lowering for you. In practice, lowering might look like: tf.matmul → Hlo dot operation → tiled loop nest → CUDA kernel → PTX instructions → actual GPU machine code.

Tracing: The process of converting a Python function into a TensorFlow graph by recording the operations performed during execution. This allows TensorFlow to optimize and execute the function in a way that is aware of the hardware and data constraints.
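
Tracing is easy to observe: Python-level side effects run while the function is being traced, not every time the resulting graph executes:

```python
import tensorflow as tf

@tf.function
def f(x):
    print("tracing!")            # Python side effect: runs only during tracing
    tf.print("executing graph")  # graph op: runs on every call
    return x + 1

f(tf.constant(1))  # prints "tracing!" and "executing graph"
f(tf.constant(2))  # same signature, graph reused: only "executing graph"
```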

XLA Definitions

XLA Backend: A specific backend within TensorFlow that compiles and optimizes TensorFlow operations into efficient machine code for various hardware platforms, such as CPUs, GPUs, and TPUs. XLA stands for Accelerated Linear Algebra, and it provides a way to optimize TensorFlow computations by lowering the high-level operations into more efficient representations.

XLA Compilation: The process of transforming TensorFlow operations into optimized machine code using the XLA compiler. This involves converting high-level TensorFlow operations into a lower-level representation (Hlo IR), applying various optimizations, and generating target-specific machine code.

Hlo Module: The entire compiled graph or program that XLA generates from a TensorFlow computation. It’s a container that holds the Hlo compute graph and is used throughout XLA’s entire compilation pipeline. An Hlo module exists when first created from the TensorFlow input, during optimization passes when it gets modified, and in its final form after optimization.

XLA Module: A more casual way of referring to an Hlo Module.

HloComputation: A function or subgraph within an HloModule. Each computation is a graph of HloInstructions with defined inputs and outputs.

HloInstruction: The actual operations in XLA’s IR - these are the nodes in your computation graph. Each instruction does one thing: dot product, reshape, add, etc. They’re what Hlo optimization passes actually transform. When you see “Hlo.dot” or “Hlo.add” in compiler outputs, those are HloInstructions

NOTE: The hierarchy is: HloModule → HloComputations → HloInstructions

Hlo IR (High Level Optimization Intermediate Representation): XLA’s specialized intermediate representation for ML which represents operations in a hardware-agnostic way. XLA has a fixed set of operations carefully selected to be orthogonal to each other [1]. When TensorFlow uses XLA compilation, XLA optimizes Hlo through multiple passes (fusion, layout assignment, etc.) before lowering it to machine code. An example: tf.matmul → Hlo.dot operation → fused kernel.
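
Recent TF versions let you peek at the HLO for a jit-compiled function via an experimental API (name and availability may vary by version):

```python
import tensorflow as tf

@tf.function(jit_compile=True)
def dense(x, w):
    return tf.matmul(x, w)

x = tf.random.normal([4, 8])
w = tf.random.normal([8, 2])

# Returns the HLO text for this input signature; look for the dot
# instruction that tf.matmul was lowered into.
print(dense.experimental_get_compiler_ir(x, w)(stage="hlo"))
```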

MLIR (Multi-Level Intermediate Representation): A general compiler construction kit that lets you mix and match different levels of abstraction in the same system. You might have a high-level “TensorFlow dialect” that understands tf.matmul, a mid-level “linalg dialect” that thinks in terms of loop nests and tiling, and a low-level “LLVM dialect” that’s all about registers and memory.

The real power is that we can gradually lower between these levels of representation. So a tf.nn.conv2d might become a linalg.conv, then a bunch of nested loops, then vectorized instructions while other parts of the graph are still at higher levels.

NOTE: Hlo IR is a fixed representation inside XLA. MLIR is a framework where Hlo can be just one dialect among many. Some TensorFlow models compile via XLA (using Hlo IR directly), others via MLIR (which might use an Hlo dialect internally).

Optimized Hlo: The final optimized version of the Hlo IR after all optimization passes have been applied. This is the representation that will be lowered to machine code for execution on the target hardware.

NVIDIA / GPU Definitions

NVIDIA Software Definitions

CUDA: The NVIDIA API for running kernels on the GPU

cuDNN: An NVIDIA library with optimized implementations of standard deep learning operations such as convolution, pooling, etc. TensorFlow uses it to accelerate training and inference on NVIDIA GPUs.

cuBLAS: NVIDIA’s library of GPU-accelerated implementations of basic linear algebra operations such as matrix multiplication, vector addition, and dot products. XLA can use cuBLAS to optimize matrix ops when compiling models.

CUDA Kernel: A CUDA kernel is a function written in CUDA C/C++ meant to be run in parallel across many GPU threads. It sits below the TensorFlow code and gets compiled by NVCC into PTX, which ultimately gets turned into CUDA binary machine code. Each kernel specifies the computation performed by a single thread, but when launched, it executes across many threads simultaneously. Kernels are launched from the host (CPU) and execute on the device (GPU).

NVCC: The main CUDA compiler. It orchestrates the entire CUDA compilation pipeline: it splits code into host (CPU) and device (GPU) parts, sends the host code to your system compiler (gcc/clang), and handles device code compilation through the pipeline CUDA C++ → PTX → cubin.

PTX (Parallel Thread Execution): A virtual ISA for NVIDIA GPUs that serves as an intermediate representation between high-level CUDA C++ kernels and GPU machine code. It’s basically CUDA’s equivalent of LLVM IR.

PTX Assembly: A human-readable representation of PTX instructions used to write CUDA kernels. It’s similar to assembly code but specifically for NVIDIA GPUs.

CUDA Binary: The compiled version of PTX assembly code that can be executed on NVIDIA GPUs. It is a lower-level representation that the GPU can understand and execute directly.

ptxas (PTX assembler): The CUDA assembler that transforms PTX into CUDA binaries.

NVIDIA GPU Hardware Definitions

GPU Register: The fastest memory on the GPU, private to each thread. Holds 32-bit values for local variables and intermediate computations.

GPU Thread: The smallest unit of execution on the GPU. Each thread operates on its own piece of data, has its own registers (for temporary variables, program counter, etc.), and executes a kernel. Designed to run in parallel with thousands of other threads.

Warp: A group of 32 threads that execute in lockstep - all threads in a warp execute the same instruction simultaneously on different data. This is NVIDIA’s fundamental unit of execution. Each thread within a warp maintains its own registers.

Thread Block: A 1D, 2D, or 3D collection of threads (up to 1024) that can cooperate through shared memory and synchronization. All threads in a block run on the same streaming multiprocessor.

Grid: A 1D, 2D, or 3D collection of thread blocks that execute a kernel. The grid represents the entire problem space.

Streaming Multiprocessor (SM): The actual compute unit on the GPU - there are on the order of a hundred of these on a modern GPU. Each SM has its own registers and shared memory, and executes multiple warps concurrently. A kernel’s thread blocks get distributed across SMs for parallel execution.

Shared (GPU) Memory: Fast on-chip memory that allows multiple threads within a block to share data. This is distinct from the L1/L2 caches on the GPU.

Occupancy: The ratio of active warps to the maximum possible warps per SM. Higher occupancy often (but not always) means better performance by hiding memory latency.

Compiler Optimization Definitions

Fusion: An optimization technique that combines multiple operations into a single operation. This reduces the overhead of launching multiple kernels and can lead to better cache utilization.
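
A toy sketch of the idea in plain Python/NumPy (not what a real compiler emits, just the shape of the win):

```python
import numpy as np

x = np.random.rand(8)

# Unfused: two separate passes over the data, with an intermediate buffer
# in between (conceptually two kernel launches).
tmp = x * 2.0
out_unfused = np.maximum(tmp, 0.5)

# "Fused": one pass that does both steps per element, no intermediate buffer.
# A fused kernel does this in compiled code; the pure-Python loop is only illustrative.
out_fused = np.array([max(v * 2.0, 0.5) for v in x])

assert np.allclose(out_unfused, out_fused)
```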

Loop Unrolling: An optimization technique that expands loops to reduce the overhead of loop control and increase instruction-level parallelism. This can lead to better performance by allowing more operations to be executed in parallel. For example, if you have a loop that iterates 10 times, unrolling it might transform it into 10 separate operations that can be executed in parallel.
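
In code, the transformation (normally done by the compiler, shown here by hand) looks roughly like:

```python
data = [1, 2, 3, 4]

# Original loop: each iteration pays loop-control overhead.
total = 0
for i in range(4):
    total += data[i]

# Unrolled: no loop control, and the independent adds give the hardware
# more instruction-level parallelism to exploit.
total_unrolled = data[0] + data[1] + data[2] + data[3]

assert total == total_unrolled
```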

Constant Folding: An optimization technique to evaluate constant expressions at compile time rather than runtime. For example, if you have an operation like 2 + 3, the compiler can replace it with 5 before executing the code, reducing unnecessary computations during runtime.
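
Shown by hand, the transformation is just:

```python
# What you wrote: an expression made entirely of constants.
seconds_per_day = 60 * 60 * 24

# What the compiler effectively emits after constant folding.
seconds_per_day_folded = 86400

assert seconds_per_day == seconds_per_day_folded
```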

Common Subexpression Elimination (CSE): Replace things that are computed multiple times with a single computation. For example, if you have an expression like a + b that appears multiple times in your code, CSE will compute it once, call it c, and reuse c everywhere a + b used to be.
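
By hand, the transformation looks like:

```python
def before(a, b, x):
    # a + b is computed twice.
    return (a + b) * x + (a + b) / x

def after(a, b, x):
    # CSE: compute it once, reuse the result.
    c = a + b
    return c * x + c / x

assert before(2.0, 3.0, 4.0) == after(2.0, 3.0, 4.0)
```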

Tiling: A technique that breaks down large operations into smaller, manageable pieces (tiles) to improve memory access patterns and parallelism. Tiling can help reduce cache misses and improve overall performance by ensuring that data is processed in chunks that fit well within the available memory hierarchy.
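
A toy tiled matrix multiply in NumPy (tile size and shapes are made up for illustration): instead of streaming over whole rows and columns, work proceeds tile by tile so each chunk stays hot in cache.

```python
import numpy as np

def tiled_matmul(a, b, tile=32):
    n, k = a.shape
    k2, m = b.shape
    assert k == k2
    out = np.zeros((n, m), dtype=a.dtype)
    # Process the output in tile x tile blocks; each block only touches
    # tile-sized slices of a and b, which fit comfortably in cache.
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                out[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return out

a = np.random.rand(64, 96)
b = np.random.rand(96, 48)
assert np.allclose(tiled_matmul(a, b), a @ b)
```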

References & Resources

  1. https://github.com/tensorflow/mlir-hlo