Document Overview

This document provides context on TensorFlow model compilation, walking through the what and a bit of the why. By the end, hopefully you will be familiar enough with the terminology and framework to start playing with XLA compilation to speed up training or inference times.

Prerequisites

  • have a vague idea of what a computation graph is
  • understand the difference between training and inference
  • understand that graph mode and eager execution are different in TensorFlow, and that we will be talking about graph mode

XLA Overview

“The XLA compiler takes model graphs from ML frameworks and compiles them into machine instructions for various architectures.” 1

In other words, XLA takes the Python code for a TensorFlow model and converts it into a graph of the operations to run on the GPU and/or CPU, called the compute graph. This conversion process is called model compilation. There are other ways to compile TensorFlow models besides XLA. However, XLA is special in that it not only generates the compute graph, but also optimizes it!

Model Compilation Background

The TensorFlow code you define for a model is not the code that is getting executed during training. Rather, the high level python code defining a graph gets transformed into a lower level, optimized graph that runs on the GPU/CPU/TPU/whatever.

Note: Compilation is an overloaded term in this context. When training a TensorFlow model with Python, there are two different compilations happening. One is the compilation of the Python file into bytecode (.pyc) by the CPython interpreter, which the Python Virtual Machine (PVM) then executes, and the other is the compilation of the TF model. For now, we will be referring to the latter when we say “compilation.”

The TF model is compiled with something like model.compile() - an explicit function call to create the computation graph. Regardless of whether we are using XLA or the built-in optimizer (called Grappler), the graph gets compiled one way or another.
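For concreteness, here is a minimal Keras sketch (my own toy example, with made-up layer sizes); in recent TF versions, passing jit_compile=True to compile() asks Keras to build the train step with XLA instead of leaving everything to Grappler:

```python
import tensorflow as tf

# Minimal sketch: a tiny model whose train step gets compiled with XLA.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10),
])

model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    jit_compile=True,  # ask Keras to compile the training step with XLA
)
```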

If we are using XLA with graph mode, we want to create the computation graph and iteratively optimize it. In eager mode, operations are executed immediately as they are called; we never build the computation graph! As a result, a GraphDef is not created automatically, but you can still create one by wrapping the otherwise-eager function with @tf.function.
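As a quick sketch (not from any particular codebase), wrapping an otherwise-eager function with tf.function is all it takes to get a traced graph, and a GraphDef, out of eager-style code:

```python
import tensorflow as tf

def eager_square(x):
    return x * x                               # eager: runs op-by-op as it is called

graphed_square = tf.function(eager_square)     # same code, now traced into a graph

x = tf.constant(3.0)
print(eager_square(x), graphed_square(x))      # same result either way

# Once a concrete function exists, the traced graph is available as a GraphDef.
graph_def = graphed_square.get_concrete_function(x).graph.as_graph_def()
print(len(graph_def.node), 'nodes in the traced graph')
```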

Compiling with XLA can speed up model training or execution times by optimizing the compute graph. It does this by combining, aka fusing, some of the linear algebra operations like matmuls and activations, ultimately creating fewer total kernels and saving on kernel launches. This means that the computation graph generated by the first pass of XLA optimization is usually different from the final graph we end up running.

How does XLA go about optimizing the graph? In short: iteratively, essentially by pattern matching in the vein of metaprogramming. It goes through a series of passes, each pass representing the graph at a slightly different level of abstraction, and pattern matches against common operation sequences that can be combined. The output of each incremental stage of this optimization process is called the intermediate representation (IR).

The most classic example of this is the fused multiply-add (FMA), which corresponds to applying a fully connected layer with a bias. One thing to notice here is that the output of the matmul is the input to the addition. The unoptimized version of this first launches a multiplication kernel to compute the matmul, writes the results back to GPU memory, launches a separate addition kernel to compute the bias, and then writes that back to GPU memory. Instead, the optimized version of this with an FMA launches a single kernel that is capable of doing both the multiplication and addition parts of the computation. This saves us from writing the matmul product as an intermediate value and thus speeds up the whole process.
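If you want to see the fusion for yourself, recent TF versions expose the HLO that XLA produces for a jit-compiled function through tf.function’s experimental_get_compiler_ir; treat this as a hedged sketch, since the available stages and output format vary by TF version:

```python
import tensorflow as tf

@tf.function(jit_compile=True)
def dense_with_bias(x, w, b):
    return tf.matmul(x, w) + b   # matmul followed by add: a candidate for fusion

x = tf.random.normal([4, 8])
w = tf.random.normal([8, 8])
b = tf.random.normal([8])

# Compare the HLO before and after XLA's optimization passes; in the optimized
# version, look for a fusion op where the separate dot and add used to be.
print(dense_with_bias.experimental_get_compiler_ir(x, w, b)(stage='hlo'))
print(dense_with_bias.experimental_get_compiler_ir(x, w, b)(stage='optimized_hlo'))
```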

XLA

  • The GraphDef is first created
  • The first pass creates the HLO IR
  • Subsequent passes optimize the HLO IR

“A cornerstone of XLA is the HLO (High Level Optimizer) IR, which offers a carefully fixed selected list of operations, mostly orthogonal to each other. It provides an efficient optimizer for computations expressed with this set of operations and generate codes for hardware platforms like CPU, GPU, and TPUs” 2

Definitions

General Definitions

There is a glossary below with some of the key terms.

Python Code: This is the high-level code you write to define and train a TensorFlow model. It uses TensorFlow’s Python API to build computational graphs, define operations, and manage data flow.

Bytecode: When you run Python code, the Python interpreter compiles it into bytecode, which is an intermediate representation. This bytecode is then interpreted by the Python virtual machine (PVM). However, for TensorFlow, the Python bytecode mainly orchestrates the operations, rather than performing the heavy computations directly.

TensorFlow Operations (TF ops): These ops are defined in C++ and optimized for performance. The Python API is essentially a wrapper around these efficient implementations. Python code translates into a graph of TF ops, which can be executed independently of the Python runtime. The translation from Python code to these ops involves building this graph, which is then optimized and executed by TensorFlow’s runtime.

Assembly Code: The TensorFlow ops, once defined, are executed by the underlying hardware (CPU, GPU, or TPU). The TensorFlow runtime and libraries (written in C++ and CUDA for GPUs) convert these ops into assembly code, which the hardware understands directly. Assembly code is the low-level human-readable representation of the machine instructions.

Opcodes: These are the actual machine-level instructions that the CPU or GPU executes. They are derived from the assembly code and represent the fundamental operations supported by the hardware, such as arithmetic operations, memory accesses, and control flow instructions.

TensorFlow Definitions

Eager Execution: Eager execution is an imperative programming environment in TensorFlow that evaluates operations immediately, without building graphs. This mode is intuitive and easy to debug, as it executes operations step-by-step as they are called.

Graph Mode: Graph mode in TensorFlow involves building a computational graph of operations before executing them. This mode allows for optimizations and efficient execution, as the entire graph can be analyzed and optimized before running. It is the default mode for TensorFlow 1.x and can be used in TensorFlow 2.x via tf.function.

TensorFlow Kernel: A TensorFlow kernel implements the core computation for a TensorFlow operation (op), usually in C++. Each operation in TensorFlow is associated with one or more kernels that handle the actual computation on different hardware types (CPU, GPU, etc.).

CUDA Kernel: A CUDA kernel is a function written in CUDA C/C++ that runs on NVIDIA GPUs. These kernels are executed in parallel across many GPU cores, providing massive computational power for tasks such as matrix multiplications and convolutions, which are common in TensorFlow operations.

Concrete Function: a specific, optimized, and executable instance of a TensorFlow function defined using tf.function. A concrete function is created with specific input shapes and types, and it contains the compiled and optimized graph ready for execution. Another way to think about the concrete function is that it wraps the original graph function to make it differentiable / tracked by tf.GradientTape3. In other words, “a single graph with a fixed input signature and output.” 4


Compilation Definitions

LLVM (Low Level Virtual Machine): a compiler infrastructure used to optimize and generate efficient machine code for different hardware architectures

Intermediate Representation (IR): IR is an intermediate form of the computational graph or operations, which can then be optimized for specific hardware. In other words, the LLVM compiler needs some data structure to optimize, and this is it! Each optimization pass will update the IR, doing optimizations like loop unrolling, constant folding, and dead code elimination.

HLO IR: High Level Optimizer IR
MLIR: Multi-Level IR, which "is a compiler infrastructure which intends to come with 'battery included', as such it intends to provide all the blocks required to assemble graph optimization and codegen pipelines." 2

AOT (Ahead-of-Time) Compilation: compiles code into machine code before runtime, baking the ops and weights into an executable for quick inference and reducing runtime overhead.

JIT (Just-in-Time) Compilation: compiles code into machine code at runtime for faster training. JIT compilation optimizes and executes operations on the fly, balancing the need for optimization with the flexibility to adapt to different runtime conditions and input data.

Protobuf: a platform neutral way to serialize structured data, similar to XML. This holds both the compute graph of the network as well as the weights themselves. You may be familiar with the .pb extension, which is a protobuf file.

GraphDef: a protobuf (pb) message with a serialized version of the compute graph.

Graph Def

The graph is composed of nodes, one node per compute op. The data we get for each node is (see the sketch after this list):

  • name, which is a UID for the node. *TODO are these the names that correspond to the name arg for tf ops?
  • op, which is the type of operation
  • input, which is a list of input tensors, given as the names of other nodes
  • attr, which is a map of attributes specific to the op, like data types and shapes
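Here is a small sketch (my own toy example) that pulls a GraphDef out of a traced tf.function and prints those per-node fields:

```python
import tensorflow as tf

@tf.function
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([1, 4])
w = tf.random.normal([4, 2])
b = tf.zeros([2])

graph_def = dense_relu.get_concrete_function(x, w, b).graph.as_graph_def()
for node in graph_def.node:
    # name / op / input / attr are exactly the fields described above
    print(node.name, node.op, list(node.input), list(node.attr.keys()))
```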

Clustering

Manual vs Automatic Clustering

Per this forum thread: https://groups.google.com/g/xla-dev/c/cgMgzdjlOQI - when dumping the IR using the TF_XLA_FLAGS, the files are dumped per cluster. These clusters are HloModules.

An HloComputation is basically a function with exactly one output - the root.

s64 data type - a signed 64-bit integer, used to represent the shape.

JIT vs AOT

AOT

What: AOT compilation means compiling the network “ahead of time”, which is useful when there is a known machine that will be running inference on a trained model.

How: AOT creates an executable, much like any other program that is run, like an exe or bash file.

JIT

What: JIT compilation means compiling the network “just in time”, which is useful when creating a model that can be trained on an unknown device. In order to take advantage of the resources of whatever machine will be running the training, a JIT compile turns the network into machine code as needed during the program’s execution.

tf.function

How: using @tf.function is a prerequisite for doing JIT compilation.
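A rough way to check whether JIT compilation is paying off (a sketch with made-up sizes; real speedups depend heavily on the model and the hardware):

```python
import timeit
import tensorflow as tf

def step(x, w):
    for _ in range(10):
        x = tf.tanh(tf.matmul(x, w))
    return x

graph_step = tf.function(step)                   # graph mode, Grappler-optimized
xla_step = tf.function(step, jit_compile=True)   # graph mode + XLA JIT

x = tf.random.normal([256, 256])
w = tf.random.normal([256, 256])

graph_step(x, w)   # warm-up calls so tracing/compilation is not timed below
xla_step(x, w)

print('eager :', timeit.timeit(lambda: step(x, w), number=100))
print('graph :', timeit.timeit(lambda: graph_step(x, w), number=100))
print('xla   :', timeit.timeit(lambda: xla_step(x, w), number=100))
```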

Forward backward pass

What controls how many kernels get launched? Kernels get (re)launched every batch.

“Forward Pass: For each batch, the model performs a forward pass, where it computes the output predictions from the input data. This involves processing through all layers of the model. Each layer can cause one or more kernel launches depending on the complexity of the operations it needs to perform. For example, in a dense layer with a ReLU activation, there could be a kernel launch for the matrix multiplication and potentially another for the activation function. However, these operations may also be fused into a single kernel launch to optimize performance.

Backward Pass: After the forward pass, the backward pass (backpropagation) computes the gradient of the loss function with respect to every trainable parameter in the model. This process also involves multiple kernel launches. For each layer, gradients are calculated, and this typically involves operations that are similar in computational nature to those in the forward pass but also include derivative computations. The optimizer then updates the model parameters based on these gradients, which involves additional computations (and thus kernel launches)” -chatgpt

TODO replace this with the graph that has no metrics

Dummy network graph compiled with XLA

The graph in TensorBoard shows an abstraction of the computational steps rather than the actual kernels and data types being used under the hood.

  • iterator - the inputs to the batch. In this example, x_train and y_train are numpy arrays when they are loaded with the mnist.load_data() function, but get converted to an iterator by the fit() function.
  • sequential - the whole Sequential model is abstracted into this one node.
  • sparse_categorical_crossentropy - the loss used.
  • gradient_tape - records the gradients during backprop.
  • Adam - encapsulates the backprop process.
  • AssignAddVariableOp - on its own, the assign operator is used to update model parameters, aka weights and variables (and variables are pretty much just weights). It gets stuck into the graph as its own node, which is helpful for guaranteeing the parameters get updated at the right time. The AssignAddVariableOp takes this a step further, combining the update step with an add op, saving overhead from kernel launches and memory access.
  • div_no_nan - used by Adam to prevent division by zero.
  • NoOp - there is no immediate operation done, but they are used to represent control dependencies. This is a way of ensuring the graph is run in the correct order.
  • Identity

If we have op A going into op B, and the graph shows A having outputs of NoOps and B having inputs of NoOps, TensorFlow guarantees that A will be finished before B starts. Sometimes ops can be parallelized, but when the output of a layer is the input to the next, this is what enforces the ordering. In our simple example, where an input is fed to a hidden layer and then to an output dense layer, these control dependencies keep the layers running in order.

Here is what it looks like when we run the model using the functional API instead of the Keras Sequential model.

Functional version of the dummy XLA graph

Notice it’s pretty much the same except for the Model node. This means that all the code outside of this node is outside of the model itself, doing things in the training loop like handling metric callbacks and calculating the loss for a batch.

XLA

At a high level, XLA can group TF ops into clusters. For example, with a dense layer in TensorFlow, the matrix multiply and bias addition can be fused into a single cluster, removing the need to write the value out to memory* (out of GPU registers, I think) between doing the multiplication and the addition.

When compiled into executable code, groups of these clusters make up XLA modules. Each module has a UID that can be seen in the filenames of the dotfiles when you use the --xla_dump_hlo_as_dot flag. Since the goal is to create an executable, the hardware target is also in the filename.

The executor type has a UID, and the sm_8.6 in the filename indicates the streaming multiprocessor (SM) architecture, i.e. the compute capability, of the target GPU.

You can dump the IR as text or as dotfiles to be rendered, but an easier approach is just to use the --xla_dump_hlo_as_html flag, which will render the dotfiles as a nice graph. You can’t search them in the same way that you can the text files, but this is okay because the filenames correspond to model stages.

Note: You may want to add a line before the flag to clear out any files from previous runs: os.system('rm -rf xla_baseline_dump/*')
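Putting that together, a dump setup might look like this (the directory name xla_baseline_dump is just an example):

```python
import os

# Sketch: clear any previous dump, then ask XLA to write the HLO for every
# compiled module into ./xla_baseline_dump, rendered as HTML.
os.system('rm -rf xla_baseline_dump/*')
os.environ['XLA_FLAGS'] = '--xla_dump_to=xla_baseline_dump --xla_dump_hlo_as_html'

import tensorflow as tf  # set the flags before TensorFlow/XLA initializes
```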

XLA filenames

Naming reference 5. Once you have the flags added to your script, you should see files getting written to your dump directory.

module vs numeric - module prefix represents a module compiled with XLA, while a numeric prefix indicates a specific HLO module’s unique ID

The module number corresponds to the cluster. All filenames starting with module_0031 belong to cluster_0, and all filenames starting with module_0048 belong to cluster_1.

For the filenames that start with numerics, this is a timestamp followed by the information on the module/cluster number like the rest of the files. TODO * figure out what these have to do with forward/backwards passes

cluster_0/1/2…/n

Example filename: 1713913952980935.module_24097.a_inference__do_s1_2046274__XlaMustCompile_true_config_proto_6001324581131673121_executor_type_11160318154034397263_.13827.sm_8.6_gpu_after_optimizations

Another example: 1715734514924542.module_0000.a_inference_get_cam_pos_17__XlaMustCompile_true_config_proto_6001324581131673121_executor_type_11160318154034397263_.21.before_optimizations.html

The html files are nice visual renderings of the IR graph that is contained in the text or dotfiles obtained by adding the dump flags. The files come in pairs - before and after optimization.

XLA Modules vs Clusters

“The XLA module is a lower-level representation that includes the specific computation and associated data, whereas the XLA cluster is a higher-level grouping used to identify which parts of the TensorFlow computation graph should be compiled together.” -chatGPT

XLA Module:

  • Definition: An XLA module is a self-contained unit of computation in XLA that includes the computation graph and any associated metadata. It represents a collection of operations (kernels) that can be compiled and executed together.
  • Purpose: The module encapsulates the program’s functions, making it possible to perform optimizations and generate efficient machine code. Each module is typically tied to a specific computation task or sub-task within your TensorFlow program.

XLA Cluster:

  • Definition: An XLA cluster is a group of TensorFlow operations that XLA has identified as a unit for compilation. It refers to a subgraph within the TensorFlow computation graph that will be compiled together by XLA.
  • Purpose: Clustering operations allows XLA to apply optimizations across a broader scope of the computation graph, potentially leading to more efficient execution.
  • Relation to Modules: An XLA cluster can correspond to one or more XLA modules, as it is a higher-level concept used during the process of identifying parts of the computation graph to compile.

List of XLA_FLAGS and TF_XLA_FLAGS

A list of flags can be found in the source code: https://android.googlesource.com/platform/external/tensorflow/+/refs/heads/master/tensorflow/compiler/jit/flags.cc#82-158

TF_XLA_FLAGS=--tf_xla_auto_jit=fusible: make TensorFlow only cluster “small” and fusible operations such as pointwise and reduction operations 6 (pg 25). Results in more numerous, smaller clusters potentially adding kernel launch overhead.

From chatgpt:

  • XLA_FLAGS: This is a general environment variable where you can pass multiple flags to the XLA compiler. You can specify options like --xla_hlo_profile to enable HLO-level profiling, which is useful for comparing intermediate representation (IR) graphs.
  • TF_XLA_FLAGS: Similar to XLA_FLAGS, but specifically for TensorFlow. Useful flags include --tf_xla_auto_jit=2, which forces aggressive compilation of all eligible ops to XLA, and --tf_xla_cpu_global_jit to enable global JIT compilation on the CPU.
  • TF_XLA_ALWAYS_DEFER_COMPILATION: Setting this to true delays the compilation until the last possible moment. This can be useful for ensuring all possible optimizations are considered before the final graph is compiled.
  • XLA_CPU_MULTI_THREAD_EIGEN: When set to false, this forces the XLA compiler to use single-threaded Eigen operations. This is not directly about optimization but can be useful for debugging performance issues related to threading.
  • TF_DUMP_GRAPH_PREFIX: This environment variable allows you to dump TensorFlow graphs before and after optimization. You can use this to compare the structure and operations of the graphs pre- and post-XLA optimizations.
  • XLA_DUMP_TO: Specifies the directory where XLA should dump the optimized HLO modules. This is critical for reviewing the optimized versions of your graphs.
  • XLA_DUMP_HLO_AS_TEXT: When set to 1, the HLO module will be dumped as text rather than the binary format, making it easier to read and compare.

Optimization

From the tf documentation, “In an ideal case, your program should have high GPU utilization, minimal CPU (the host) to GPU (the device) communication, and no overhead from the input pipeline.”

The goal is to remove bottlenecks - either memory or compute bound.

“One way to reduce the cluster sizes is by enabling XLA-Lite. This pretty much guarantees this issue to disappear, as the cluster sizes are reduced rather drastically. Another way is to limit the maximum size of each cluster. By default, there is no upper bound to the size of a cluster. The sizes of clusters can be retrieved with: TF_CPP_VMODULE=mark_for_compilation_pass=2. Look for occurrences of “*** Clustering info for graph” to get the sizes of the generated clusters. When running with: TF_XLA_FLAGS=--tf_xla_max_cluster_size=<size>, where <size> is smaller than the largest cluster size, an upper bound can be found that avoids the out-of-memory. Reducing the mini-batch size is another proactive way an out-of-memory issue can possibly be avoided. This requires no changes to XLA. A smaller mini-batch size, in combination with XLA, can still outperform native TensorFlow.” 6
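A sketch of the knobs mentioned in that quote (the cluster-size cap of 200 is an arbitrary placeholder - pick something below your largest reported cluster):

```python
import os

# Log the sizes of the clusters XLA forms ("*** Clustering info for graph" lines).
os.environ['TF_CPP_VMODULE'] = 'mark_for_compilation_pass=2'

# Enable auto-clustering and cap the maximum cluster size to dodge out-of-memory.
os.environ['TF_XLA_FLAGS'] = '--tf_xla_auto_jit=2 --tf_xla_max_cluster_size=200'

import tensorflow as tf  # import after setting the environment variables
```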


TODO: Explain fused multiply add and why it saves a trip in memory

General background - for JIT compilation, the graph will be optimized for the hardware. To do this, XLA looks at the GPU and determines the number of streaming multiprocessors.
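A small sketch of what TensorFlow will report about the GPU it found (assuming tf.config.experimental.get_device_details is available in your TF version); the compute capability is the kind of detail the compiled code gets specialized for:

```python
import tensorflow as tf

for gpu in tf.config.list_physical_devices('GPU'):
    details = tf.config.experimental.get_device_details(gpu)
    print(gpu.name, details.get('device_name'), details.get('compute_capability'))
```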

tf.functions - work best with TF ops and tensors, “NumPy and Python calls are converted to constants and may cause performance issues or unexpected behaviors.” 4

Tracing:

“tf.function() bridges the gap between eager execution and graph execution by separating the code into two stages: tracing and running.” 4

  • Tracing stage: this is where the graph is created, “creating a ConcreteFunction for each set of input shapes and dtypes.”
  • Running stage: the traced graph gets executed. (A small sketch follows below.)
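A minimal sketch of tracing in action; the Python print only fires while tracing, so it shows exactly when a new ConcreteFunction gets created:

```python
import tensorflow as tf

@tf.function
def double(x):
    print('tracing for', x.dtype, x.shape)  # Python side effect: runs only during tracing
    return x * 2

double(tf.constant(1))     # traces a ConcreteFunction for int32 scalars
double(tf.constant(2))     # same shapes/dtypes: reuses the graph, no print
double(tf.constant(1.0))   # new dtype: traces a second ConcreteFunction
print(double.pretty_printed_concrete_signatures())
```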

Summary

Compilation is the process of turning Python code into low-level operations that run on the machine. TensorFlow can handle the details of that compilation for you, or you can deliberately tune and optimize the compilation yourself.

The two main types of compilation are AOT and JIT, often used for inference and training respectively. With JIT compilation, you can optimize the graph to best make use of the resources of a specific computer, right before train time. With AOT compilation, you can bake the graph and weights directly into an executable to run inference on a specific platform.

AOT takes advantage of the machine instructions/opcodes available for a specific architecture, while JIT also considers the memory, cache, and compute capabilities of the specific machine.

Here is the same thing said a few different ways: “TensorFlow model compilation translates the defined model into a series of low-level operations and kernel calls, optimized for parallel execution on CPUs and GPUs.”

Since TF uses a C++ backend

General Python Compilation

Compiling TensorFlow Models

Python (.py) -> CPython interpreter -> Bytecode file (.pyc) -> Python Virtual Machine (PVM) execution -> TensorFlow with XLA -> HLO (High-Level Optimizer) IR (.hlo) -> Optimized machine code (.bin) -> Execution on target hardware

“Static Graph Compilation: When you build a computational graph and execute it, TensorFlow’s Grappler automatically optimizes the graph. This includes operations like constant folding, operator fusion, and other optimizations.

XLA Compilation: If you enable XLA JIT compilation, the static graph (or parts of it) can be further compiled to optimized machine code by XLA.” -chatGPT

Questions

  • How can I tell how many pairs of before/after files there will be in an XLA dump? Does it depend on the number of clusters or modules and what is the difference between those?
  • Is the right way to think about AOT vs JIT pretty much just inference vs training?
  • What does it mean for AOT to be creating an executable? Is it something like since AOT bakes the weights in as constants instead of variables, and that gives you something?
  • How does eager execution work if it’s not constructing a graph? the python bytecode must still be getting translated to machine code by the interpreter, but TensorFlow no longer uses grappler or xla to turn it into a compute graph?
  • What is true of both ends of the spectrum - eager execution and AOT compilation. Both of them have whatever filetype comes out of the python interpreter
  • What are the similarities and differences between AOT and JIT wrt identifying and optimizing the graph based on memory limitations vs GPU compute limitations
  • Besides fusing, what other optimizations is XLA doing? Vectorized instructions, GPU acceleration, removing unneeded control dependencies, memory layout optimizations?
  • What levers do we have to pull on to influence the computation besides auto_clustering aggressiveness and min cluster size and doing a bunch of manual clustering experiments
  • How can I make profiling different setups more rigorous?
  • What is the XLA GPU backend and how is it different than using XLA with TF? “The XLA GPU backend is competitive with the standard TensorFlow implementation, sometimes faster, sometimes slower.”7
  • What sorts of specifics do you want to know about the machine? GPU ops have different kernel launch overheads and massively parallel compute capabilities. That, plus the size of the memory registers? If you want to wait until the last moment to compile the model so that you know the specifics of the machine, what exactly do you need to know?
  • What is the difference between TensorFlow clusters and XLA clusters?
  • What does eigen do, and how does making it single threaded help with debugging?
  • Could creating a custom op speed up the Focalnet? https://www.tensorflow.org/guide/create_op
  • Does each html/dotfile correspond to one XLA graph? Surely there are many kernel launches per graph

Note: This stuff is still kind of a mess. For example, there’s no easy way to see a list of all the TF_XLA_FLAGS! https://github.com/tensorflow/tensorflow/issues/41763

Notice: “Subject to a few constraints, if there are two adjacent operators in the graph that both have XLA implementations, then they will be compiled into a single XLA computation.”7 For ops to get fused, they must be consecutive. If there is one non-XLA compatible op sprinkled in regularly, it might be preventing fusion from happening where you think it is.

“What is the difference between running normal python code and python code that defines a tensorflow model and training loop?” “The key differences between running normal Python code and Python code that defines a TensorFlow model and training loop lie in execution paradigms, performance optimization, and hardware utilization:

Execution Paradigm:

Normal Python Code: Executes line-by-line in a straightforward, imperative manner using the Python interpreter. TensorFlow Code: Constructs a computational graph (declarative paradigm), which is then executed. This separation of graph construction and execution allows for optimization and efficient execution on various hardware. Performance Optimization:

Normal Python Code: Runs as interpreted bytecode, without significant optimizations for numerical computations. TensorFlow Code: The computational graph undergoes several optimizations, such as operation fusion, pruning, and memory management improvements. TensorFlow’s XLA compiler can further optimize the graph into highly efficient machine code. Hardware Utilization:

Normal Python Code: Primarily runs on the CPU and does not inherently leverage specialized hardware like GPUs or TPUs. TensorFlow Code: Designed to utilize GPUs, TPUs, and other accelerators through backend libraries like cuDNN and TensorRT, providing significant performance gains for large-scale numerical computations. Parallelism and Scalability:

Normal Python Code: Generally runs single-threaded unless explicitly parallelized using libraries like threading or multiprocessing. TensorFlow Code: Can automatically parallelize operations and distribute computations across multiple devices (CPUs, GPUs, TPUs) and machines, enabling large-scale distributed training.” -ChatGPT

XLA Optimizations

Inlining -


XLA Module vs Cluster

Filename Context: the filename 1715757062945252.module_0000.a_inference_get_cam_pos_17_XlaMustCompile_true_config_proto_6001324581131673121_executor_type_11160318154034397263.21.before_optimizations.dot indicates that the file contains the IR for a specific XLA module before optimizations have been applied. -chatGPT

tl;dr - when looking at the html files - we are looking at the XLA modules


Auto vs Manual Clustering Priority

How the compilation actually works is best explained here: https://docs.nvidia.com/deeplearning/frameworks/TensorFlow-user-guide/index.html#tf-graph-execution

tl;dr XLA takes a subset of the graph and makes three ops: 1) a copy of the ops in the subset, 2) _XlaCompile - an op to compile the cluster to an XLA binary (the shape must be known for this op), and 3) _XlaRun - an op to run the cluster using the binary from the second op.

The subset itself is determined either by XLA, if using automatic clustering, or by the programmer, if using manual clustering. Automatic clustering is when you let TensorFlow fully decide how to cluster ops. It can be enabled with an environment variable like this: os.environ['TF_XLA_FLAGS'] = '--tf_xla_auto_jit=1'

“When you use both auto-clustering (via the environment variable) and manual clustering (via @tf.function(jit_compile=True)), TensorFlow will apply both strategies:

Manual Clustering Priority: Functions explicitly marked with @tf.function(jit_compile=True) will always be compiled with XLA, regardless of the auto-clustering settings. Auto Clustering: For other parts of the computation graph not explicitly marked for compilation, TensorFlow will use auto-clustering to determine which parts should be compiled by XLA based on the environment variable setting.” -chatGPT
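A sketch of the two strategies used together (auto-clustering via the environment variable, plus one function explicitly marked for XLA):

```python
import os
os.environ['TF_XLA_FLAGS'] = '--tf_xla_auto_jit=1'   # auto-clustering for the rest of the graph

import tensorflow as tf

@tf.function(jit_compile=True)   # manual clustering: always XLA-compiled
def fused_dense(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

y = fused_dense(tf.random.normal([8, 16]),
                tf.random.normal([16, 4]),
                tf.zeros([4]))
```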

References

Resources

https://blog.tensorflow.org/2018/11/pushing-limits-of-gpu-performance-with-xla.html

https://openxla.org/xla/tf2xla

https://autodiff-workshop.github.io/slides/JeffDean.pdf

“Fusion of Array Operations at Runtime” (2016) https://arxiv.org/pdf/1601.05400

“How to make TensorFlow models run faster on GPUs “ https://www.youtube.com/watch?v=cPAD9vLKE0c

https://docs.nvidia.com/deeplearning/frameworks/tensorflow-user-guide/index.html#xla-best-practices -> Changes to the TF Graph with XLA: https://docs.nvidia.com/deeplearning/frameworks/tensorflow-user-guide/index.html#changes-to-tf-graph

https://blog.tensorflow.org/2019/04/mlir-new-intermediate-representation.html

https://arxiv.org/pdf/2210.04323

https://medium.com/@juniper.cto.aiml.2021/introduction-to-tensorflow-and-keras-intermediate-layer-access-af486ab15725

https://www.machinelearningplus.com/deep-learning/how-use-tf-function-to-speed-up-python-code-tensorflow/

https://www.tensorflow.org/api_docs/python/tf/Graph

https://openxla.org/xla/tools

“Introduction to graphs and tf.function” https://www.tensorflow.org/guide/intro_to_graphs

“Better performance with tf.function” https://www.tensorflow.org/guide/function

https://github.com/christianversloot/machine-learning-articles/blob/main/tensorflow-eager-execution-what-is-it.md

https://blog.tensorflow.org/2021/03/a-tour-of-savedmodel-signatures.html - TF Signature Definitions

Bonus

Since there are already plenty of tutorials out there *TODO add tutorials*, if you want a tutorial, here are some better ones:

Control flow documentation in TF repo: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/autograph/g3doc/reference/control_flow.md#if-statements

TBH they explain XLA better than I do so link this somewhere: https://openxla.org/xla/architecture

https://whatdhack.medium.com/tensorflow-graph-graphdef-grappler-xla-mlir-llvm-etc-615191e96ebc

“XLA does not JIT build convolutions and gemms, but picks appropriate kernels from libraries” -https://whatdhack.medium.com/tensorflow-graph-graphdef-grappler-xla-mlir-llvm-etc-615191e96ebc

TODO add to glossary Tf Module, compute Graph, GraphDef, XLA HloInstruction — XLA Input IR

  1. “XLA Architecture: How it Works” https://openxla.org/xla/architecture#how_it_works 

  2. https://github.com/tensorflow/mlir-hlo

  3. https://www.tensorflow.org/api_docs/python/tf/types/experimental/ConcreteFunction 

  4. https://www.geeksforgeeks.org/tf-function-in-tensorflow/

  5. https://docs.graphcore.ai/projects/TensorFlow-user-guide/en/latest/TensorFlow/logging.html#xla-file-naming 

  6. https://docs.nvidia.com/deeplearning/frameworks/pdf/TensorFlow-User-Guide.pdf 

  7. https://docs.w3cub.com/tensorflow~guide/performance/xla/jit