On the speed and overhead of unrolled loops (stackoverflow answer)

TODO: Some of this tf.function tracing material is good TF/GPU graph background and could be added to other tutorials.

“A few points to keep in mind about @tf.function (laymanesque):

@tf.function builds a callable graph of the function that it decorates
That graph is referenced to by using a key that is the function signature. This signature is TensorSpec if the input to the function is a Tensor and is a tuple with actual values of the arguments if the input to the function are not tensors
Each time the graph is called the key is checked in all available ‘callable graphs’ and if a match is found then that ‘already built callable graph’ is used. If not, the function is converted to callable graph and then it is called. Building the graph is referred to as tracing the function by documentations

Calling the function with python natives creates a new graph each time it is called. That particular combination of inputs is simply not present as the key whereas in case of a tensor the TensorSpec key is same for each tensor with same shape and dtype

If a python iterable is used inside the function then while ‘tracing the function’, the loop will be unrolled to create a gigantic graph. If a tensorflow equivalent like tf.range was used then tensorflow knows how to handle this without unrolling. This unrolling has an overhead the first time the function is run but, unrolled loops are always faster than the loop itself. So, the behavior you will notice is this: With python iterable, as opposed to tensorflow equivalent (tf.range), the first function run is significantly slow, graph thus created will consume more memory on accelerator but, is significantly faster on all subsequent runs as the graph with python iterable uses unrolled loop.”
- https://stackoverflow.com/a/61744937/7437477
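A minimal sketch of the unrolling point above, assuming TensorFlow 2.x with default AutoGraph behaviour. The function names and the iteration count of 100 are made up for illustration. It compares how many ops end up in each traced graph rather than timing anything, so it says nothing about which version runs faster.

```python
import tensorflow as tf

N_STEPS = 100  # hypothetical loop length, chosen only to make the difference visible


@tf.function
def sum_python_range(x):
    total = tf.zeros_like(x)
    # Python range: the loop runs during tracing, so one add op per iteration
    # is written into the graph (the loop is unrolled).
    for _ in range(N_STEPS):
        total += x
    return total


@tf.function
def sum_tf_range(x):
    total = tf.zeros_like(x)
    # tf.range: AutoGraph turns this into a single graph-level while loop.
    for _ in tf.range(N_STEPS):
        total += x
    return total


x = tf.constant(1.0)

# Count the ops in the top-level graph of each traced function.
unrolled_ops = len(sum_python_range.get_concrete_function(x).graph.get_operations())
looped_ops = len(sum_tf_range.get_concrete_function(x).graph.get_operations())

print("ops with Python range:", unrolled_ops)
print("ops with tf.range:   ", looped_ops)
```

Both functions return the same value; only the graph size (and therefore the tracing cost and graph memory) differs.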

Inference speed up with TFLite

Increase the number of threads to reduce latency (at the cost of memory / power) https://www.tensorflow.org/lite/performance/best_practices
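A rough sketch of setting the thread count from Python, assuming a TensorFlow 2.x install where tf.lite.Interpreter is still available (newer releases steer users towards the separate LiteRT package). The Doubler module is a throwaway stand-in for a real model, and 4 threads is an arbitrary choice, not a recommendation.

```python
import numpy as np
import tensorflow as tf


class Doubler(tf.Module):
    """Hypothetical toy model: multiplies its input by two."""

    @tf.function(input_signature=[tf.TensorSpec([1, 4], tf.float32)])
    def __call__(self, x):
        return 2.0 * x


module = Doubler()

# Convert the toy model to a TFLite flatbuffer held in memory.
converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [module.__call__.get_concrete_function()], module
)
tflite_model = converter.convert()

# num_threads is the knob the note above refers to.
interpreter = tf.lite.Interpreter(model_content=tflite_model, num_threads=4)
interpreter.allocate_tensors()

input_index = interpreter.get_input_details()[0]["index"]
output_index = interpreter.get_output_details()[0]["index"]

interpreter.set_tensor(input_index, np.ones((1, 4), dtype=np.float32))
interpreter.invoke()
result = interpreter.get_tensor(output_index)
print(result)
```

Whether more threads actually helps depends on the model and the device, so it is worth measuring latency at a few settings.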

Consider implementing a custom op https://www.tensorflow.org/lite/guide/ops_custom

Effective Tensorflow 2

https://www.tensorflow.org/guide/effective_tf2

“Do not keep tf.Tensors in your objects. These tensor objects might get created either in a tf.function or in the eager context, and these tensors behave differently. Always use tf.Tensors only for intermediate values.

To track state, use tf.Variables as they are always usable from both contexts.”
- https://www.tensorflow.org/guide/effective_tf2#do_not_keep_tftensors_in_your_objects
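A small sketch of the advice above, assuming TensorFlow 2.x. The Counter class and its method names are hypothetical. State lives in a tf.Variable, so the same object can be updated from eager code and from a tf.function.

```python
import tensorflow as tf


class Counter(tf.Module):
    """Hypothetical example object that tracks state in a tf.Variable."""

    def __init__(self):
        super().__init__()
        # Created once, outside any tf.function.
        self.count = tf.Variable(0, dtype=tf.int32)

    def increment_eager(self):
        # Runs op by op in the eager context.
        return self.count.assign_add(1)

    @tf.function
    def increment_graph(self):
        # Runs inside a traced graph, updating the same variable.
        return self.count.assign_add(1)


counter = Counter()
counter.increment_eager()
counter.increment_graph()
counter.increment_graph()

final_count = int(counter.count.numpy())
print(final_count)
```

Storing the running count as a plain tf.Tensor attribute instead would mix eager tensors and graph tensors on the same object, which is the situation the guide warns about.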

Better performance with tf.function

https://www.tensorflow.org/guide/function
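A minimal sketch of the cache-key behaviour described in the Stack Overflow notes at the top, assuming TensorFlow 2.x defaults (no input_signature, no reduce_retracing). The function name and the trace_count counter are made up for illustration. The counter is plain Python, so it only advances while the function is being traced.

```python
import tensorflow as tf

trace_count = 0  # plain Python state: only changes while tracing


@tf.function
def square(x):
    global trace_count
    trace_count += 1  # executed at trace time, not on every call
    return x * x


# Python ints: each new value is a new cache key, so each call traces.
for i in range(3):
    square(i)
traces_python = trace_count

# Tensors: the key is the TensorSpec (shape and dtype), so one trace is reused.
for i in range(3):
    square(tf.constant(i))
traces_tensor = trace_count - traces_python

print("traces for Python ints:", traces_python)
print("traces for tensors:    ", traces_tensor)
```

Note that a Python value that has already been seen hits the cache; it is each new value that triggers a fresh trace.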

Tensorflow Performance Guide

This page has all sorts of facts that fall out of the particularities of running TensorFlow on Cloud TPU: https://cloud.google.com/tpu/docs/TensorFlow-performance-guide

The page can be sorted into two types of information: freebies and unforced errors.

Freebies

“Transposing a kernel right before sending it to a convolution is free if the transpose only swaps the input and output feature dimensions”

“Transposing any of the operands of a tf.matmul or its result are free.”
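For reference, a small sketch of the two ways to spell a transposed matmul operand, assuming TensorFlow 2.x. The matrices are arbitrary. This only shows that the two forms compute the same thing; the "free" claim is the guide's statement about TPU compilation and is not something this snippet measures.

```python
import tensorflow as tf

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])

# Explicit transpose op followed by the matmul.
explicit = tf.matmul(a, tf.transpose(b))

# Same product, with the transpose expressed as a matmul argument.
fused = tf.matmul(a, b, transpose_b=True)

same = bool(tf.reduce_all(tf.equal(explicit, fused)))
print(fused.numpy())
print("identical results:", same)
```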

Unforced Errors

“The gradient calculation for tf.nn.max_pool may be slower than their tf.nn.avg_pool equivalent. Consider switching from max-pooling to average-pooling when possible.”
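A sketch of making the pooling type a switch, assuming TensorFlow 2.x and NHWC inputs. The downsample helper and its use_avg flag are hypothetical. Both ops produce the same output shape here, so the swap is mechanical, but the values differ and so may model accuracy.

```python
import tensorflow as tf


def downsample(x, use_avg=True):
    """Hypothetical helper: 2x2 pooling with a switchable pooling op."""
    pool_op = tf.nn.avg_pool2d if use_avg else tf.nn.max_pool2d
    return pool_op(x, ksize=2, strides=2, padding="VALID")


# One 4x4 single-channel image holding the values 0..15.
x = tf.reshape(tf.range(16, dtype=tf.float32), [1, 4, 4, 1])

avg_out = downsample(x, use_avg=True)
max_out = downsample(x, use_avg=False)

print(avg_out.shape, max_out.shape)
print(tf.reshape(avg_out, [-1]).numpy())
print(tf.reshape(max_out, [-1]).numpy())
```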

“Avoid unnecessary slices and concatenations. Slices and concatenations in a dimension that has been padded is considerably more expensive.”

References

Tensorflow Performance Guide: https://cloud.google.com/tpu/docs/TensorFlow-performance-guide

https://www.tensorflow.org/guide/effective_tf2

https://docs.nvidia.com/deeplearning/frameworks/pdf/TensorFlow-User-Guide.pdf, pg 30, “9.3.2. Out of Memory Issue”

OOM issues? There are too many operations in the cluster, and all the intermediate tensors can’t fit into memory. “The only way to guaranteed address an out-of-memory issue in XLA is by reducing the number of operations in a cluster.”

“Don’t rely on Python side effects like object mutation or list appends”

“tf.function works best with TensorFlow ops; NumPy and Python calls are converted to constants.”
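A minimal sketch of both quotes, assuming TensorFlow 2.x. All names here (python_log, tf_counter, step, frozen_random) are made up for illustration. The list append is a Python side effect and only happens during tracing, while the variable update is a TensorFlow op and happens on every call; the Python random call is evaluated once at trace time and baked into the graph as a constant.

```python
import random

import tensorflow as tf

python_log = []             # plain Python list, mutated as a side effect
tf_counter = tf.Variable(0)  # TensorFlow state


@tf.function
def step(x):
    python_log.append("called")  # Python side effect: runs only while tracing
    tf_counter.assign_add(1)     # TensorFlow op: runs on every call
    return x + 1


for i in range(3):
    step(tf.constant(i))  # same shape and dtype each time, so a single trace

python_appends = len(python_log)
tf_increments = int(tf_counter.numpy())


@tf.function
def frozen_random():
    # random.random() is called once, at trace time; its result becomes a constant.
    return tf.constant(random.random())


first = float(frozen_random())
second = float(frozen_random())

print("list appends:", python_appends, "variable increments:", tf_increments)
print("same 'random' value on both calls:", first == second)
```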