Optimising Transformer Training

Training a large transformer model can be a multi-day, if not multi-week, ordeal. If you’re using cloud compute, it can also be a very expensive affair, not to mention the environmental impact. It’s therefore worth spending a couple of days optimising your training efficiency before embarking on a large-scale training run. Here, I’ll run through three strategies that (hopefully) shouldn’t degrade performance while giving you some free speed. These strategies will also work for any other model built on linear layers.

I won’t go into too much technical detail on any of these techniques, but if you’d like to dig into any of them further, I’d highly recommend the NVIDIA Deep Learning Performance Guide.

Training With Mixed Precision

Training with mixed precision can be as simple as adding a few lines of code, depending on your deep learning framework. It also potentially provides the biggest performance boost of any of these techniques. Training throughput can increase by up to three-fold with little degradation in model quality – and who doesn’t like free speed?

Traditionally, neural network training is done in single precision (FP32), where numbers are stored as 32-bit floating-point values. Mixed precision training performs some operations in half precision (FP16) while maintaining full precision for numerically critical steps. FP16 offers two benefits: it reduces memory usage and it shortens training and inference time. The reduced memory footprint allows for training larger models or using larger batch sizes, while FP16’s faster data transfers and more efficient arithmetic speed up the training process.

With the reduced range of FP16, it’s necessary to prevent small gradient values from underflowing to zero, which can lead to a divergent training loss. This is achieved by scaling the loss by a suitable factor before backpropagation, shifting the gradients into FP16’s representable range and preventing underflow. This process is typically automated in frameworks that support mixed precision (e.g., PyTorch and TensorFlow), which adjust the scaling factor every few iterations.
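As a concrete illustration, here is a minimal sketch of mixed precision training using PyTorch’s torch.cuda.amp module; the model, data, and hyper-parameters are placeholders standing in for a real transformer setup.

```python
import torch
from torch import nn

# Toy model, data, and optimiser stand in for a real transformer training setup.
device = "cuda"
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

# GradScaler implements the dynamic loss scaling described above.
scaler = torch.cuda.amp.GradScaler()

for step in range(100):
    inputs = torch.randn(128, 1024, device=device)
    targets = torch.randn(128, 1024, device=device)

    optimizer.zero_grad()
    # autocast runs eligible ops (e.g. matrix multiplications) in FP16,
    # keeping numerically sensitive ops in FP32.
    with torch.cuda.amp.autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets)

    # Scale the loss to avoid FP16 gradient underflow, then step the
    # optimiser and update the scale factor.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

Because autocast chooses the precision per operation, no changes to the model itself are needed beyond wrapping the forward pass. TensorFlow provides a similar mechanism through its mixed precision policy API.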

Optimising Hyper-Parameter Choice

Not all parameters are created equal. The architecture of modern GPUs favours certain model hyper-parameters, because efficient utilisation of all streaming multiprocessors (SMs) – the core processing units of a GPU – requires that matrix multiplication operations can be partitioned evenly across all available cores.

First, we need to ensure that the tensors in our matrix multiplications are large enough for the work to be parallelised across all the SMs. In the case of transformers this usually isn’t an issue, but you should aim for a batch size of at least 128 so that enough threads can be created to partition the operation across the GPU.

Secondly, we need to ensure that the matrix multiplication operations can be split efficiently into individual threads. This means making sure that all relevant dimensions (batch size, input and output layer sizes) are a multiple of a certain factor, which depends on the GPU architecture and the floating-point precision used. In a transformer with a decoder, this may involve padding the vocabulary size so that the final projection and softmax layer makes efficient use of the GPU cores (see the sketch after the list below).

  • For A100 GPUs: Multiples of 32 (TF32), 64 (FP16), 128 (INT8)
  • For other architectures: Multiples of 4 (TF32), 8 (FP16), 16 (INT8)
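For example, a small (hypothetical) helper like the one below can round a vocabulary size, hidden size, or batch size up to the nearest suitable multiple; the 50,257-token vocabulary and the multiple of 64 are just illustrative values for FP16 on an A100.

```python
def pad_to_multiple(size: int, multiple: int) -> int:
    """Round `size` up to the nearest multiple of `multiple`."""
    return ((size + multiple - 1) // multiple) * multiple

# Illustrative example: pad a 50,257-token vocabulary for FP16 on an A100.
padded_vocab_size = pad_to_multiple(50257, 64)  # -> 50304
```

The extra padding tokens are simply never produced by the tokeniser, so the change is functionally invisible while keeping the final matrix multiplication Tensor Core friendly.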

Using Multiple GPUs

Using multiple GPUs, if accessible, can significantly accelerate training or enable the training of models too large for a single GPU.

For models that fit on a single GPU but need faster training, I recommend using the multi-GPU support in PyTorch (DataParallel) or a similar feature in your deep learning framework. This approach typically yields a substantial performance increase, though training time reduction is not linear with the number of GPUs used.
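A rough sketch of what this looks like in PyTorch is below; the Sequential model is a placeholder for your transformer.

```python
import torch
from torch import nn

# Placeholder model; substitute your transformer here.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

if torch.cuda.device_count() > 1:
    # DataParallel replicates the model on every visible GPU and splits
    # each input batch across the replicas during the forward pass.
    model = nn.DataParallel(model)

model = model.to("cuda")
# The training loop is unchanged: gradients from each replica are
# accumulated back onto the default device before the optimiser step.
```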

For models too large for a single GPU, partitioning across several GPUs becomes necessary. In such cases, the DeepSpeed framework is highly recommended for its efficient model partitioning with ZeRO.
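As a rough sketch (assuming a recent DeepSpeed version and its standard launcher), enabling ZeRO can look roughly like this; the model and the config values are placeholders, not recommended settings.

```python
import deepspeed
from torch import nn

# Placeholder model; substitute your transformer here.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# Minimal DeepSpeed config: ZeRO stage 3 partitions optimiser states,
# gradients, and parameters across GPUs, so no single GPU holds the
# full model. FP16 ties in with the mixed precision section above.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# The training loop then uses the engine's helpers:
#   loss = model_engine(inputs, targets)
#   model_engine.backward(loss)
#   model_engine.step()
```

The script is then launched with something like `deepspeed train.py`, which spawns one process per GPU and handles the distributed setup for you.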
