Understanding GPU parallelization in deep learning

Deep learning has proven to be the season’s favourite for biology: every other week, an interesting biological problem is solved by clever application of neural networks. Yet, as more challenges get cracked, modern research shifts more and more in the direction of larger models — meaning that increasing computational resources are required for training. Unsurprisingly, NVIDIA, the main manufacturer of GPUs, experienced a significant jump in their stock price earlier this year.

Access to compute is not enough to train good neural networks. As soon as multiple cards come into play, researchers need to use a completely different paradigm in which data and model weights are distributed across different devices — and sometimes even different computers. Though these tools are becoming crucial for successful computational biology research, they are still unfamiliar to many researchers. Hence, in this blogpost, I would like to provide a brief introduction to multi-GPU training.

A word about parallel programming

I have always liked to think about parallel programming along the lines of a high school group project. At the initial meeting, it all seems simple. The project is split across all the students — A does X, B does Y, and so on — and then a date in the future is agreed upon when the project will be “put together”. When that happens, you agree, each piece will make its way into a nice PowerPoint presentation, and everyone will get full marks. Wasn’t life easy in high school?

Well, I don’t know about you, but it rarely worked like that for me. There was always something that would go wrong.

The central problem is that at least one member will take longer than originally anticipated, meaning that the rest of the team will have to sit idle until that piece is completed. Or there will be a communication mistake: something won’t be explained well, and won’t make it into the final picture. While computer processors look quite different from students — for a start, they all have more or less the same throughput, and can communicate quite efficiently — they tend to suffer from the same problems of coordination and communication as a team of tenth-graders.

There are other problems that crop up in parallel programming. For example, two participants may try to use the same resource at the same time — say, both of them writing on the whiteboard at once — which we call a race condition. Or two team members may each be waiting for feedback from the other before moving forward, and thus be unable to make any progress, which we call a deadlock. However, these are generally not a problem in modern deep learning frameworks, so I will not cover them here.

In practice, the general advice in parallel programming is: try to minimise the dependencies between workers, so that each of them can proceed as independently as possible.

Data parallelism: divide and conquer

The ideal setting in any parallel programming exercise is when we can perfectly divide the workload into independent packages, with minimal communication. This occurs when we have relatively small deep learning models, which can fit in a single GPU, and we have a large amount of training data. In this setting, we can — yes — afford to split the workload equally between different devices and then “put it all together”.

What happens, essentially, is that we copy the model weights to every GPU, feed each copy a different slice of the batch, and run backpropagation to compute the gradients. Once the gradients are computed, they are averaged across all the GPUs (an all-reduce operation) and used to update the weights, so that every copy of the model stays in sync. In use cases like training language models, where we make ample use of gradient accumulation to achieve ultra-large batch sizes, data parallelism can be an incredibly successful technique, requiring synchronization less often than once per minute.
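
To make the averaging step concrete, here is a rough sketch of what a single data-parallel update does under the hood, assuming one process per GPU and an already-initialised process group; in practice, the frameworks below handle this synchronization for you (and overlap it with the backward pass):

import torch.distributed as dist

def data_parallel_step(model, loss_fn, batch, targets, optimizer):
    # Each process computes gradients on its own shard of the global batch
    optimizer.zero_grad()
    loss = loss_fn(model(batch), targets)
    loss.backward()

    # Average the gradients across all GPUs (an all-reduce) so that every
    # replica applies exactly the same update and the weights stay in sync
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    optimizer.step()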

In keeping with the simplicity of the idea, data parallelism is straightforward to set up in PyTorch:

import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with: torchrun --nproc_per_node=<number of GPUs> train.py
# torchrun starts one process per GPU and sets LOCAL_RANK for each process
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(10, 10).to(f"cuda:{local_rank}")  # move the model to this process's GPU first
model = DDP(model, device_ids=[local_rank])         # then wrap it so gradients are synchronized

Note that DistributedDataParallel (torch.nn.parallel.DistributedDataParallel) works both in the single-node setting, where all the GPUs are connected to the same motherboard, and in the distributed training setting, where the GPUs may sit in different machines (and even in different countries). The PyTorch documentation recommends using DistributedDataParallel instead of nn.DataParallel even in a single-node setting.

MLOps frameworks also let you use data-parallel strategies. For example, in recent versions of PyTorch Lightning you can pass strategy='ddp' to your Trainer, alongside the number of devices, to get data parallelism automatically (older releases exposed this through the accelerator argument instead). Be careful to read the docs and make sure that you are using the right batch size: PyTorch Lightning passes whatever batch size your dataloader produces to every GPU, so the effective batch size is the per-GPU batch size multiplied by the number of GPUs.
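
As a minimal sketch, assuming a recent Lightning release (where the parallelism strategy is chosen via the strategy argument) and a LightningModule and dataloader that you have defined elsewhere, a data-parallel Trainer could look like this:

import lightning.pytorch as pl  # in older releases: import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,        # GPUs per machine
    num_nodes=1,      # increase for multi-machine training
    strategy="ddp",   # DistributedDataParallel under the hood
)
# trainer.fit(my_module, train_dataloaders=my_dataloader)  # my_module: your LightningModule
# Effective batch size = per-GPU batch size × devices × num_nodes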

Model parallelism: share the load

While data parallelism is quick and efficient, it doesn’t solve an increasingly common problem: deep learning models so massive that they don’t fit on a single GPU, not even on the latest devices with up to 80 GB of dedicated memory.

Model parallelism is a parallel computing technique where different parts or sections of a neural network model run on different devices or nodes. This approach is particularly beneficial when dealing with very large models that don’t fit entirely within the memory of a single GPU. Instead of dividing the data, as in data parallelism, model parallelism divides the model itself. This ensures that even models with a vast number of parameters can be trained by leveraging the combined memory of multiple GPUs.

Here is an example of how this would work in PyTorch:

import torch
import torch.nn as nn

class SimpleModel(nn.Module):
    """A tiny model whose two halves live on two different GPUs."""

    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(10, 10).to('cuda:0')  # first half lives on GPU 0
        self.part2 = nn.Linear(10, 10).to('cuda:1')  # second half lives on GPU 1

    def forward(self, x):
        x = self.part1(x.to('cuda:0'))     # run the first half on GPU 0
        return self.part2(x.to('cuda:1'))  # copy the activations over and finish on GPU 1

model = SimpleModel()
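
A forward pass then hops across devices: the input is moved to cuda:0, the intermediate activations are copied over to cuda:1, and the output comes back on cuda:1. For instance:

out = model(torch.randn(4, 10))  # input batch starts on the CPU
print(out.device)                # cuda:1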

Similarly, PyTorch Lightning lets you train models that are too large for one device via FSDP (Fully Sharded Data Parallel), simply by setting strategy='fsdp' in your Trainer: the parameters, gradients, and optimizer states are then partitioned across the available GPUs.
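
As a sketch, again assuming a recent Lightning release, switching to FSDP is essentially a one-argument change:

import lightning.pytorch as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="fsdp",  # shard parameters, gradients, and optimizer states across the 4 GPUs
)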

Advanced tricks

There is a clear problem with model parallelism though: how do we divide the parameters of the model between devices so that we minimise idle time? In the naive split above, cuda:1 sits idle while cuda:0 computes, and vice versa. Part of the answer is that we design the model alongside the compute architecture, but of course there is quite a bit more going on.

The necessity of training exceedingly large models has given rise to optimization tools like Microsoft’s DeepSpeed. This library offers both distributed training and efficient model parallelism. A standout feature is its Zero Redundancy Optimizer (ZeRO), which partitions optimizer states, gradients, and parameters across GPUs, dramatically conserving memory and reducing data movement. Moreover, DeepSpeed brings forth pipeline parallelism, where model layers are strategically broken down into stages. This layer-wise pipelining ensures GPUs stay busy, with one handling the forward pass of a mini-batch while another tackles the backward pass of a preceding one.
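
To give a flavour of what this looks like in practice, here is a rough sketch of a minimal DeepSpeed setup using ZeRO stage 2 and mixed precision, assuming model is a standard nn.Module you have already defined; check the DeepSpeed documentation for the full set of configuration keys:

import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},  # partition optimizer states and gradients across GPUs
    "fp16": {"enabled": True},          # mixed-precision training
}

# deepspeed.initialize wraps the model, builds the optimizer, and handles the distributed setup
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)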

The allure of DeepSpeed isn’t just its memory efficiency. Its integration with PyTorch, support for mixed-precision training, custom CUDA kernels, and innovations like sparse attention patterns have made it a popular choice. However, the landscape of model parallelism isn’t limited to DeepSpeed. NVIDIA’s Megatron-LM focuses on parallelizing transformer models, Google’s Mesh-TensorFlow provides a way to describe tensor computations that facilitates both model and data parallelism, and HuggingFace has also stepped into the arena with model parallelism solutions. When it comes to pushing the boundaries of deep learning, these advanced tools and techniques are what make training today’s largest models possible.

Conclusions

Central to the deep learning revolution is the ability to deploy massive amounts of compute to process huge amounts of data. Yet access to compute is not enough: careful engineering to distribute training efficiently across large compute architectures is key to furthering the reach of deep learning models. The techniques discussed in this post should give you the basics to make the most of your GPUs.
