How Distributed Training Makes Large Models Possible: DDP vs FSDP
Over the past year, a big part of my work has involved training and fine-tuning large foundation models in different domains, including single-cell foundation models and vision and language transformer models. Along the way, I’ve worked with various distributed training methods, and in this article, I’ll walk through Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP) and show you how to decide which one to use.
We’ll start with the basics of why we need distributed training in the first place, then look at how DDP and FSDP each approach the problem. I’ll then show results from experiments I ran comparing the two on a vision foundation model (DINOv2) and an LLM (Mistral-7B). From there, we’ll see where each one shines and where it breaks down.
For most models, training on a CPU is enough, and if you want to speed things up, you move to a GPU. But foundation models are a different story. The datasets are massive, and the models themselves can have billions, or sometimes, even trillions of parameters. At that scale, a single GPU isn’t enough, either because the model doesn’t fit in memory or because training would take an unreasonably long time. When we say the model doesn’t fit in memory, we’re referring to the model parameters, their gradients, and the optimizer states used during training. All of these must be stored during training, and together, they require more GPU memory than the raw parameter count alone might suggest.
For example, a 7B parameter model with Adam optimizer needs roughly 14GB for parameters (using fp16, although bf16 is more commonly used these days, but I digress), another 14GB for gradients, and 56GB for optimizer states (because Adam stores two fp32 values per parameter). That’s around 84 GB before you even account for activations. So then, how do we train these models? Instead of relying on a single device, we spread the workload across multiple GPUs. There are different ways to do this, and not all handle memory the same way. Two of the most commonly used approaches in the PyTorch ecosystem are Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP).
Distributed Data Parallel (DDP)
DDP is the simpler of the two and is usually the first thing that people try (which was the case for me as well). The idea is straightforward: copy the model to each GPU, split the training data across them, and have each GPU process its own batch. After the backward pass, the gradients are averaged across all GPUs using an operation called all-reduce, so every copy of the model stays in sync.
Let’s say you have 4 GPUs and a batch size of 32 per GPU. Each GPU does forward and backward passes on 32 samples. All of them start with identical weights and synchronize gradients after each backward pass. The effective batch size is then 128 samples, and we can achieve close to 4x the throughput of a single GPU. But since there is communication overhead, the time required to process these wouldn’t be exactly 4x less, and depending on the communication protocol used, that could impact things as well.
So for DDP, it’s simple. Your model doesn’t change, your training loop doesn’t change much, and you get near-linear speedup. But because you have to replicate the model on every GPU, if the model doesn’t fit on one GPU, DDP is of no help. This is where FSDP comes in.
Fully Sharded Data Parallel (FSDP)
Here, instead of copying the full model to each GPU, FSDP shards it across GPUs. By that, I mean each GPU now holds a fraction of the model parameters, gradients, and optimizer states.
But if each GPU only has a piece of the model, how does it run a forward pass? It can do so because, before computing a layer, the GPUs gather the full parameters for that layer from all the other GPUs (this is known as all-gather). Once the computation is done, they discard the parameters they don’t own. And then, during the backward pass, the same thing happens, but in reverse: gather parameters, compute gradients, then distribute the gradients back using reduce-scatter so each GPU only keeps its shard. The diagram below shows this: each GPU owns a shard (in green) and gathers the other shards when it needs them for computation. This happens layer by layer as the forward and backward passes progress through the network.
The tradeoff is communication. Instead of just synchronizing gradients at the end like DDP, FSDP does it throughout the forward and backward passes. But now each GPU only needs to store 1/N of the model, where N is the number of GPUs.
A note on model parallelism: If FSDP splits the model across different GPUs, does that mean each GPU is responsible for computing different layers? Not quite. That would be model parallelism (sometimes called pipeline parallelism), where different GPUs hold different layers permanently, and data flows through them in sequence. FSDP works differently. Every GPU still processes all the layers. It’s just that the parameters are sharded for storage. During computation, the full layer is temporarily constructed, used, and then discarded again. So, FSDP is better thought of as memory sharding rather than splitting the model architecture. Therefore, it gives you memory benefits of splitting the model while still keeping the simplicity of data parallelism. We won’t go deeper into model parallelism here, but it’s worth knowing that they approach the problem differently.
Experiments
To see how these two strategies compare in practice, I ran experiments on two models: DINOv2 (a vision foundation model) and Mistral-7B (a large language model). All experiments used 4 A100 80GB GPUs on a single node.
DINOv2 on Food-101
First, I fine-tuned DINOv2 on Food-101, a classification dataset with 101 food categories and about 101k images. The task is to predict an image category given an image. I tested both the base (86M parameters) and large (304M parameters) backbones, using a batch size of 32 per GPU for the base model and 256 per GPU for the large model. This is a case where both strategies work since the model fits comfortably in GPU memory. But even here, we can see the memory difference between the two approaches.
FSDP uses significantly less memory per GPU: 0.39 GB vs 1.79 GB for the base model, and 1.39 GB vs 6.26 GB for the large model. That is roughly 4.5x less memory with FSDP. This matters less when you have plenty of headroom, but it’s a preview of why FSDP becomes essential for larger models. Now what about speed?
DDP is consistently faster, around 6% higher throughput for the base model and 4% for the large model. This is expected since DDP has less communication overhead. Training time tells the same story:
The takeaway here is that when the model fits in memory, DDP is the better choice. It’s simpler and faster. But notice that memory gap. What happens when we scale up to a model where that gap actually matters?
Mistral-7B on Alpaca
Now for the interesting part. I tried fine-tuning Mistral-7B on the Alpaca instruction-following dataset. With DDP, the job crashed immediately with an out of memory error. With FSDP, it ran without an issue.
DDP would need an estimated 87 GB per GPU for model state (parameters, gradients, and optimizers), exceeding the 80 GB available on an A100. FSDP shards this across 4 GPUs, bringing the estimated requirement down to about 22 GB. The actual peak memory during training was around 29 GB, which includes activation and other overhead. This is why FSDP is needed. It makes training possible when DDP simply can’t fit the model.
Code and Reproducibility
All experiments shown here were run using a shared training codebase, where the only change between DDP and FSDP was the distributed strategy configuration. If you’re interested in seeing the full setup, including training scripts and configs, the code is available here: https://github.com/hussenmi/distributed-training
Thank you for reading, and I hope you find this helpful!
