Distributed training technique that shards model parameters, gradients, and optimizer states across devices.
Fully Sharded Data Parallel (FSDP) is a distributed training strategy designed to overcome the memory limitations of training large-scale deep learning models across multiple accelerators. Unlike standard data parallelism, which replicates the entire model on every device, FSDP partitions — or shards — model parameters, gradients, and optimizer states across all participating devices. Each device holds only a fraction of the total model at any given time, dramatically reducing per-device memory consumption and enabling the training of models too large to fit on any single GPU or TPU.
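To make this concrete, here is a minimal sketch of wrapping a model with PyTorch's FullyShardedDataParallel class. The model and hyperparameters are illustrative placeholders, and the script assumes it is launched with `torchrun` so the usual rank environment variables are set.

```python
# Minimal FSDP sketch (illustrative); launch with:
#   torchrun --nproc_per_node=<num_gpus> train.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend="nccl")  # torchrun sets RANK/WORLD_SIZE
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; in practice this would be a large transformer,
    # usually combined with an auto_wrap_policy so each block becomes
    # its own FSDP unit.
    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
    model = FSDP(model.cuda())  # parameters are now sharded across ranks

    # Construct the optimizer after wrapping so it sees the sharded parameters.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).sum()
    loss.backward()    # gradients are reduce-scattered across ranks
    optimizer.step()   # each rank updates only the shard it owns

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```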
FSDP operates by dynamically gathering the full parameters of each layer just before its forward or backward pass, then immediately discarding them afterward to reclaim memory. During the backward pass, gradients are computed locally and reduced across devices, after which each device retains only its own gradient shard. This pattern, built from the all-gather and reduce-scatter collective communication primitives, ensures that every device contributes to and receives the correct parameter updates while keeping the parameter and gradient data resident on any device at a given moment to a minimum. The approach is closely related to ZeRO (Zero Redundancy Optimizer), a technique pioneered by Microsoft's DeepSpeed library, and FSDP's full-sharding mode can be seen as PyTorch's native implementation of similar ideas, corresponding to ZeRO Stage 3.
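The underlying collectives can be illustrated directly. The sketch below is not FSDP's actual internals; it is a simplified illustration, using `torch.distributed` primitives available in recent PyTorch releases and hypothetical helper names, of how a per-rank shard is gathered for computation and how gradients are reduce-scattered back to their owning ranks.

```python
# Simplified illustration of FSDP's per-layer collective pattern (not the
# library's real internals). Assumes dist.init_process_group() was called.
import torch
import torch.distributed as dist

def gather_full_param(shard: torch.Tensor) -> torch.Tensor:
    """All-gather: every rank materializes the full parameter from the shards."""
    world_size = dist.get_world_size()
    full = torch.empty(shard.numel() * world_size,
                       dtype=shard.dtype, device=shard.device)
    dist.all_gather_into_tensor(full, shard)
    return full  # used for this layer's forward/backward, then freed

def shard_gradient(full_grad: torch.Tensor) -> torch.Tensor:
    """Reduce-scatter: sum gradients across ranks; keep only this rank's shard."""
    world_size = dist.get_world_size()
    out = torch.empty(full_grad.numel() // world_size,
                      dtype=full_grad.dtype, device=full_grad.device)
    dist.reduce_scatter_tensor(out, full_grad, op=dist.ReduceOp.SUM)
    return out
```

Freeing the gathered copy as soon as a layer's computation finishes is what keeps peak memory proportional to the shard size plus a single layer's parameters, rather than to the full model.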
FSDP became practically significant around 2020–2021 as the AI community confronted the challenge of training billion- and trillion-parameter models. Meta AI first released FSDP in its FairScale library and then upstreamed it into PyTorch itself (beginning with version 1.11), making it accessible to a broad audience without requiring specialized infrastructure. This considerably democratized large-model training, allowing academic labs and smaller organizations to experiment with architectures that previously demanded proprietary cluster software.
The importance of FSDP extends beyond raw memory savings. By enabling efficient multi-node, multi-GPU training with minimal code changes, it has become a foundational tool in the modern large language model (LLM) training stack. Frameworks built on top of it, such as Hugging Face's Accelerate and various LLM fine-tuning libraries, rely on FSDP to scale training runs for models such as LLaMA and its successors. As model sizes continue to grow, sharding strategies like FSDP remain essential infrastructure for frontier AI research.
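As one example of the "minimal code changes" point, a training loop written against Hugging Face's Accelerate looks the same whether or not FSDP is active; sharding is switched on through Accelerate's launch configuration (for instance, via the interactive `accelerate config` prompts) rather than in the script itself. The snippet below is a hedged sketch with a placeholder model.

```python
# Hedged sketch: an Accelerate training step. Whether FSDP is used is decided
# by the accelerate config, not by this code. Launch with `accelerate launch`.
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # reads distributed/FSDP settings from the config
model = torch.nn.Linear(1024, 1024)          # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)

x = torch.randn(8, 1024, device=accelerator.device)
loss = model(x).sum()
accelerator.backward(loss)  # used instead of loss.backward() so hooks run
optimizer.step()
optimizer.zero_grad()
```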