
FSDP (Fully Sharded Data Parallel)

Distributed training technique that shards model parameters and optimizer states across devices.

Year: 2021
Generality: 485

Fully Sharded Data Parallel (FSDP) is a distributed training strategy designed to overcome the memory limitations of training large-scale deep learning models across multiple accelerators. Unlike standard data parallelism, which replicates the entire model on every device, FSDP partitions — or shards — model parameters, gradients, and optimizer states across all participating devices. Each device holds only a fraction of the total model at any given time, dramatically reducing per-device memory consumption and enabling the training of models that would otherwise be impossible to fit on a single GPU or TPU.
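To make the memory argument concrete, consider a hypothetical 7-billion-parameter model trained with Adam in mixed precision. A common back-of-envelope accounting assumes roughly 16 bytes per parameter (fp16 weights and gradients plus fp32 optimizer states); the figures below are illustrative assumptions, not measurements, and ignore activations and communication buffers:

```python
# Illustrative (assumed) per-parameter footprint for mixed-precision Adam:
# fp16 weights (2 B) + fp16 gradients (2 B) + fp32 optimizer states (~12 B).
params = 7e9                      # hypothetical 7B-parameter model
bytes_per_param = 2 + 2 + 12
num_devices = 8

replicated_gb = params * bytes_per_param / 1e9                # standard data parallelism
sharded_gb = params * bytes_per_param / num_devices / 1e9     # fully sharded across 8 GPUs

print(f"replicated: ~{replicated_gb:.0f} GB per GPU")         # ~112 GB, beyond any single GPU
print(f"sharded:    ~{sharded_gb:.0f} GB per GPU")            # ~14 GB, before activations
```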

FSDP operates by dynamically gathering the full parameters of each layer just before its forward or backward pass, then immediately discarding them afterward to reclaim memory. During the backward pass, gradients are computed locally and then reduce-scattered across devices, so each device retains only the gradient shard it owns. This all-gather and reduce-scatter pattern, built on standard collective communication primitives, ensures that every device contributes to and receives the correct parameter updates while keeping only a small slice of the model materialized on any device at a given moment. The approach is closely related to ZeRO (Zero Redundancy Optimizer), a technique pioneered by Microsoft's DeepSpeed library, and FSDP can be seen as PyTorch's native implementation of similar ideas.
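In PyTorch this pattern is exposed through the torch.distributed.fsdp.FullyShardedDataParallel wrapper. The following is a minimal sketch of a single-node training loop, assuming a CUDA machine launched with torchrun; the model and hyperparameters are placeholders, and real workloads typically add an auto-wrap policy so each transformer block becomes its own sharded unit:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Launched with: torchrun --nproc_per_node=8 train.py
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Placeholder model standing in for a real network.
model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).cuda()
model = FSDP(model)  # parameters, gradients, and optimizer states are sharded across ranks

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(10):
    batch = torch.randn(8, 4096, device="cuda")
    loss = model(batch).sum()   # forward: all-gather a unit's params, then free them
    loss.backward()             # backward: re-gather params, reduce-scatter gradients
    optimizer.step()            # each rank updates only its own parameter shard
    optimizer.zero_grad()

dist.destroy_process_group()
```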

FSDP became practically significant around 2020–2021 as the AI community confronted the challenge of training billion- and trillion-parameter models. Meta AI Research developed FSDP first in its FairScale library and later integrated it directly into PyTorch as a native API, making it accessible to a broad audience without requiring specialized infrastructure. This democratized large-model training considerably, allowing academic labs and smaller organizations to experiment with architectures that previously demanded proprietary cluster software.

The importance of FSDP extends beyond raw memory savings. By enabling efficient multi-node, multi-GPU training with minimal code changes, it has become a foundational tool in the modern large language model (LLM) training stack. Frameworks such as Hugging Face's Accelerate and various LLM fine-tuning libraries build on FSDP to scale training runs for models like LLaMA and beyond. As model sizes continue to grow, sharding strategies like FSDP remain essential infrastructure for frontier AI research.
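As one hedged illustration of that integration, Hugging Face Accelerate exposes FSDP through a plugin object. The exact defaults and options vary by Accelerate version, and the model/dataloader names here are assumed placeholders, so treat this as a sketch rather than a canonical recipe:

```python
from accelerate import Accelerator, FullyShardedDataParallelPlugin

# Assumes model, optimizer, and train_loader are defined elsewhere,
# and that model returns a Hugging Face-style output with a .loss field.
fsdp_plugin = FullyShardedDataParallelPlugin()   # typically full sharding by default
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

for batch in train_loader:
    loss = model(**batch).loss
    accelerator.backward(loss)   # handles gradient reduce-scatter across ranks
    optimizer.step()
    optimizer.zero_grad()
```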

Related

Data Parallelism

Training technique that splits data across multiple processors running identical model copies simultaneously.

Generality: 794
SPDL (Scalable and Performant Data Loading)

A high-performance framework for efficiently loading large datasets into ML training pipelines.

Generality: 96
Federated Training

Collaborative model training across distributed devices without centralizing raw data.

Generality: 694
Parallelism

Simultaneous execution of multiple tasks across processors to accelerate computation.

Generality: 865
Federated Learning

A training approach that learns from decentralized data without ever centralizing it.

Generality: 711
FSL (Few-Shot Learning)

Training models to generalize accurately from only a handful of labeled examples.

Generality: 710