A high-performance framework for loading large datasets into ML training pipelines.
SPDL (Scalable and Performant Data Loading) is a data ingestion framework developed at Meta to address the chronic bottleneck of feeding large-scale datasets into deep learning training systems fast enough to keep accelerators like GPUs and TPUs fully utilized. As model sizes and dataset volumes have grown dramatically, the data pipeline itself — not the compute — has increasingly become the limiting factor in training throughput. SPDL was designed specifically to close this gap by providing a production-grade, high-performance solution for the data loading stage of ML workflows.
At its core, SPDL leverages asynchronous I/O, multithreading, and efficient memory management to overlap data fetching, decoding, and preprocessing with ongoing model computation. Python's standard data loading utilities typically work around the Global Interpreter Lock (GIL) by spawning worker processes, paying serialization and memory-copy costs to move data between them. SPDL instead implements performance-critical components in C++ that release the GIL and exposes them through a Python interface, so decoding threads can run genuinely in parallel within a single process. This architecture allows it to saturate high-bandwidth storage systems and deliver preprocessed batches to the training loop with minimal stall time, even when working with complex media formats like video or audio.
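The overlap described above can be sketched in plain Python. The example below is not SPDL's API; it is a minimal illustration of the underlying producer/consumer pattern, in which a pool of threads decodes samples into a bounded queue while the training loop consumes them. The `load_sample` and `train_step` callables are hypothetical placeholders; in SPDL, the heavy decode work happens in C++ code that releases the GIL, which is what makes thread-based parallelism effective here.

```python
import queue
import threading

def run_pipeline(paths, load_sample, train_step, num_threads=8, buffer_size=16):
    """Decode `paths` on a thread pool while `train_step` consumes results."""
    buf = queue.Queue(maxsize=buffer_size)   # bounded buffer: decode threads
                                             # block when the trainer lags
    path_iter = iter(paths)
    lock = threading.Lock()                  # guards the shared iterator
    sentinel = object()                      # per-worker end-of-stream marker

    def worker():
        while True:
            with lock:
                path = next(path_iter, None)
            if path is None:
                buf.put(sentinel)
                return
            # In SPDL this step is GIL-releasing C++ code, so decode
            # threads truly run in parallel with the training loop.
            buf.put(load_sample(path))

    for _ in range(num_threads):
        threading.Thread(target=worker, daemon=True).start()

    # Consumer side: the training loop sees a steady stream of ready
    # samples, overlapping I/O and preprocessing with model computation.
    remaining = num_threads
    while remaining:
        item = buf.get()
        if item is sentinel:
            remaining -= 1
        else:
            train_step(item)
```

The bounded buffer is the key design choice: it caps memory use and applies backpressure, so if the model becomes the bottleneck, decode threads simply block rather than piling up samples.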
SPDL also emphasizes scalability across distributed training setups, where data must be sharded and delivered consistently across many workers without introducing synchronization overhead or load imbalance. It integrates with existing ML ecosystems and storage backends, making it practical to adopt in large-scale production environments without a complete pipeline rewrite. The framework's design reflects lessons learned from running some of the world's largest AI training jobs, where even small inefficiencies in data loading compound into significant wasted compute.
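A common way to deliver data consistently without synchronization overhead is deterministic, rank-based sharding, where each worker derives its slice of the dataset from its rank alone. The sketch below illustrates that general pattern, not code from SPDL itself; strided assignment keeps shard sizes within one sample of each other, avoiding load imbalance.

```python
def shard(sample_ids, rank, world_size):
    """Yield the samples assigned to worker `rank`: every world_size-th id,
    starting at `rank`. Shards are disjoint and jointly cover all ids,
    with no communication needed between workers."""
    for i, sid in enumerate(sample_ids):
        if i % world_size == rank:
            yield sid

# Example: 10 samples over 4 workers.
# rank 0 -> [0, 4, 8], rank 1 -> [1, 5, 9], rank 2 -> [2, 6], rank 3 -> [3, 7]
assert list(shard(range(10), 2, 4)) == [2, 6]
```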
The broader significance of SPDL lies in highlighting data loading as a first-class engineering concern in ML infrastructure. As the field moves toward training on trillions of tokens or petabytes of multimodal data, frameworks like SPDL become essential infrastructure — not optional optimizations. Its open-source release has made these performance techniques accessible beyond Meta, influencing how the wider ML community thinks about building efficient, scalable training pipelines.