Training diffusion models with mixed noise levels to enable flexible, controllable generation.
Diffusion forcing is a training and inference framework for sequence models that applies independent, variable levels of noise to different tokens or timesteps within a single sequence. Unlike standard diffusion models, which corrupt an entire input uniformly before denoising, diffusion forcing allows each element in a sequence to carry a different "noise level" simultaneously. This asymmetry gives the model a richer supervisory signal during training and unlocks new capabilities at inference time, such as generating sequences of arbitrary length or conditioning on partially observed data without retraining.
The mechanism draws on the mathematics of diffusion processes but reframes them as a per-token masking or corruption scheme. During training, each token in a sequence is independently assigned a noise level sampled from the diffusion schedule, and the model learns to predict the clean value of each token given its noisy neighbors. This forces the model to reason about uncertainty at multiple granularities at once, building internal representations that are robust to partial information. At inference, a practitioner can fix some tokens at zero noise (fully observed) and sample others starting from high noise (fully generated), effectively blending conditioning and generation in a single sampling process.
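The per-token corruption step can be sketched in a few lines of NumPy. This is an illustrative example, not an implementation from the original work: `corrupt_sequence` and its signature are hypothetical, and the cosine-free linear schedule is chosen only for brevity. The key point is that each token draws its own noise-level index, independently of its neighbors.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_sequence(x, alphas_cumprod, rng):
    """Apply an independent noise level to each token (hypothetical helper).

    x: (T, D) clean sequence; alphas_cumprod: (K,) cumulative signal-retention
    schedule, as in standard Gaussian diffusion.
    Returns the noisy tokens, the per-token level indices, and the noise.
    """
    T, D = x.shape
    # Each token t draws its own noise level k[t] -- this is the core of
    # diffusion forcing, versus one shared level for the whole sequence.
    k = rng.integers(0, len(alphas_cumprod), size=T)
    a = alphas_cumprod[k][:, None]           # (T, 1) signal retention per token
    eps = rng.standard_normal((T, D))        # Gaussian corruption
    x_noisy = np.sqrt(a) * x + np.sqrt(1.0 - a) * eps
    return x_noisy, k, eps

# Toy usage: a 6-token sequence of 4-dim tokens, 10-step schedule.
alphas_cumprod = np.linspace(0.99, 0.01, 10)
x = rng.standard_normal((6, 4))
x_noisy, k, eps = corrupt_sequence(x, alphas_cumprod, rng)
```

A model trained under this scheme would receive `x_noisy` together with the level vector `k` and be supervised to recover `x` (or, equivalently, to predict `eps`) at every position.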
The practical significance of diffusion forcing lies in its flexibility. It naturally supports tasks that sit between pure generation and pure prediction—video synthesis conditioned on a few frames, planning under partial observability, or autoregressive generation with controllable look-ahead. Because the noise levels are explicit and continuous, the framework also provides a principled way to trade off diversity against fidelity by adjusting the noise schedule at test time, without the need for separate classifier guidance or fine-tuning steps.
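The blend of conditioning and generation described above can be sketched as a sampling loop that pins observed tokens at zero noise while denoising the rest. All names here are illustrative, and the denoiser is a trivial stand-in (it just shrinks values toward zero) so the loop runs end to end; a real system would substitute the trained per-token denoising model.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_with_conditioning(observed, mask, denoise_step, n_steps, rng):
    """Sketch of mixed-noise sampling (names are assumptions, not a real API).

    observed: (T, D) sequence with known tokens filled in; mask: (T,) bool,
    True where a token is observed. Observed tokens are held at zero noise
    on every step; the rest start as pure noise and are iteratively denoised.
    """
    x = rng.standard_normal(observed.shape)  # unknown tokens start at max noise
    x[mask] = observed[mask]                 # observed tokens: zero noise
    for step in reversed(range(n_steps)):
        x = denoise_step(x, step)            # one reverse-diffusion update
        x[mask] = observed[mask]             # re-pin the conditioning tokens
    return x

def toy_denoise_step(x, step):
    # Placeholder for a trained model's reverse update.
    return 0.9 * x

# Condition on the first two tokens of an 8-token sequence.
observed = np.zeros((8, 4))
observed[:2] = 1.0
mask = np.array([True, True] + [False] * 6)
out = sample_with_conditioning(observed, mask, toy_denoise_step, 10, rng)
```

Because the noise level of each position is explicit, the same loop covers pure generation (empty mask), pure conditioning (full mask), and everything in between, which is what makes tasks like frame-conditioned video synthesis fit naturally.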
Diffusion forcing emerged from research on unifying autoregressive and diffusion-based sequence modeling, with the term and formal framework introduced around 2024. It represents a convergence of ideas from score-based generative models, masked language modeling, and reinforcement learning, and is particularly relevant to robotics, video generation, and decision-making domains where temporal structure and partial observability are central challenges.