Generative models that learn to reverse a noise-addition process to synthesize new data.
Diffusion models are a class of generative models that learn to create new data by reversing a gradual noising process. During training, clean data samples — most commonly images — are progressively corrupted by adding small amounts of Gaussian noise across hundreds or thousands of discrete timesteps until the original signal is completely destroyed and only random noise remains. A neural network, typically a U-Net or transformer architecture, is then trained to predict and remove that noise one step at a time, effectively learning the statistical structure of the original data distribution.
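To make the training recipe concrete, here is a minimal PyTorch sketch of the closed-form forward corruption and the simplified noise-prediction loss from the DDPM paper. The framework choice, the linear schedule endpoints, and the names `T`, `q_sample`, `ddpm_loss`, and `model` (standing in for any noise-prediction network such as a U-Net) are illustrative assumptions, not details taken from the text above.

```python
import torch
import torch.nn.functional as F

# Assumed linear noise schedule over T discrete timesteps; the endpoint
# values (1e-4 to 0.02) follow a common choice from the DDPM paper.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # fraction of signal surviving to step t

def q_sample(x0, t, noise):
    """Forward process in closed form: jump from clean x0 directly to the
    noised x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    ab = alpha_bars[t].view(-1, 1, 1, 1)     # broadcast over (C, H, W)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

def ddpm_loss(model, x0):
    """One training step of the simplified DDPM objective: the network
    predicts the exact Gaussian noise that was added at a random timestep."""
    t = torch.randint(0, T, (x0.shape[0],))  # random timestep per sample
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    return F.mse_loss(model(x_t, t), noise)  # epsilon-prediction MSE
```

The closed-form jump is what keeps training cheap: each minibatch can target arbitrary timesteps without simulating the noising chain step by step.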
At inference time, the model starts from pure random noise and iteratively denoises it, step by step, until a coherent, high-quality sample emerges. This reverse diffusion process can be guided by conditioning signals — such as a text prompt, a class label, or a reference image — allowing precise control over what gets generated. Techniques like classifier-free guidance amplify this conditioning, markedly improving how faithfully outputs match the prompt. The mathematical framework draws on stochastic differential equations and score matching, connecting diffusion models to earlier work on energy-based and score-based generative approaches.
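Continuing the sketch above (same schedule and assumptions), the reverse process can be written as a loop from t = T-1 down to 0, with classifier-free guidance extrapolating from an unconditional noise estimate toward a conditional one. The convention that `model(x, t, cond)` returns the unconditional prediction when `cond` is `None`, and the default `guidance_scale`, are illustrative assumptions.

```python
@torch.no_grad()
def sample_cfg(model, shape, cond, guidance_scale=7.5):
    """Ancestral DDPM sampling with classifier-free guidance (sketch)."""
    x = torch.randn(shape)                    # start from pure Gaussian noise
    for t in reversed(range(T)):
        tb = torch.full((shape[0],), t)
        # Classifier-free guidance: amplify the conditioning by pushing the
        # noise estimate away from the unconditional prediction.
        eps_c = model(x, tb, cond)            # conditional estimate
        eps_u = model(x, tb, None)            # unconditional estimate
        eps = eps_u + guidance_scale * (eps_c - eps_u)
        # One denoising step: subtract the predicted noise, then re-inject
        # a smaller amount of fresh noise (none at the final step).
        alpha, ab = alphas[t], alpha_bars[t]
        mean = (x - (1.0 - alpha) / (1.0 - ab).sqrt() * eps) / alpha.sqrt()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * z
    return x
```

A guidance scale of 1 reproduces plain conditional sampling; larger values trade sample diversity for closer adherence to the conditioning signal.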
Diffusion models rose to prominence around 2020 with the publication of Denoising Diffusion Probabilistic Models (DDPM), which demonstrated that the approach could match the sample quality of generative adversarial networks (GANs) while training more stably and avoiding mode collapse; follow-up work soon showed diffusion models exceeding GANs outright. Subsequent advances — including DALL·E 2, Stable Diffusion, and Imagen — extended the framework to text-to-image synthesis, video generation, audio synthesis, and molecular design, cementing diffusion models as one of the most versatile and powerful tools in modern generative AI.
The significance of diffusion models lies in their combination of training stability, output diversity, and scalability. Unlike GANs, they do not require adversarial training between competing networks, and they impose fewer restrictive assumptions on the latent space than variational autoencoders do. Their ability to produce photorealistic images, coherent audio, and even novel protein structures has made them central to both commercial AI products and cutting-edge research across science and creative industries.