A neural network architecture that adds precise spatial controls to pretrained diffusion models.
ControlNet is a neural network architecture that enables fine-grained spatial conditioning of pretrained diffusion models such as Stable Diffusion, without modifying or degrading the original model's learned capabilities. Introduced in 2023 by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala in the paper "Adding Conditional Control to Text-to-Image Diffusion Models", it allows users to guide image generation using structured inputs like edge maps, depth maps, human pose skeletons, segmentation masks, and other spatial signals, giving creators far more precise control over generated outputs than text prompts alone can provide.
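As a concrete illustration, the sketch below conditions Stable Diffusion on a Canny edge map using the Hugging Face diffusers library; the model identifiers, file names, and thresholds are illustrative assumptions rather than prescriptions from the text above.

```python
# Sketch: conditioning Stable Diffusion on a Canny edge map with diffusers.
# Model identifiers, file names, and thresholds are illustrative assumptions.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Derive an edge map from a reference photo to act as the spatial condition.
reference = np.array(Image.open("reference.png").convert("RGB"))
edges = cv2.Canny(reference, 100, 200)
edge_map = Image.fromarray(np.stack([edges] * 3, axis=-1))

# Load a ControlNet trained on Canny edges and attach it to a frozen SD 1.5 base.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The text prompt describes content; the edge map constrains layout and structure.
image = pipe(
    "a watercolor house by a lake", image=edge_map, num_inference_steps=30
).images[0]
image.save("controlled_output.png")
```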
The architecture works by creating two copies of a pretrained neural network's encoder blocks: a "locked" copy that preserves the original model weights exactly as trained, and a "trainable" copy that learns to incorporate the new conditioning information. These two branches are connected through "zero convolution" layers—1×1 convolutional layers initialized with both weights and biases set to zero. This initialization is critical: it ensures that at the start of training, the trainable branch contributes nothing to the output, so the model begins from a stable baseline identical to the original. As training progresses, the zero convolutions gradually learn meaningful contributions, allowing the network to adapt without catastrophic interference with pretrained knowledge.
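A minimal PyTorch sketch of this wiring is shown below. It is a simplified assumption of how the pieces fit together, not the actual Stable Diffusion implementation: the block and channel shapes are placeholders, and the real encoder blocks come from the diffusion U-Net.

```python
# Minimal sketch of a ControlNet-style block. Shapes and names are simplified
# assumptions; in practice the blocks come from the pretrained diffusion U-Net.
import copy
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution with both weights and bias initialized to zero."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    def __init__(self, pretrained_block: nn.Module, channels: int):
        super().__init__()
        # Locked copy: the original pretrained weights, frozen during training.
        self.locked = pretrained_block
        for p in self.locked.parameters():
            p.requires_grad_(False)
        # Trainable copy: starts from the same weights but is updated during training.
        self.trainable = copy.deepcopy(pretrained_block)
        # Zero convolutions gate the condition going in and the residual coming out.
        self.zero_in = zero_conv(channels)
        self.zero_out = zero_conv(channels)

    def forward(self, x: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        # Because zero_in and zero_out output zeros at initialization, this forward
        # pass reduces exactly to the pretrained block at the start of training.
        controlled = self.trainable(x + self.zero_in(condition))
        return self.locked(x) + self.zero_out(controlled)
```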
This design has several practical advantages. Because the locked copy remains intact, ControlNet can be trained on relatively small, task-specific datasets without risking the quality of the base model. Trainable branches can also be swapped out or combined, meaning multiple ControlNet modules can be applied simultaneously to stack different types of spatial control. The architecture is computationally efficient enough to train and run on consumer-grade GPUs, democratizing access to sophisticated image generation workflows.
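Stacking looks roughly like the sketch below, which assumes diffusers' support for passing a list of ControlNets; the model identifiers, input files, and conditioning scales are illustrative assumptions.

```python
# Sketch: stacking two spatial controls (edges + depth) on one generation.
# Assumes diffusers accepts a list of ControlNets; model IDs and files are illustrative.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

edge_map = Image.open("edges.png")   # e.g. a Canny edge map, as in the earlier sketch
depth_map = Image.open("depth.png")  # e.g. output of a monocular depth estimator

controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16),
]
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnets, torch_dtype=torch.float16
).to("cuda")

# One conditioning image per ControlNet, with a per-control strength.
image = pipe(
    "an isometric city street at dusk",
    image=[edge_map, depth_map],
    controlnet_conditioning_scale=[1.0, 0.7],
).images[0]
image.save("stacked_controls.png")
```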
ControlNet represents a broader trend in generative AI toward modular, composable conditioning systems. Rather than retraining massive foundation models from scratch for each new task, ControlNet-style adapters allow targeted capability extension at a fraction of the cost. Its influence has extended beyond image generation into video and 3D synthesis, and it has inspired related adapter frameworks that apply similar principles to other large pretrained models.