ControlNet: Adding Spatial Control to Diffusion Models

Quick answer

ControlNet adds spatial control (edges, depth, human pose, segmentation maps) to a large pretrained text-to-image diffusion model without retraining or wrecking it. It freezes the original Stable Diffusion weights, clones the encoder into a trainable copy, and connects the two with “zero convolutions”: convolution layers whose weights start at exactly zero. Because the injected signal is zero at step one, the model begins training as an exact copy of Stable Diffusion and only gradually learns to respond to the condition. The headline practical result is stable training on small datasets, fewer than 50k images, where naive fine-tuning would overfit or collapse.

The problem: text prompts can’t say “put the arm here”

Text-to-image models like Stable Diffusion are excellent at content but terrible at spatial layout. You cannot reliably describe “a person standing exactly in this pose, with the horizon at this height, this building’s outline preserved” in a prompt. Designers and artists want to draw a sketch, hand over a depth map, or trace a pose skeleton and have the model fill it in. The obvious fix, fine-tuning the whole model on (condition, image) pairs, fails in practice. The paired datasets are small, and full fine-tuning of a billion-parameter model on a small set degrades the rich generative prior it learned from billions of images.

Locking the base model, training a copy

ControlNet’s core move is to never touch the original weights. It keeps the production-ready Stable Diffusion model frozen and makes a trainable copy of its U-Net encoder blocks and middle block. The frozen branch preserves everything the model already knows; the trainable copy learns to read the new conditioning input. The copy’s outputs are added back into the frozen U-Net, so the original network still does the heavy lifting of generation while the copy nudges it toward the spatial constraint. This is why a single GPU can train a usable ControlNet: you are learning a control adapter, not a new image model.
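A minimal sketch of the freeze-and-copy idea, in plain Python with no deep-learning library. TinyBlock, make_trainable_copy, and sgd_step are hypothetical names invented for illustration, each "block" is reduced to a single scalar weight, and the gradients are made up. A real ControlNet copies full U-Net blocks; the point here is only that the copy starts with identical weights and that updates reach the copy while the frozen base never changes.

```python
import copy


class TinyBlock:
    """Hypothetical stand-in for one encoder block: a single scalar weight."""

    def __init__(self, weight, trainable=True):
        self.weight = weight
        self.trainable = trainable

    def __call__(self, x):
        return [self.weight * v for v in x]


def make_trainable_copy(base_blocks):
    """Freeze the base blocks in place and return an independent trainable clone."""
    for block in base_blocks:
        block.trainable = False
    branch = copy.deepcopy(base_blocks)  # same starting weights, separate objects
    for block in branch:
        block.trainable = True
    return branch


def sgd_step(blocks, grads, lr=0.1):
    """Apply a gradient step, skipping any block that is frozen."""
    for block, grad in zip(blocks, grads):
        if block.trainable:
            block.weight -= lr * grad


base = [TinyBlock(0.5), TinyBlock(-1.2)]
branch = make_trainable_copy(base)
initial_branch_weights = [b.weight for b in branch]

# Feed the same made-up gradients to both branches.
sgd_step(base, [1.0, 1.0])
sgd_step(branch, [1.0, 1.0])

base_weights = [b.weight for b in base]      # unchanged: the base is frozen
branch_weights = [b.weight for b in branch]  # moved: the copy is trainable
```

Only the copy's weights move after the step, which is the property that lets the control branch learn without altering the pretrained model.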

Zero convolutions

The trick that makes this stable is the connection layer: a 1x1 convolution initialized with both weights and bias at zero, placed before and after the trainable copy. At the first training step a zero-weight convolution outputs zero, so the conditioning branch contributes nothing and the combined network produces exactly the same output as the original Stable Diffusion. No random noise is injected into a carefully pretrained model, so training cannot start by corrupting it. As gradients flow, the zero layers grow their weights away from zero and the control signal fades in. The authors show that the gradient with respect to a zero convolution’s weights is generally non-zero, because it depends on the layer’s input rather than on its current weights, so the layer does learn rather than staying stuck. That detail is what separates this from simply muting a branch.
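The behaviour described above can be checked on a toy case. The sketch below treats a 1x1 convolution at a single pixel as a small matrix multiply over channels, written in plain Python. The feature values, the squared-error loss, the learning rate, and the helper names conv1x1 and zero_conv_params are all assumptions made for illustration, not the paper's training setup.

```python
def conv1x1(x, weight, bias):
    """1x1 convolution at one pixel: mix C_in input channels into C_out outputs."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def zero_conv_params(c_in, c_out):
    """Weights and bias both start at exactly zero."""
    return [[0.0] * c_in for _ in range(c_out)], [0.0] * c_out


# Made-up features for a single pixel.
base_feature = [0.7, -0.3]    # output of the frozen block
branch_feature = [1.5, 2.0]   # output of the trainable copy
target = [1.0, 0.0]           # arbitrary target for a toy squared-error loss

weight, bias = zero_conv_params(2, 2)

# Step 0: the control term is zero, so the fused feature equals the frozen one.
control = conv1x1(branch_feature, weight, bias)
fused = [f + c for f, c in zip(base_feature, control)]

# Gradients of L = 0.5 * sum((fused - target)^2).
upstream = [y - t for y, t in zip(fused, target)]                  # dL/d(control)
grad_weight = [[g * v for v in branch_feature] for g in upstream]  # g_i * x_j
grad_bias = list(upstream)
# Gradient flowing back into the copy is W^T g, which is zero while W is zero.
grad_input = [sum(weight[i][j] * upstream[i] for i in range(2)) for j in range(2)]

# One plain SGD step moves the weights off zero.
lr = 0.1
weight = [[w - lr * g for w, g in zip(w_row, g_row)]
          for w_row, g_row in zip(weight, grad_weight)]
bias = [b - lr * g for b, g in zip(bias, grad_bias)]
control_after = conv1x1(branch_feature, weight, bias)
```

In this toy case the weight gradient is non-zero because it is a product of the upstream gradient and the layer's input, neither of which depends on the zero weights. The gradient passed back into the copy is zero on the very first step and becomes non-zero once the weights have moved.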

Key results

Many conditions, one architecture. The same recipe works for Canny edges, Hough lines, HED soft edges, user scribbles, human pose (OpenPose), depth maps, normal maps, semantic segmentation, and cartoon line art. Each is trained as a separate ControlNet on Stable Diffusion.
Trains on small data. The authors report stable training with datasets under 50k images and report that the method also scales past 1M images; the zero-convolution init is what prevents the small-data regime from collapsing.
Composable. Multiple ControlNets can be combined (e.g. pose plus depth), and conditions work with or without an accompanying text prompt.
No catastrophic forgetting. Because the base is frozen, the model keeps its full generative quality; ablations in the paper show that replacing zero convolutions with standard initialization hurts results.
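The "Composable" point can be pictured as simple addition: each ControlNet produces a residual feature, and the residuals are summed onto the frozen model's feature. The sketch below is a plain-Python illustration under that assumption. The function name compose_controls, the toy feature values, and the optional per-control scales are hypothetical; the scales default to 1.0, which corresponds to plain addition.

```python
def compose_controls(base_feature, control_residuals, scales=None):
    """Add each control residual onto the frozen feature, element by element.

    scales is a hypothetical per-control strength; 1.0 means plain addition.
    """
    if scales is None:
        scales = [1.0] * len(control_residuals)
    fused = list(base_feature)
    for residual, scale in zip(control_residuals, scales):
        fused = [f + scale * r for f, r in zip(fused, residual)]
    return fused


# Made-up features for illustration.
base_feature = [1.0, 2.0, 3.0]
pose_residual = [0.5, 0.0, -0.5]     # from a pose ControlNet
depth_residual = [0.25, 0.25, 0.0]   # from a depth ControlNet

fused = compose_controls(base_feature, [pose_residual, depth_residual])
# fused == [1.75, 2.25, 2.5]

# With no controls attached, the frozen feature passes through unchanged.
no_control = compose_controls(base_feature, [])
```

Because the controls only add to the frozen feature, dropping every ControlNet recovers the base model's behaviour exactly.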

Why it matters now

ControlNet is the reason “draw a sketch, get a controlled image” became a standard workflow rather than a research demo. It turned diffusion models from prompt-only slot machines into directable tools, and the frozen-base-plus-trainable-copy pattern is now a default way to add capabilities to a large model without retraining it. It sits conceptually next to adapters and LoRA, but operates on spatial structure. For anyone building image pipelines, it is the bridge between a layout you already have and a model you cannot afford to retrain.

Limits and open questions

ControlNet is an add-on, not a free lunch. Each condition type needs its own trained model and its own preprocessor (an edge detector, a pose estimator, a depth network), so the quality of your control is capped by the quality of that off-the-shelf annotator. A bad pose estimate yields a bad result. It controls structure, not semantics: it will faithfully follow a depth map yet may ignore fine details of the prompt, and when the condition and the prompt conflict, the two signals fight. It inherits every bias and failure mode of the frozen base model rather than fixing any. And the small-data stability, while real, still assumes you have a frozen model pretrained on web-scale data to stand on. ControlNet does not reduce that upstream cost.

FAQ

How does ControlNet add control without breaking Stable Diffusion?

ControlNet freezes the original Stable Diffusion weights and trains only a cloned copy of the encoder, joined by zero-initialized convolutions. At the start of training the control branch outputs zero, so the network behaves exactly like the unmodified model and learns the new condition gradually instead of overwriting the prior.

What are zero convolutions in ControlNet?

Zero convolutions are 1x1 convolution layers whose weights and biases are initialized to zero, placed around ControlNet’s trainable copy. They make the conditioning contribution start at nothing, so no harmful noise reaches the pretrained model, and they grow their weights during training as the model learns to use the condition.

What conditions can ControlNet use?

ControlNet has been trained for Canny edges, HED soft edges, user scribbles, straight lines, human pose, depth maps, surface normals, semantic segmentation, and line art, among others. Each is a separately trained ControlNet, and several can be combined on one generation.

Does ControlNet need a huge dataset?

No. The paper reports that ControlNet trains stably on datasets under 50k images and also scales beyond 1M. The zero-convolution initialization is what keeps small-dataset training from degrading the frozen base model.

One line: freeze the model you trust, train a copy connected by zeros, and let spatial control fade in without ever corrupting the original. Read the original paper on arXiv.