ResNet Explained: Deep Residual Learning for Image Recognition

Quick answer

ResNet makes very deep networks trainable by having each block learn a residual (the change to add to its input) instead of an entirely new mapping. An identity “skip connection” carries the input around the block and adds it back to the block’s output. With this design, He and colleagues at Microsoft Research trained networks up to 152 layers deep, 8x deeper than VGG yet with lower complexity, and an ensemble reached 3.57% top-5 error on the ImageNet test set, winning 1st place in the ILSVRC 2015 classification task.

The degradation problem

The motivating puzzle is not overfitting. The authors observed that simply stacking more plain layers makes a network worse: on CIFAR-10, a 56-layer plain net had higher training error than a 20-layer one. That is counterintuitive, because a deeper model can in principle copy the shallower one and set its extra layers to identity, so its training error should never be higher. The fact that optimizers fail to find that solution is what the paper calls the degradation problem: very deep plain networks are hard to optimize, not just hard to generalize.
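The "copy the shallow net and add identity layers" argument can be checked directly. The following is a minimal sketch in plain Python, not code from the paper: the toy layer format, the weights, and the helper names (`forward`, `identity_layer`) are all made up for illustration.

```python
# Toy fully connected network: each layer is a (weights, bias) pair followed by ReLU.
# All weights and names here are hypothetical; only the construction matters.

def relu(v):
    return [max(0.0, a) for a in v]

def linear(weights, bias, v):
    # weights is a list of rows; computes weights @ v + bias
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, bias)]

def forward(layers, v):
    for weights, bias in layers:
        v = relu(linear(weights, bias, v))
    return v

def identity_layer(n):
    # Identity weight matrix and zero bias: the layer passes its input through.
    weights = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    return weights, [0.0] * n

shallow = [
    ([[0.5, -1.0], [1.5, 2.0]], [0.1, -0.2]),
    ([[1.0, 0.3], [-0.7, 0.9]], [0.0, 0.05]),
]
# Deeper network: the same two layers plus four identity layers on top.
deep = shallow + [identity_layer(2) for _ in range(4)]

x = [0.8, -0.4]
shallow_out = forward(shallow, x)
deep_out = forward(deep, x)
print(shallow_out == deep_out)
```

The identity layers are exact here because the activations reaching them have already passed through ReLU and are non-negative. So a solution at least as good as the shallow net exists for the deeper one by construction; the degradation problem is that training does not find it.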

How a residual / skip connection works

Instead of asking a stack of layers to learn a target mapping H(x) directly, ResNet asks it to learn the residual F(x) = H(x) - x, then recovers the output as F(x) + x. The + x term is the skip connection: the input is carried forward unchanged and added to the block’s output. The intuition is that if the ideal function for a block is close to identity, driving the weights toward zero (so that F(x) is near zero and the block outputs roughly its input) is far easier than fitting an identity mapping through a stack of nonlinear layers. Crucially, the identity shortcut adds no extra parameters and only a negligible element-wise addition, so the comparison against a plain network is apples-to-apples.
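To make the F(x) + x wiring concrete, here is a minimal sketch in plain Python. It is a simplification under stated assumptions, not the paper's architecture: F is a hypothetical two-layer fully connected transform rather than convolutions, and batch normalization and the ReLU that the paper applies after the addition are left out to isolate the shortcut.

```python
# Simplified residual block: output = F(x) + x, with F = linear -> ReLU -> linear.
# Fully connected layers stand in for convolutions; names and sizes are illustrative.

def relu(v):
    return [max(0.0, a) for a in v]

def linear(weights, bias, v):
    # weights is a list of rows; computes weights @ v + bias
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, bias)]

def residual_block(params, x):
    (w1, b1), (w2, b2) = params
    f = linear(w2, b2, relu(linear(w1, b1, x)))   # the residual F(x)
    return [fi + xi for fi, xi in zip(f, x)]      # identity shortcut: F(x) + x

def constant_params(n, value):
    # Two layers whose weights all equal `value`, with zero biases.
    layer = ([[value] * n for _ in range(n)], [0.0] * n)
    return layer, layer

x = [0.8, -0.4, 2.5]

# All-zero weights: F(x) = 0, so the block is exactly the identity.
zero_out = residual_block(constant_params(3, 0.0), x)

# Small weights: the block is a small correction on top of its input.
small_out = residual_block(constant_params(3, 0.01), x)
max_shift = max(abs(o - xi) for o, xi in zip(small_out, x))

print(zero_out == x, max_shift)
```

With all-zero weights the block returns its input unchanged, and with small weights it returns the input plus a small correction, which is the sense in which "identity" is the block's easy default.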

Key results

ImageNet (ILSVRC 2015): an ensemble of residual nets reached 3.57% top-5 error, taking 1st place in the classification task.
Depth that actually helps: a 152-layer ResNet outperformed shallower ones, reversing the degradation trend — and it has lower complexity (FLOPs) than VGG-16/19 despite being 8x deeper.
CIFAR-10 to the extreme: the authors trained nets with more than 100 and even more than 1000 layers (110 and 1202 in the paper) to study optimization behavior at depths that had not been practical to train before.
It transfers: purely from the deeper learned representations, ResNet delivered a 28% relative improvement on COCO object detection, and residual nets took 1st place in the ILSVRC 2015 detection and localization tasks and in the COCO 2015 detection and segmentation tasks.
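The "8x deeper," "lower complexity," and "28% relative" claims above are simple arithmetic. The sketch below reproduces them from figures quoted from the paper as best recalled (FLOPs in billions, and COCO mAP@[.5, .95] for a VGG-16 versus a ResNet-101 detector); treat the numbers as assumptions and check them against the paper's tables before reusing them.

```python
# Figures as reported in the ResNet paper; verify against the original tables before reuse.
resnet_depth, vgg19_depth = 152, 19
depth_ratio = resnet_depth / vgg19_depth

# Forward-pass complexity in billions of FLOPs.
flops_billions = {"ResNet-152": 11.3, "VGG-16": 15.3, "VGG-19": 19.6}
resnet_is_cheaper = all(
    flops_billions["ResNet-152"] < flops_billions[name]
    for name in ("VGG-16", "VGG-19")
)

def relative_improvement(baseline, improved):
    # Gain expressed as a fraction of the baseline score.
    return (improved - baseline) / baseline

# COCO mAP@[.5, .95]: VGG-16 baseline vs. ResNet-101 backbone.
coco_gain = relative_improvement(21.2, 27.2)

print(f"depth ratio: {depth_ratio:.0f}x")
print(f"ResNet-152 cheaper than VGG-16/19: {resnet_is_cheaper}")
print(f"COCO relative improvement: {coco_gain:.0%}")
```

Note that the 28% is relative to the baseline score; the absolute gain in these figures is 6.0 mAP points.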

Why it matters now

This is one of the most-cited papers in all of machine learning, and the reason is structural, not just a leaderboard number. The residual/skip connection turned out to be a general training primitive, not an image-only trick. The same idea — add the input back to a sublayer’s output — is the residual stream at the heart of every Transformer, including the language models written about elsewhere on this site. When people say modern deep learning “just works” at depth, a large part of why is this paper: it removed depth as an optimization barrier and made “go deeper” a safe default.

Limits and open questions

ResNet solved trainability, not understanding. The paper’s own explanation, that residual functions are easier to optimize, is empirical; it does not prove why identity shortcuts cure degradation, and later work (e.g. analyses of the loss landscape and “iterative refinement” views) kept revisiting that question. The headline 3.57% is an ensemble result; the best single model, the 152-layer net, reported 4.49% top-5 error on the validation set. The extreme 1202-layer CIFAR-10 net trained without difficulty but tested worse than the 110-layer one (7.93% vs. 6.43% error), which the authors attribute to overfitting, so the recipe is not “deeper is always better.” And the original design predates batch-norm placement debates and the pre-activation variant the same authors published shortly after, which trains very deep nets more reliably, so the v1 architecture here is the breakthrough, not the final word.
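One concrete difference between the v1 block and the later pre-activation variant is where the last ReLU sits. In v1 it is applied after the addition, so it also acts on the skip path; in the pre-activation ordering nothing follows the addition. The sketch below isolates only that ordering, under loud assumptions: the residual branch is a hypothetical stand-in that returns zeros, there is no batch norm, and the input deliberately contains a negative value to make the effect visible.

```python
# Isolates one design difference: ReLU after the addition (v1) vs. none (pre-activation).
# `zero_residual` is a hypothetical stand-in for a residual branch that has learned F(x) = 0.

def relu(v):
    return [max(0.0, a) for a in v]

def v1_block(residual_fn, x):
    # v1 ordering: add the shortcut, then apply ReLU to the sum.
    return relu([f + a for f, a in zip(residual_fn(x), x)])

def preact_block(residual_fn, x):
    # Pre-activation ordering: activations live inside residual_fn,
    # so the shortcut path is a pure identity.
    return [f + a for f, a in zip(residual_fn(x), x)]

def zero_residual(x):
    return [0.0] * len(x)

x = [1.5, -2.0, 0.25]
v1_out, preact_out = x, x
for _ in range(10):                      # stack ten blocks of each kind
    v1_out = v1_block(zero_residual, v1_out)
    preact_out = preact_block(zero_residual, preact_out)

print(v1_out)
print(preact_out)
```

In the v1 stack the negative entry is clipped by the first post-addition ReLU, while the pre-activation stack passes the input through all ten blocks unchanged. This is only an illustration of the ordering, not a reproduction of either paper's experiments.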

FAQ

What problem does ResNet solve?

The degradation problem: stacking more plain layers raised training error, so very deep networks were too hard to optimize. ResNet’s skip connections make those depths trainable, so a 152-layer net beats shallow ones instead of doing worse.

How do skip connections in ResNet work?

A block learns a residual F(x) and outputs F(x) + x, carrying the input forward unchanged via an identity shortcut. If identity is the best mapping, the block only has to push its weights toward zero, which is much easier than learning identity through nonlinear layers.

How accurate is ResNet on ImageNet?

An ensemble of residual nets reached 3.57% top-5 error on the ImageNet test set, winning 1st place in the ILSVRC 2015 classification task. The same representations also gave a 28% relative gain on COCO detection.

Why is the ResNet paper so important?

The skip connection became a universal building block. It is the residual stream inside every Transformer, so ResNet’s contribution reaches far beyond vision — it is one of the most-cited papers in machine learning for exactly that reason.

One line: don’t make a layer learn everything — let it learn the small correction and pass the input straight through. Read the original paper on arXiv.