On the Geometry of On-Policy Distillation: A Distinct Update Regime

Quick answer

On-policy distillation (OPD) is not a midpoint between supervised fine-tuning (SFT) and reinforcement learning (RL); it has its own update geometry. Across a battery of parameter-space diagnostics, OPD updates touch fewer weights than SFT and avoid the model’s principal directions more strongly, yet remain less constrained than RLVR (RL with verifiable rewards). The striking finding is “subspace locking”: OPD’s cumulative weight changes collapse into a narrow, low-dimensional channel very early in training, and constraining all later updates to that early subspace preserves OPD’s performance while substantially degrading SFT under the same constraint.

What OPD is and why its dynamics were a black box

On-policy distillation trains a student model on its own sampled outputs (rollouts), graded against a stronger teacher’s token-level distribution — so the student learns on the trajectories it actually produces, not on a fixed corpus. It has become a go-to recipe for transferring reasoning from a large teacher into a smaller student, sitting somewhere near the intersection of distillation and RL.
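To make the setup concrete, here is a minimal sketch of one possible per-token OPD signal. The description above only says the student's rollouts are graded against the teacher's token-level distribution; the choice of reverse KL divergence, the function names, and the toy logits below are all illustrative assumptions, not the paper's stated objective.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position (assumed grading signal)."""
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0.0)

def opd_loss(student_logits_seq, teacher_logits_seq):
    """Mean per-token divergence over one rollout sampled from the student.

    Each argument is a list of per-position logit vectors. The teacher
    logits are assumed to be computed on the student's own sampled prefix,
    which is what makes the signal on-policy.
    """
    per_token = [reverse_kl(s, t)
                 for s, t in zip(student_logits_seq, teacher_logits_seq)]
    return sum(per_token) / len(per_token)

# Toy rollout: 2 positions, vocabulary of 3 tokens (made-up numbers).
student = [[2.0, 0.5, 0.1], [0.1, 0.2, 3.0]]
teacher = [[2.5, 0.2, 0.0], [0.0, 0.1, 3.5]]
loss = opd_loss(student, teacher)
```

The loss is zero when the student already matches the teacher at every position of its own rollout, and positive otherwise. Unlike SFT, no fixed target sequence appears anywhere in the computation.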

The problem the paper attacks is that almost everyone treats OPD as “RL-flavored SFT” or “SFT-flavored RL” without checking. If OPD were merely a blend, you would expect its parameter-space footprint to interpolate between the two. The authors test that assumption empirically rather than debating it in principle.

How the parameter-space diagnostics work

The method is diagnostic, not a new training algorithm. The authors run three training regimes — OPD, SFT, and RLVR (reinforcement learning with verifiable rewards) — on the same setup and measure where in weight space each one moves the model:

Locality: how many weights move meaningfully, and how concentrated the movement is.
Principal-direction alignment: whether updates ride along the model’s dominant existing directions (as SFT tends to) or deliberately avoid them.
Rank dynamics over time: the effective dimensionality of the cumulative update as training proceeds.
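The three diagnostics listed above can be sketched for a single weight matrix as follows. This is a rough illustration under stated assumptions: the relative threshold for "moved meaningfully", the use of the top-k singular vectors of the initial weights as "principal directions", and the entropy-based effective rank are plausible stand-ins chosen here, not necessarily the exact measures used in the paper.

```python
import numpy as np

def locality(delta, rel_tol=1e-3):
    """Fraction of entries whose change exceeds rel_tol * max |change|."""
    mag = np.abs(delta)
    peak = mag.max()
    if peak == 0:
        return 0.0
    return float((mag > rel_tol * peak).mean())

def principal_alignment(w0, delta, k=4):
    """Share of the update's squared Frobenius norm that lies in the span
    of the top-k left and right singular vectors of the initial weights."""
    u, _, vt = np.linalg.svd(w0, full_matrices=False)
    uk, vk = u[:, :k], vt[:k, :].T
    projected = uk @ (uk.T @ delta @ vk) @ vk.T
    total = np.linalg.norm(delta) ** 2
    return float(np.linalg.norm(projected) ** 2 / total) if total > 0 else 0.0

def effective_rank(delta):
    """Exponential of the Shannon entropy of the normalized singular values."""
    s = np.linalg.svd(delta, compute_uv=False)
    s = s[s > 0]
    if s.size == 0:
        return 0.0
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p)).sum()))

# Toy check on made-up matrices (not data from the paper).
rng = np.random.default_rng(0)
w0 = rng.normal(size=(32, 32))                               # stand-in for initial weights
delta = rng.normal(size=(32, 2)) @ rng.normal(size=(2, 32))  # rank-2 cumulative update
print(locality(delta), principal_alignment(w0, delta), effective_rank(delta))
```

Here `delta` would be the difference between a checkpoint and the initial weights. Computing `effective_rank` on that difference at successive checkpoints gives a rank-over-time curve, and a low `principal_alignment` corresponds to what the text calls off-principal movement.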

OPD lands in what they call a “relaxed off-principal regime”: fewer weights affected than SFT, stronger avoidance of principal directions, but less tightly constrained than RLVR. That places OPD outside the SFT–RLVR line, not on it.

The subspace-locking result

The load-bearing claim is dynamic, not static. OPD’s cumulative updates rapidly funnel into a narrow low-dimensional subspace early in training: they “lock.” To test whether that subspace is incidental or functional, the authors restrict all subsequent updates to the early-formed subspace and continue training only inside it. OPD keeps its performance under this constraint; SFT does not, and the same restriction substantially degrades it. That asymmetry is the evidence that the locked subspace is functionally sufficient for OPD: the channel OPD finds early is the channel it actually needs.
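One way to picture the constrained-training experiment is as a projection applied to every update after the lock point. The sketch below assumes the early subspace is taken from the top-r left and right singular vectors of the early cumulative update and that later steps are projected onto it from both sides. The rank `r`, the two-sided projection, and the random stand-in for optimizer steps are hypothetical choices for illustration, not the paper's documented procedure.

```python
import numpy as np

def early_subspace(delta_early, r):
    """Top-r left and right singular vectors of the early cumulative update."""
    u, _, vt = np.linalg.svd(delta_early, full_matrices=False)
    return u[:, :r], vt[:r, :].T

def project_update(step, u_r, v_r):
    """Keep only the component of a proposed step inside the locked subspace."""
    return u_r @ (u_r.T @ step @ v_r) @ v_r.T

# Toy run on made-up matrices.
rng = np.random.default_rng(0)
d, r = 16, 2
delta_early = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))  # early update
u_r, v_r = early_subspace(delta_early, r)

w = np.zeros((d, d))  # cumulative update accumulated after the lock point
for _ in range(5):
    raw_step = rng.normal(size=(d, d))       # stand-in for an optimizer step
    w += project_update(raw_step, u_r, v_r)  # only the in-subspace part survives
```

Whatever the optimizer proposes, the accumulated change `w` never leaves the rank-r channel fixed by `delta_early`. The reported asymmetry is that OPD tolerates this kind of restriction while SFT does not.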

Key results

Off-principal placement: OPD updates affect fewer weights than SFT and avoid principal directions more strongly, while remaining less constrained than RLVR — so OPD is not an interpolation of the two.
Subspace locking: cumulative OPD updates enter a narrow low-dimensional channel early in training rather than spreading out.
Functional sufficiency: constraining training to that early subspace preserves OPD performance but substantially degrades SFT under the identical constraint.
Robust rank dynamics: sparsifying which tokens get updated, and shifting rollout generation off-policy, both leave the rank dynamics intact.
Sensitivity to the objective: mixing the OPD objective with RLVR is the one intervention that changes the rank dynamics, which suggests the geometry is a property of the OPD objective itself, not of the data or token selection.
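The locking described in the results above can be quantified by comparing the subspace of an early cumulative update with that of a later one. The sketch below uses the mean squared cosine of the principal angles between the two top-r left singular subspaces. Both the overlap measure and the restriction to the left (output-side) subspace are simplifying assumptions made here for illustration; the paper's own metric may differ.

```python
import numpy as np

def top_left_subspace(delta, r):
    """Orthonormal basis for the top-r left singular subspace of an update."""
    u, _, _ = np.linalg.svd(delta, full_matrices=False)
    return u[:, :r]

def subspace_overlap(u_a, u_b):
    """Mean squared cosine of the principal angles between two r-dimensional
    subspaces given by orthonormal bases; 1.0 means identical subspaces."""
    r = u_a.shape[1]
    return float(np.linalg.norm(u_a.T @ u_b) ** 2 / r)

# Toy checkpoints built from made-up matrices.
rng = np.random.default_rng(0)
d, r = 64, 2
basis = rng.normal(size=(d, r))
early = basis @ rng.normal(size=(r, d))
late_locked = early + basis @ rng.normal(size=(r, d))            # stays in the same channel
unrelated = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))    # independent rank-r update

u_early = top_left_subspace(early, r)
locked_score = subspace_overlap(u_early, top_left_subspace(late_locked, r))
unrelated_score = subspace_overlap(u_early, top_left_subspace(unrelated, r))
```

A score that stays near 1 across checkpoints is what locking would look like under this measure, while independent low-rank updates in a high-dimensional space score near r/d.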

Why this matters now

OPD has quietly become a default for distilling reasoning into smaller models, but it has been tuned by folklore. If OPD’s useful updates genuinely live in a low-dimensional subspace fixed early in training, that is a concrete lever: it hints that OPD could be run with far fewer effective parameters, that early-training signals predict the final useful subspace, and that mixing OPD with RLVR is not a free lunch because it disturbs the very geometry that makes OPD efficient.

Limits and open questions

This is a diagnostic study, and it stops short of cashing in its own implications. The main gap: the paper shows the locked subspace is sufficient to preserve performance, but does not turn that into a method that trains faster or cheaper by exploiting it, so the engineering payoff is suggested, not demonstrated. The conclusions also rest on the specific models, teacher, and tasks studied; whether “subspace locking” holds across model scales, modalities, and very different teacher–student gaps is open. And “fewer weights, off-principal, low rank” are descriptive geometric signatures: they say where OPD moves, not why moving there works, so the causal story remains a hypothesis.

FAQ

What is on-policy distillation (OPD) and how is it different from SFT?

On-policy distillation trains a student on its own sampled rollouts, scored against a teacher’s token-level distribution, so it learns on the trajectories it actually generates. SFT trains on a fixed external dataset. This paper shows the difference is not cosmetic: OPD touches fewer weights than SFT and avoids principal directions more strongly.

What is “subspace locking” in on-policy distillation?

Subspace locking is the paper’s finding that OPD’s cumulative weight updates collapse into a narrow, low-dimensional channel early in training. Restricting later updates to that early subspace preserves OPD performance but substantially degrades SFT, which indicates the locked subspace is functionally sufficient for what OPD does.

Is OPD just a blend of SFT and RLVR?

No. The parameter-space diagnostics place OPD in a relaxed off-principal regime that lies outside the SFT–RLVR line rather than between the two, so OPD induces its own update geometry rather than interpolating the others.

Does mixing OPD with RLVR help?

Not for free. Sparsifying update tokens or going off-policy on rollouts leaves OPD’s rank dynamics intact, but mixing the OPD objective with RLVR changes those dynamics — a sign that combining the two disturbs the geometry that makes OPD distinctive.

One line: on-policy distillation is its own beast in weight space — low-rank, off-principal, and locked early. Read the original paper on arXiv.