Domino: Splitting the Draft and the Causal Fix in Speculative Decoding

Quick answer

Domino is a speculative-decoding method that stops forcing one model to do two conflicting jobs. A parallel backbone drafts an entire block of tokens in a single forward pass for speed, and a small Domino head then refines those guesses using the causal dependencies between draft tokens that the parallel pass ignored. The reported gains are up to 5.49x end-to-end speedup with a Transformers backend and up to 5.8x throughput under SGLang serving. The target model’s output is unchanged, because the target still verifies every draft token and commits only those consistent with what it would have produced on its own.

The trade-off Domino is built around

Every speculative decoder hits the same wall. An autoregressive drafter (the EAGLE family) generates draft tokens one at a time, so each token sees the ones before it and the guesses are good — but you pay a sequential forward pass per draft token, which is slow. A parallel drafter (Medusa-style heads) emits the whole block in one shot and is fast, but each position is predicted blind to its neighbors, so the block is internally inconsistent and the target model rejects more of it. Higher acceptance or lower draft cost — you normally pick one.
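To make the trade-off concrete, here is a back-of-the-envelope cost model. It is not taken from the Domino paper: it uses the common simplification that each draft token is accepted independently with probability alpha, and the acceptance rates and draft-to-target cost ratio in the example are made-up illustrative values.

```python
def expected_tokens_per_step(alpha: float, block_size: int) -> float:
    """Expected tokens committed per target verification step.

    Assumes each draft token is accepted independently with probability
    `alpha` (a simplification), and that the target always contributes one
    token of its own (the correction or the bonus token).
    """
    if alpha >= 1.0:
        return float(block_size + 1)
    return (1.0 - alpha ** (block_size + 1)) / (1.0 - alpha)


def speedup(alpha: float, block_size: int,
            draft_cost: float, draft_passes: int) -> float:
    """Speedup over plain decoding, in units of one target forward pass.

    `draft_cost` is the cost of one drafter pass relative to one target pass.
    An autoregressive drafter needs `block_size` passes; a parallel one needs 1.
    """
    step_cost = 1.0 + draft_passes * draft_cost
    return expected_tokens_per_step(alpha, block_size) / step_cost


if __name__ == "__main__":
    k, c = 4, 0.05  # hypothetical block size and relative draft cost
    print("autoregressive, alpha=0.8:", round(speedup(0.8, k, c, draft_passes=k), 3))
    print("parallel,       alpha=0.6:", round(speedup(0.6, k, c, draft_passes=1), 3))
    print("parallel,       alpha=0.8:", round(speedup(0.8, k, c, draft_passes=1), 3))
```

The third line is the regime a method like Domino aims for: one cheap draft pass with acceptance closer to an autoregressive drafter. The model ignores the cost of any extra refinement module, so treat it as an upper bound on that case.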

Domino’s claim is that these are separable problems and should be solved by separate modules, instead of asking a single drafter to compromise between them.

How Domino works

The draft stage runs in two passes. First, the parallel backbone produces a preliminary distribution for every position in the block at once — cheap, but each position is unaware of its siblings. Then the Domino head, a lightweight module, takes those preliminary distributions plus the confirmed prefix and refines each position using prefix-dependent causal information. The result is a block that is both produced quickly and internally coherent, so a larger fraction survives the target model’s verification step.
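The two-pass draft can be sketched with a toy example. The abstract does not describe the head’s internals, so everything here is an assumption made for illustration: the head is modeled as an additive bias on the backbone’s logits, tokens are picked greedily, and the names `draft_block`, `causal_bias`, and `no_repeat_bias` are hypothetical.

```python
from typing import Callable, List, Sequence

Logits = List[float]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def draft_block(base_logits: List[Logits],
                prefix: List[int],
                causal_bias: Callable[[List[int]], Logits]) -> List[int]:
    """Refine a parallel draft position by position.

    `base_logits[i]` is the backbone's preliminary guess for position i,
    produced for all positions in one pass. `causal_bias(context)` stands in
    for the lightweight head: it sees the confirmed prefix plus the draft
    tokens chosen so far and returns a per-token logit adjustment.
    """
    context = list(prefix)
    block: List[int] = []
    for logits in base_logits:
        bias = causal_bias(context)
        refined = [l + b for l, b in zip(logits, bias)]
        token = argmax(refined)
        block.append(token)
        context.append(token)  # later positions can now depend on this one
    return block


def no_repeat_bias(context: List[int], vocab_size: int = 3) -> Logits:
    """Toy stand-in for a learned head: penalize repeating the last token."""
    return [-1.0 if t == context[-1] else 0.0 for t in range(vocab_size)]


def zero_bias(context: List[int], vocab_size: int = 3) -> Logits:
    """No correction: recovers the raw parallel draft."""
    return [0.0] * vocab_size


if __name__ == "__main__":
    base = [[2.0, 1.0, 0.0], [1.0, 0.9, 0.0]]  # made-up backbone logits
    print("parallel only:", draft_block(base, [1], zero_bias))
    print("with causal bias:", draft_block(base, [1], no_repeat_bias))
```

Only the cheap correction runs inside the loop; the expensive backbone pass has already produced all positions at once, which is the point of the split.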

The second ingredient is training. Domino uses a base-anchored training curriculum: early training keeps the parallel backbone strong and stable, and only gradually shifts the objective toward the causally corrected final distributions. The motivation is practical — if you optimize the causal correction too aggressively from the start, you risk degrading the parallel backbone the head depends on. Anchoring to the base keeps both halves useful.
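One simple way to picture a base-anchored schedule is a loss weight that starts at zero and ramps up. This is a guess at the shape, not the authors’ recipe: the linear ramp, the `warmup_steps` and `ramp_steps` parameters, and the cap `max_weight` (which keeps some weight on the backbone loss throughout) are all assumptions.

```python
def final_weight(step: int, warmup_steps: int, ramp_steps: int,
                 max_weight: float = 0.9) -> float:
    """Weight on the causally corrected (final) loss at a given step.

    Zero during warmup, then a linear ramp up to `max_weight`. Capping below
    1.0 keeps the backbone loss in the objective for the whole run.
    """
    if step < warmup_steps:
        return 0.0
    progress = min(1.0, (step - warmup_steps) / ramp_steps)
    return max_weight * progress


def combined_loss(base_loss: float, final_loss: float, step: int,
                  warmup_steps: int = 100, ramp_steps: int = 400) -> float:
    """Blend the backbone loss and the corrected-output loss."""
    w = final_weight(step, warmup_steps, ramp_steps)
    return (1.0 - w) * base_loss + w * final_loss


if __name__ == "__main__":
    for step in (0, 100, 300, 500, 1000):
        print(step, final_weight(step, warmup_steps=100, ramp_steps=400))
```

Early steps train only the backbone; later steps are dominated by the corrected output while the backbone term never fully disappears.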

Why decoupling is the interesting bet

The honest read: Domino’s contribution is architectural framing more than a brand-new mechanism. Parallel drafting and causal refinement both existed; the bet is that explicitly separating “propose fast” from “fix dependencies” gives a cleaner Pareto curve than tuning a single drafter to do both. The speedup numbers suggest the bet pays off, but the framing also adds a second module and a custom training schedule, so it is not free in engineering terms.

What makes it matter now is deployment economics. Once a model is live, inference is the recurring cost, and speculative decoding is one of the few accelerators that leaves the output distribution exactly unchanged. A method that raises the speedup ceiling while keeping that guarantee translates directly into lower serving cost for anyone running a large model.

Key results

Up to 5.49x end-to-end speedup when integrated with a Hugging Face Transformers backend.
Up to 5.8x throughput speedup under SGLang, a production-grade serving engine — the more telling number, because it reflects batched serving rather than single-stream latency.
The acceleration is lossless in the speculative-decoding sense: the target model still verifies every draft block, so accepted tokens match what the target would have generated on its own.
The gain comes from committing more tokens per verification step (the parallel block stays coherent after the causal head) rather than from cutting draft cost alone.
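The lossless property is easiest to see in the greedy case. The sketch below is the generic verification rule for speculative decoding under greedy decoding, not code from the Domino paper; with sampling, an accept/reject rule on probabilities replaces the equality check. The function name and the toy token ids are illustrative.

```python
from typing import List


def verify_greedy(draft: List[int], target_next: List[int]) -> List[int]:
    """Return the tokens committed after one target verification pass.

    `target_next[i]` is the target's greedy token given the confirmed prefix
    plus `draft[:i]`, so it has len(draft) + 1 entries (one target pass scores
    every draft position plus one extra).
    """
    if len(target_next) != len(draft) + 1:
        raise ValueError("target_next must have len(draft) + 1 entries")
    committed: List[int] = []
    for i, token in enumerate(draft):
        if token != target_next[i]:
            # First mismatch: keep the target's own token and stop.
            committed.append(target_next[i])
            return committed
        committed.append(token)
    # Whole block accepted: the target's extra prediction comes for free.
    committed.append(target_next[len(draft)])
    return committed


if __name__ == "__main__":
    print(verify_greedy([5, 7, 9], [5, 7, 2, 4]))
```

Every committed token is one the target chose itself, so a better drafter changes only how many tokens are committed per step, never which ones.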

Limits and open questions

The available portion of the abstract reports speedup multipliers but does not pin them to head-to-head acceptance-length numbers against EAGLE-2/3 or Medusa on a fixed model and dataset. Speedup figures are notoriously sensitive to base model, batch size, hardware, and the acceptance threshold, so the 5.49x and 5.8x should be read as best-case, not typical. The extra Domino head and the base-anchored curriculum add training and serving complexity that a single-module drafter avoids; whether that complexity is worth it depends on how much of the gain survives outside the authors’ setup. The abstract also states no limitations, so robustness across model families, long-context prompts, and very large batch sizes cannot be judged from it alone. Treat Domino as a promising decoupling idea with strong reported numbers that the community still needs to reproduce.

FAQ

How does Domino speculative decoding get up to 5.49x speedup?

Domino drafts an entire block in one parallel forward pass, then runs a lightweight causal Domino head to make the block internally consistent before the target model verifies it. More of each block survives verification, so more tokens are committed per target step — that higher acceptance, not a cheaper draft alone, is where the up-to-5.49x (Transformers) and 5.8x (SGLang) gains come from.

Does Domino change the model’s output quality?

No. Like all lossless speculative decoding, Domino only proposes candidate tokens; the original target model still verifies each one. Under greedy decoding every committed token is exactly what the target would have generated, and under sampling the standard accept/reject rule preserves the target’s output distribution. The method spends extra drafting compute to gain speed, not accuracy.

How is Domino different from EAGLE and Medusa?

EAGLE drafts autoregressively (high acceptance, slow per-token), and Medusa drafts in parallel (fast, lower acceptance). Domino splits these into separate modules: a parallel backbone for speed plus a causal head that restores the token dependencies a parallel drafter loses, aiming to get both benefits at once.

What is the base-anchored training curriculum in Domino?

It is a schedule that keeps the parallel draft backbone strong early in training and only gradually shifts the objective toward the causally corrected output. This stops aggressive optimization of the causal head from degrading the backbone it relies on.

One line: let one module draft fast and another module add the causal dependencies back, instead of forcing one drafter to compromise. Read the original paper on arXiv.