Imaginative Perception Tokens: Letting VLMs Picture Space, Not Describe It

Quick answer

Imaginative Perception Tokens (IPT) make a vision-language model generate an intermediate image of what a scene would look like under a different spatial configuration — a new viewpoint, a traced path — and reason over that rendered evidence rather than over words. Built on the BAGEL backbone and trained on roughly 20K examples across perspective taking, path tracing, and multiview counting, IPT raises multiview counting accuracy by 3.4% and stays competitive with closed-source models on path tracing. The sharper finding: text chain-of-thought training sometimes made spatial accuracy worse, because forcing geometry through language is a modality mismatch.

The problem: spatial reasoning forced through language

Vision-language models are weak at questions that require mentally moving through a scene — “what would the table look like from the other side”, “does this path reach the door”, “how many distinct objects appear across these views”. The default fix is textual chain-of-thought: make the model narrate its reasoning step by step before answering. For spatial tasks that narration is the wrong tool. A model describing rotation, occlusion, and re-identification in prose is translating a geometric operation into a sequence of tokens that was never meant to carry coordinates. The paper’s name for this is a modality mismatch, and it shows up as chain-of-thought training that degrades rather than helps.

How Imaginative Perception Tokens work

IPT externalizes the model’s spatial guess as a perceptual artifact instead of a sentence. When asked a spatial question, the model emits tokens that decode into an intermediate image: its prediction of what it would perceive under the alternative configuration the question implies. That imagined view serves as both a training target and inference-time evidence: the model is trained so that its rendered guess matches the true alternative view, and downstream the answer is read off the imagined perception rather than off a text description.
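The imagine-then-answer flow described above can be sketched in a few lines. This is a toy illustration, not BAGEL’s API or the paper’s code: the `ImagineThenAnswer` class, both callables, and the grid-of-labels stand-in for an image are hypothetical, and no pixels are actually generated.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for an image: rows of object labels,
# row 0 nearest the viewer, each row listed left to right.
View = List[List[str]]


@dataclass
class ImagineThenAnswer:
    """Two-stage flow: render the implied view first, then answer from it."""
    imagine: Callable[[View, str], View]     # assumed: (observed view, question) -> imagined view
    read_answer: Callable[[View, str], str]  # assumed: (imagined view, question) -> answer

    def __call__(self, observed: View, question: str) -> str:
        imagined = self.imagine(observed, question)  # the "perception token" step
        return self.read_answer(imagined, question)  # answer is read off the imagined view


def opposite_side(observed: View, question: str) -> View:
    """Toy 'generator': seen from the far side, near/far and left/right both flip."""
    return [row[::-1] for row in observed[::-1]]


def nearest_left(imagined: View, question: str) -> str:
    """Toy 'reader': report the object at the near-left corner of the imagined view."""
    return imagined[0][0]


model = ImagineThenAnswer(imagine=opposite_side, read_answer=nearest_left)
observed = [["lamp", "cup"], ["book", "vase"]]
answer = model(observed, "From the opposite side, what is nearest on the left?")
print(answer)
```

The point of the structure is that `read_answer` never sees the original view or a text description of it, only the imagined one, so any error in the imagined view flows straight into the answer.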

The system uses BAGEL — a unified model that can both understand and generate images — as the backbone, which is what makes “predict the pixels you’d see” a native operation rather than a bolt-on. Training spans three spatial skills: perspective taking (re-render from a new viewpoint), path tracing (follow a route through a scene), and multiview counting (reconcile object counts across several images of the same scene), totaling about 20K examples.
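To make the multiview counting task concrete, the sketch below shows why per-view counts cannot simply be added up. It is not the paper’s method: it assumes, purely for illustration, that every detection has already been placed in one shared scene frame, and the `count_distinct` helper and `merge_radius` threshold are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # assumed: object position in a shared scene frame


def count_distinct(views: List[List[Point]], merge_radius: float = 0.5) -> int:
    """Count objects across views, treating nearby detections as the same object."""
    merged: List[Point] = []
    for detections in views:
        for p in detections:
            # Keep a detection only if no already-kept object sits within the radius.
            if not any(math.dist(p, q) <= merge_radius for q in merged):
                merged.append(p)
    return len(merged)


# Three views of the same three objects, each view seeing two of them
# with slightly noisy positions.
views = [
    [(0.0, 0.0), (2.0, 0.0)],
    [(2.1, 0.1), (4.0, 0.0)],
    [(0.1, 0.0), (4.1, 0.0)],
]

naive_total = sum(len(v) for v in views)  # adds per-view counts, double-counting overlaps
reconciled_total = count_distinct(views)  # counts once per object in the shared scene
print(naive_total, reconciled_total)
```

The naive total counts each object twice because each appears in two views. Reconciliation only works here because the detections share a coordinate frame, which is the role the imagined, consistent scene plays in the approach the article describes.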

Why imagined pixels beat narrated steps

The win is that geometry stays in the visual modality end to end. Counting objects across multiple views fails when a model re-describes each view in text and loses track of which object is which; it works better when the model reconstructs a consistent visual scene and counts there. The paper also reports that IPT supervision and plain label-only supervision stack: training with both improves on either signal alone, so the imagined-view signal and the final-answer signal are not redundant. That is a useful, non-obvious result: it implies the intermediate render teaches something the answer label alone cannot.
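One plausible way to picture two non-redundant training signals is a loss with an answer term and an imagined-view term. The sketch below is an assumption, not the paper’s objective: this article does not say whether the signals are summed per example or mixed at the dataset level, and the mean-squared-error image term, the function names, and `image_weight` are illustrative stand-ins.

```python
import math
from typing import Sequence


def answer_nll(option_probs: Sequence[float], correct: int) -> float:
    """Label-only signal: negative log-likelihood of the correct answer option."""
    return -math.log(option_probs[correct])


def image_mse(imagined: Sequence[float], target: Sequence[float]) -> float:
    """Imagined-view signal: mean squared pixel error (a stand-in for the real image loss)."""
    return sum((p - t) ** 2 for p, t in zip(imagined, target)) / len(target)


def combined_loss(
    option_probs: Sequence[float],
    correct: int,
    imagined: Sequence[float],
    target: Sequence[float],
    image_weight: float = 1.0,  # hypothetical knob; 0.0 recovers label-only training
) -> float:
    """Sum of the answer term and the weighted imagined-view term."""
    return answer_nll(option_probs, correct) + image_weight * image_mse(imagined, target)


# Same answer confidence, but the second call's imagined view is wrong in one pixel.
good_view = combined_loss([0.5, 0.5], 0, imagined=[0.0, 0.0], target=[0.0, 0.0])
bad_view = combined_loss([0.5, 0.5], 0, imagined=[0.0, 1.0], target=[0.0, 0.0])
print(good_view < bad_view)
```

With identical answer probabilities, only the image term separates the two cases, which is the sense in which the intermediate render can carry information the answer label does not.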

Key results

Multiview counting: +3.4% accuracy from IPT over the relevant baseline — the headline gain.
Path tracing: competitive with closed-source models, meaning an open BAGEL-based system narrows the gap to proprietary VLMs on at least one spatial task.
IPT + label-only supervision stacks, yielding further improvement over either signal alone.
Text chain-of-thought sometimes degraded performance, the clearest evidence for the modality-mismatch claim — narrating geometry can be a net negative.
Scope is three tasks, ~20K examples — a focused study, not a broad benchmark sweep.

Limits and open questions

The gains are modest in absolute terms: 3.4% on one task is real but not transformative, and “competitive with closed models” is qualified, not a clean win. The evaluation covers three hand-picked spatial skills on a small dataset, so whether IPT helps on open-world navigation, robotics, or messy real photos is untested. Generating an intermediate image per query is more expensive than emitting text, and the paper does not foreground the added latency and compute cost. IPT also inherits whatever the BAGEL generator gets wrong: if the imagined view is hallucinated, the model is now reasoning over a confidently wrong picture. The honest read is that this is a strong directional result (spatial reasoning belongs in pixels, not prose) more than a finished, deployable recipe.

FAQ

What are Imaginative Perception Tokens (IPT)?

IPT are tokens a vision-language model emits that decode into an intermediate image showing what it would perceive under a different spatial configuration, such as a new viewpoint. The model reasons over that rendered view instead of describing the scene in text.

Why does text chain-of-thought hurt VLM spatial reasoning?

Narrating geometry in language is a modality mismatch: rotation, occlusion, and object re-identification are visual operations, and forcing them through a token sequence loses information. The paper found text chain-of-thought training sometimes lowered spatial accuracy rather than raising it.

How much does IPT improve spatial reasoning?

IPT improves multiview counting accuracy by 3.4% and stays competitive with closed-source models on path tracing. The model is trained on about 20K examples spanning perspective taking, path tracing, and multiview counting.

What model does IPT build on?

IPT uses BAGEL, a unified model that both understands and generates images, as its backbone. That dual capability is what lets the model render the viewpoint it imagines rather than only caption it.

One line: stop making vision models describe space in words — let them imagine the pixels and reason there. Read the original paper on arXiv.