Mirage: Latent Spatial Memory Makes Video World Models 10x Faster

Mirage stores a video world model's 3D memory inside diffusion latent space instead of an RGB point cloud, hitting state-of-the-art WorldScore (70.36) while running 10.57x faster and using 55x less GPU memory.

Quick answer

Mirage keeps a video world model spatially consistent by caching the scene’s 3D memory directly in diffusion latent space, not as an explicit RGB point cloud. That one change removes the expensive render-and-re-encode loop that previous methods repeat every chunk: Mirage runs end-to-end 10.57x faster, uses 55x less GPU memory, and still posts the top WorldScore average of 70.36, edging out the prior best (Spatia, 69.73). It is built on the Wan2.2-TI2V-5B video diffusion backbone.

The problem with RGB point-cloud memory

A video world model has to remember what it already generated, so that when the camera pans back to a wall it drew thirty frames ago, the wall is still there. The dominant way to enforce this is an explicit 3D point cloud held in RGB space: decode latents to pixels, lift the pixels into 3D using depth, render the point cloud from the next camera pose, and re-encode the rendered view back into latents for the next step.

That round trip is the bottleneck twice over. It is slow, because every chunk pays for VAE decoding, rendering, and VAE encoding. And it is lossy, because the diffusion model’s latent tokens carry rich learned features that get flattened the moment you collapse them to RGB and read them back. You are throwing away the representation the model actually thinks in.

How Mirage works

Mirage builds the memory in latent space and never leaves it. It lifts latent tokens into 3D with depth-guided back-projection, storing a persistent latent point cloud instead of an RGB one. To generate a new view, it queries that cache by warping the stored latents directly to the target camera pose, then feeds the warped latents to the diffusion model as a conditioning signal. No pixel decode, no re-encode.
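The lift-then-warp idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `backproject_latents` and `warp_to_view` are hypothetical, the camera is a simple pinhole whose intrinsics `K` are already scaled to the latent grid, depth is assumed to be available per latent token, and occlusion is resolved with a nearest-point rule that the source does not specify.

```python
import numpy as np

def backproject_latents(latents, depth, K, cam_to_world):
    """Lift an (h, w, c) latent grid into world-space points (hypothetical helper).

    depth: (h, w) depth per latent token; K: 3x3 pinhole intrinsics on the
    latent grid; cam_to_world: 4x4 camera pose. All three are assumptions.
    """
    h, w, c = latents.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Homogeneous coordinates of each token centre on the latent grid.
    pix = np.stack([u + 0.5, v + 0.5, np.ones((h, w))], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T              # camera-space rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)        # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((h * w, 1))], axis=1)
    pts_world = (pts_h @ cam_to_world.T)[:, :3]
    return pts_world, latents.reshape(-1, c)

def warp_to_view(points, feats, K, world_to_cam, h, w):
    """Project cached latent points into a target view, keeping the nearest point per cell."""
    c = feats.shape[1]
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    pts_cam = (pts_h @ world_to_cam.T)[:, :3]
    proj = pts_cam @ K.T
    z = proj[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)          # avoid dividing by zero or negatives
    u = np.floor(proj[:, 0] / safe_z).astype(int)
    v = np.floor(proj[:, 1] / safe_z).astype(int)
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    cell = (v * w + u)[valid]
    order = np.argsort(z[valid])                 # nearest points first
    hit, first = np.unique(cell[order], return_index=True)
    out = np.zeros((h * w, c), dtype=feats.dtype)
    mask = np.zeros(h * w, dtype=bool)
    out[hit] = feats[valid][order][first]        # first occurrence = nearest point
    mask[hit] = True
    return out.reshape(h, w, c), mask.reshape(h, w)
```

The returned mask marks which target cells received a cached latent; in a setup like the one described, the warped grid and some indication of its empty cells would be what the diffusion model receives as conditioning, though the exact conditioning format is not given in the text above.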

The backbone is Wan2.2-TI2V-5B, whose VAE uses a spatial stride of 16, a temporal stride of 4, and 48 latent channels. Each latent token therefore summarizes a 16x16 pixel block, so a latent point cloud holds 256x fewer points per frame than a per-pixel one, which is a large part of why keeping memory at that resolution is cheaper. Because warping happens on the same tokens the model conditions on, the geometry the model “sees” never makes a lossy detour through pixels.
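To make the stride arithmetic concrete, here is a small sketch that computes the latent grid for a clip. The strides and channel count come from the text; the `1 + (frames - 1) // t_stride` rule for latent frames assumes the common causal-VAE convention of encoding the first frame on its own, and the 121-frame 704x1280 clip is only an illustrative input.

```python
def latent_grid(frames, height, width, t_stride=4, s_stride=16, channels=48):
    """Latent tensor shape (frames, h, w, channels) for a video clip.

    Assumes a causal video VAE where the first frame is encoded alone.
    """
    latent_frames = 1 + (frames - 1) // t_stride
    return latent_frames, height // s_stride, width // s_stride, channels

def points_per_frame_ratio(height, width, s_stride=16):
    """How many pixels one latent token stands in for, per frame."""
    h, w = height // s_stride, width // s_stride
    return (height * width) / (h * w)

print(latent_grid(121, 704, 1280))        # (31, 44, 80, 48)
print(points_per_frame_ratio(704, 1280))  # 256.0
```

The 256x figure counts points only. A latent point carries 48 channels against 3 for RGB, so this ratio is not the same thing as the measured 55x memory saving, which depends on the whole pipeline.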

One design choice does much of the work: a dynamic-region filter. Moving objects have unreliable per-frame geometry, so Mirage excludes them from the persistent memory and caches only rigid scene structure. That keeps the 3D cache clean, but it also defines the method’s main limit, covered under “Limits and open questions”.
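One way such a filter could look in code is sketched below. How Mirage actually detects dynamic regions is not described here, so everything in this block is an assumption: the input is a hypothetical per-pixel boolean motion mask from some external detector, it is pooled to the latent grid with the 16-pixel stride, and a token is dropped if any pixel in its block is flagged.

```python
import numpy as np

def static_token_mask(dynamic_pixels, stride=16, max_dynamic_frac=0.0):
    """Pool a per-pixel motion mask (H, W) down to a per-token static mask (h, w).

    A token counts as static when the fraction of flagged pixels in its
    stride x stride block is at most max_dynamic_frac (a hypothetical knob).
    """
    H, W = dynamic_pixels.shape
    h, w = H // stride, W // stride
    blocks = dynamic_pixels[: h * stride, : w * stride].reshape(h, stride, w, stride)
    dynamic_frac = blocks.mean(axis=(1, 3))
    return dynamic_frac <= max_dynamic_frac

def filter_cache_update(points, feats, static_mask):
    """Keep only the points and latent features whose tokens are static."""
    keep = static_mask.reshape(-1)
    return points[keep], feats[keep]
```

Only the tokens that survive `filter_cache_update` would be appended to the persistent cache; the dropped ones are exactly the content that later views cannot recover from memory.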

Key results

  • WorldScore average 70.36, state-of-the-art, ahead of the prior best baseline Spatia at 69.73 — a narrow win on the headline score.
  • 3D Consistency 92.21 and Photometric Consistency 93.95, the metrics that directly measure whether revisited geometry stays put — this is where spatial memory should pay off.
  • Static Score 73.60 and Dynamic Score 67.11; the 6.49-point gap is consistent with the dynamic-region filter, since static, rigid scenes score higher than motion-heavy ones.
  • 10.57x faster end-to-end generation versus explicit RGB point-cloud memory.
  • 55x lower GPU memory footprint than those explicit 3D baselines.

The efficiency numbers are the real story. The WorldScore lead over Spatia is small enough that, on quality alone, this would be an incremental paper; it is the 10.57x speedup and 55x memory reduction at equal-or-better quality that make it matter.

Why this matters now

Video world models are moving toward interactive, long-horizon generation — game-like environments and embodied simulators where the camera roams freely and the scene has to persist for minutes, not seconds. The explicit-point-cloud approach does not scale to that: the render/encode cost grows with every remembered chunk. By keeping memory in latent space, Mirage attacks the exact cost that was blocking longer rollouts, and it does so without retraining a new backbone — it wraps a released diffusion model (Wan2.2). That makes the idea easy to adopt and easy to compare against.

Limits and open questions

The dynamic-region filter is the headline caveat. Because moving entities are excluded from persistent memory, Mirage does not maintain the state of dynamic actors across chunks — a person who walks out of frame and back is not guaranteed to return consistently. Scenes dominated by motion benefit far less than rigid, geometry-heavy ones, and the Dynamic Score (67.11) trailing the Static Score (73.60) shows it in the numbers.

The quality margin over Spatia (70.36 vs 69.73) is thin, so anyone evaluating Mirage should weigh it as an efficiency win, not a quality leap. The latent warping also inherits the errors of whatever depth estimator it depends on, so any error in the depth-guided back-projection propagates into the cache. And every result is tied to one backbone, Wan2.2-TI2V-5B; whether latent spatial memory transfers cleanly to other VAEs with different strides and channel counts is untested here.
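The depth caveat can be made concrete with a little pinhole geometry. For a camera that translates sideways by a baseline b, a point at depth Z reprojects with a horizontal offset of f·b/Z, so a wrong depth Z' misplaces the cached token by f·b·|1/Z − 1/Z'| grid cells. The sketch below computes that shift; the function name and the example numbers (a focal length of 50 latent cells, a 0.5 unit baseline) are illustrative assumptions, not values from the paper.

```python
def reprojection_shift(depth, rel_err, focal, baseline):
    """Misplacement (in grid cells) of a warped point whose depth is off by rel_err.

    Assumes a pinhole camera and a pure sideways translation of `baseline`
    between the caching view and the target view.
    """
    wrong_depth = depth * (1.0 + rel_err)
    true_offset = focal * baseline / depth
    wrong_offset = focal * baseline / wrong_depth
    return abs(true_offset - wrong_offset)

# A 10% depth overestimate at depth 2.0 moves the token by a bit over one cell.
shift = reprojection_shift(depth=2.0, rel_err=0.10, focal=50.0, baseline=0.5)
print(round(shift, 3))
```

Because the shift scales with the baseline and with 1/Z, the same relative depth error hurts more for nearby geometry and for larger camera moves, which is where a persistent cache is queried most aggressively.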

FAQ

What is latent spatial memory in Mirage?

It is a persistent 3D cache that stores a video world model’s scene memory inside the diffusion latent space instead of as an RGB point cloud. Latent tokens are lifted into 3D by depth-guided back-projection and queried by warping them to new camera poses, avoiding any pixel-space decode and re-encode.

How much faster is Mirage than point-cloud world models?

Mirage runs end-to-end 10.57x faster and uses 55x less GPU memory than explicit RGB point-cloud baselines, while reaching a higher WorldScore average (70.36 vs 69.73 for Spatia).

What backbone does Mirage use?

Mirage is built on the Wan2.2-TI2V-5B video diffusion model, whose VAE has a spatial stride of 16, temporal stride of 4, and 48 latent channels.

What is Mirage’s main weakness?

It drops moving objects from its persistent memory via a dynamic-region filter, so it does not keep dynamic actors consistent across chunks. Motion-heavy scenes gain much less than rigid-geometry scenes, and its WorldScore lead over the prior best is narrow.

One line: store the world model’s 3D memory in latent space, skip the pixel round trip, and get 10x speed at SOTA WorldScore. Read the original paper on arXiv.