Microsoft Research's Mirage

Mirage is a new video world model from Microsoft Research and several universities. It gives video generation a persistent spatial memory, so the model keeps track of a scene's spatial structure even during long camera moves. Instead of relying on pixel-based memory, Mirage stores internal image features directly in 3D space. Skipping the costly detour through pixels speeds up generation, reduces memory usage, and improves consistency.

Introduction to Mirage

Video world models typically turn a starting frame and a camera path into plausible moving images. Mirage takes a different approach to memory: it stores internal image features in 3D space, each one an entry in spatial memory, and projects this store straight onto the target camera to generate new viewpoints. This eliminates the need to render a point cloud and re-encode it, reducing both compute cost and information loss.
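The projection step can be illustrated with a minimal sketch. It is not Mirage's actual implementation: it assumes a simple pinhole camera, a world-to-camera rotation and translation, and a nearest-point-wins rule per grid cell. All names (MemoryEntry, Camera, project_memory) are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class MemoryEntry:
    position: Vec3        # 3D point in world coordinates
    feature: List[float]  # internal image feature stored at that point


@dataclass
class Camera:
    rotation: Tuple[Vec3, Vec3, Vec3]  # world-to-camera rotation, row-major (assumed convention)
    translation: Vec3                  # world-to-camera translation
    fx: float
    fy: float
    cx: float
    cy: float
    width: int   # feature-grid width
    height: int  # feature-grid height


def world_to_camera(cam: Camera, p: Vec3) -> Vec3:
    """Apply the rigid transform x_cam = R * x_world + t."""
    x, y, z = (
        sum(row[i] * p[i] for i in range(3)) + t
        for row, t in zip(cam.rotation, cam.translation)
    )
    return (x, y, z)


def project_memory(memory: List[MemoryEntry], cam: Camera) -> Dict[Tuple[int, int], List[float]]:
    """Project stored features onto the target camera's feature grid.

    Returns {(u, v): feature}. When several entries land in the same cell,
    the one closest to the camera is kept (a simple z-buffer).
    """
    depth: Dict[Tuple[int, int], float] = {}
    grid: Dict[Tuple[int, int], List[float]] = {}
    for entry in memory:
        x, y, z = world_to_camera(cam, entry.position)
        if z <= 0:
            continue  # behind the camera
        u = math.floor(cam.fx * x / z + cam.cx)
        v = math.floor(cam.fy * y / z + cam.cy)
        if not (0 <= u < cam.width and 0 <= v < cam.height):
            continue  # outside the target view
        if (u, v) not in depth or z < depth[(u, v)]:
            depth[(u, v)] = z
            grid[(u, v)] = entry.feature
    return grid
```

The features themselves are never decoded to pixels in this sketch; they are only moved to the grid cells where the target camera would see them, which is the idea the paragraph above describes.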

Technical Details

Mirage builds videos in segments. It seeds the spatial memory from the starting image, then, for each segment, pulls the relevant data from memory to generate new frames and writes their contents back to the cache. Before anything is written to memory, a filter strips out moving objects and the sky, so that only stable geometry is stored. The researchers built on Alibaba's open-source video model Wan2.2, adding a small add-on module that teaches the model to use the new memory and fine-tuning the result with LoRA adapters.
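The segment-by-segment loop can be sketched roughly as follows. This is an illustration under stated assumptions, not the researchers' code: the per-point label used for filtering, the generate_segment callback, and the toy generator are all hypothetical stand-ins, and the sketch assumes the same filter is applied when seeding memory from the starting image.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class FeaturePoint:
    position: Tuple[float, float, float]  # 3D location of the feature
    feature: List[float]                  # internal image feature
    label: str                            # hypothetical label: "static", "dynamic", or "sky"


# A segment generator takes the current memory and the cameras for one segment,
# and returns the new frames plus the feature points extracted from them.
SegmentFn = Callable[[List[FeaturePoint], Sequence[int]], Tuple[List[str], List[FeaturePoint]]]


def keep_stable(points: List[FeaturePoint]) -> List[FeaturePoint]:
    """Drop moving objects and sky so only stable geometry reaches memory."""
    return [p for p in points if p.label == "static"]


def generate_video(
    start_points: List[FeaturePoint],
    camera_segments: Sequence[Sequence[int]],
    generate_segment: SegmentFn,
) -> Tuple[List[str], List[FeaturePoint]]:
    memory = keep_stable(start_points)  # seed memory from the starting image
    frames: List[str] = []
    for cams in camera_segments:
        new_frames, new_points = generate_segment(memory, cams)  # read from memory
        frames.extend(new_frames)
        memory.extend(keep_stable(new_points))                   # write back, filtered
    return frames, memory


def toy_segment(memory: List[FeaturePoint], cams: Sequence[int]):
    """Stand-in for the video model: one placeholder frame per camera index."""
    frames = [f"frame@{c} (memory={len(memory)})" for c in cams]
    points = [FeaturePoint((float(c), 0.0, 1.0), [0.0], "static") for c in cams]
    points.append(FeaturePoint((0.0, 0.0, 5.0), [0.0], "dynamic"))  # gets filtered out
    return frames, points
```

Calling generate_video with toy_segment shows the memory growing only by the static points of each segment, while dynamic and sky points never enter the cache.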

Performance and Efficiency

On the WorldScore benchmark, Mirage outperforms its closest rival, Spatia, as well as general video generators like Wan2.1 and CogVideoX. It excels at holding a scene's spatial structure together and keeping surfaces looking consistent across many frames. In the closed-loop test on the RealEstate10K dataset, it also leads on two of three metrics, demonstrating its ability to maintain consistency over long camera paths. Mirage's compute cost per frame remains nearly flat across the whole run, while rival models become increasingly demanding; the researchers report up to 10.57x faster generation and up to 55x lower memory usage.

Broader Impact

Video world models are one of the hottest research areas in AI video. Models like Veo and Genie 3 have shown promising results in generating interactive environments, and Mirage's ability to maintain consistent spatial structure over time could enable new applications in fields like simulation, gaming, and virtual reality.

Key Takeaways