Institution

Fudan University

A leading comprehensive research university in Shanghai, China, with active computer vision and multimodal AI research groups.

Multimodal Models · Fudan University

ARM: An AutoRegressive Multimodal Model with Unified Discrete Tokens

ARM is a 7B autoregressive model that handles image understanding, text-to-image generation, and image editing in a single next-token framework with a shared discrete tokenizer; RL raises GenEval from 0.79 to 0.86 and GEdit-EN overall from 5.75 to 6.68.

Theorem Proving · MiniMax AI

MaxProof: How MiniMax M3 Reaches Gold-Level Proof Scores

MaxProof turns MiniMax-M3 into a generator, verifier, fixer, and ranker; with population-level test-time scaling it reports 35/42 on IMO 2025 and 36/42 on USAMO 2026.

Agent Memory · ByteDance

TaskMem: Teaching a Video Agent What Is Worth Remembering

TaskMem trains a multimodal agent to write its own memory with RL, lifting streaming-video QA accuracy to 67.9% on VideoMME and 45.4% on EgoLife, gains of 6.3 and 7.0 points over the Qwen3-VL-30B baseline.

World Models · Fudan University

WBench: A Multi-turn Benchmark for Interactive Video World Models

WBench scores interactive video world models on five axes (quality, setting, interaction, consistency, and physics) across 289 cases and 1,058 turns, and finds that no single model wins on all five.