MaxProof: How MiniMax M3 Reaches Gold-Level Proof Scores
MaxProof turns MiniMax-M3 into a generator, verifier, fixer, and ranker; with population-level test-time scaling it reports 35/42 on IMO 2025 and 36/42 on USAMO 2026.
Institution
A leading Chinese research university in Beijing, prominent in natural language processing, multimodal learning, and AI research.
MaxProof turns MiniMax-M3 into a generator, verifier, fixer, and ranker; with population-level test-time scaling it reports 35/42 on IMO 2025 and 36/42 on USAMO 2026.
MiniMax Sparse Attention keeps only 2,048 selected KV tokens per query group and reports 28.4x lower attention FLOPs plus 14.2x prefill speedup at 1M context.
AI Agents · TokenRhythm Technologies
Claw-SWE-Bench evaluates OpenClaw-style coding-agent harnesses on 350 GitHub issue tasks. OpenClaw jumps from 19.1% to 73.4% Pass@1 with a full adapter.
Multimodal Models · Peking University
A survey that reframes long-video MLLMs as three abilities (watch, remember, reason), comparing against 11 prior surveys and organizing 100+ methods plus 5 application domains.
LLM Reasoning · Samsung Research
TrOPD masks on-policy distillation to the tokens where the teacher is actually trustworthy, adding +3.06 to +3.52 average points over standard OPD on math, code, and STEM benchmarks with 1.5B-1.7B students.
Video2GUI turns 500M unlabeled tutorial videos into WildGUI — 12M grounded GUI interaction trajectories across 1,500+ apps and sites — and pretraining Qwen2.5-VL and Mimo-VL on it lifts GUI benchmarks by 5-20%.