Multimodal Models · Peking University
Watch, Remember, Reason: A Human-View Map of Video MLLMs
A survey that reframes long-video MLLMs around three abilities (watch, remember, reason), compares this framing against 11 prior surveys, and organizes 100+ methods and 5 application domains.