Institution

Shanghai AI Laboratory

A Chinese AI research institute working on large models, agents, and AI safety.

Multimodal Models · Shanghai AI Laboratory

InternVideo3: An 8B Video Agent with Contextual Reasoning

InternVideo3 is an 8B video model that scores 73.8 on Video-MME in a single pass, then gains another 2.7 from an agentic reasoning loop on top. M2LA attention holds 768K tokens on one H200, where the base model runs out of memory at 512K.

Multimodal Models · Nanjing University

HYDRA-X: One Visual Tokenizer for Images and Video

HYDRA-X unifies image and video tokenization in one ViT; tubelet attention and hierarchical temporal patchify improve DAVIS rFVD to 11.19 and editing overall to 4.34.

Vision-Language-Action · Zhejiang University

LabVLA: A VLA Model for Scientific Lab Robots

LabVLA trains a Qwen3-VL-4B backbone plus a DiT action expert on laboratory workflows and reports 71.1% in-distribution (ID) and 70.0% out-of-distribution (OOD) success on LabUtopia.

Multimodal Models · Shanghai AI Laboratory

OVO-S-Bench: Streaming Spatial Intelligence in MLLMs

OVO-S-Bench turns streaming spatial intelligence in MLLMs into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Shanghai AI Laboratory

ResearchClawBench: Testing Autonomous Research Agents

ResearchClawBench turns the evaluation of end-to-end scientific research agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

LLM Reasoning · Shanghai AI Laboratory

ThoughtFold: Cutting 56% of Reasoning Tokens Without Losing Accuracy

ThoughtFold trims the redundant reasoning of DeepSeek-R1-Distill-Qwen-7B by about 56% of tokens while keeping accuracy on AIME, MATH-500, and GPQA-Diamond intact, using a masked preference objective.

Efficient AI · Shanghai AI Laboratory

Draft-OPD: On-Policy Distillation Pushes Speculative Decoding Past 5x

Draft-OPD trains speculative draft models on the states their own drafting induces, not just on target transcripts. On Qwen3 thinking models it reaches 4.86x to 4.89x speedup, beating EAGLE-3 by 23 percent and DFlash by 13 percent.

LLM Reasoning · Shanghai AI Laboratory

SU-01: Gold-Medal Olympiad Reasoning from a 30B Open Model

SU-01, a 30B-A3B open model from Shanghai AI Lab, hits 35 points on IMO 2025 and clears gold lines at IPhO 2024/2025 using only ~338K short SFT trajectories plus a 200-step two-stage RL pipeline.

AI Agents · Shanghai AI Laboratory

AgentDoG 1.5: A Lightweight Guardrail for AI Agent Safety

AgentDoG 1.5 trains 0.8B-8B agent-safety guard models on only ~1k samples, hits 92.2% accuracy on R-Judge with the 4B variant, rivals GPT-5.4, and cuts agentic-RL deployment overhead by two orders of magnitude.

AI Agents · Shanghai AI Laboratory

Pi-Bench: Can AI Assistants Anticipate What You Did Not Say?

Pi-Bench scores agents on proactivity, not just task completion, across 100 long-horizon tasks. The best model, GPT-5.4, hits only 67.0% proactivity, and removing prior sessions drops it 9.5 points.

Multimodal Models · Shanghai AI Laboratory

CiteVQA: A Benchmark That Catches Document AI Citing the Wrong Evidence

CiteVQA makes document QA models return bounding-box citations with every answer. The top model scores 76.0 Strict Attributed Accuracy; the best open model just 22.5. Most answer right but cite the wrong region.

AI Agents · Shanghai AI Laboratory

COLLEAGUE.SKILL: Turning One Person's Expertise Into a Portable AI Skill

COLLEAGUE.SKILL distills one person's work traces into a versioned skill package with two tracks, capability and bounded behavior, that any agent can install, correct, and roll back. The open repo reports ~18.5k stars.

Speech Recognition · Shanghai AI Laboratory

Mega-ASR: Scaling Acoustic Simulation for In-the-Wild Speech Recognition

Mega-ASR fights ASR's noise-robustness gap by synthesizing 2.4M clips across 54 compound acoustic scenarios, then training Qwen3-ASR-1.7B in two stages, cutting WER on VOiCES R4-B-F to 45.69% versus 54.01%.

Long Context · Shanghai AI Laboratory

δ-mem: An 8×8 Online Memory That Boosts Frozen LLMs

δ-mem bolts a tiny 8×8 delta-rule memory onto a frozen LLM and lifts average long-memory scores 1.10× over the backbone and 1.15× over other memory methods, with no fine-tuning and no context extension.

Vision-Language-Action · Shanghai AI Laboratory

PhysBrain 1.0: Turning Human Video into Physical Priors for Robots

PhysBrain 1.0 compiles human egocentric video into physics QA to pretrain a VLM, then adapts it to robot control, lifting Franka grasping from 47.1% to 63.3% over 50 trials versus a pi0.5 baseline.