Vision-Language-Action

Models that map perception and language directly to robot actions.

Vision-language-action (VLA) models let robots draw on the semantic knowledge that vision and language models already encode. The goal is not only to recognize objects or parse instructions, but to produce continuous or tokenized actions that work on real hardware.

This topic is early but strategically important. RT-2 shows how web-scale vision-language knowledge can transfer into robot control by representing actions in a language-like format. π0 pushes toward a more general robot policy with flow matching over continuous actions. The hard open questions are data collection, safety, embodiment transfer, latency, and whether a single policy can handle the physical diversity of real robots.
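The two action representations contrasted above can be sketched in a few lines. This is a toy illustration, not the code of RT-2 or π0: the 256-bin uniform discretization, the [-1, 1] action range, and every function name here are assumptions made for the example, and `toy_velocity` is a hypothetical stand-in for a learned network. The first half turns a continuous action into integer tokens a language model could emit; the second half integrates a velocity field from noise to an action, which is the sampling step of a flow-matching policy.

```python
from typing import Callable, List, Sequence

NUM_BINS = 256  # assumption: uniform bins per action dimension


def discretize(action: Sequence[float], low: float = -1.0, high: float = 1.0,
               num_bins: int = NUM_BINS) -> List[int]:
    """Map each continuous action dimension to an integer bin id (a 'token')."""
    tokens = []
    for a in action:
        clipped = min(max(a, low), high)
        frac = (clipped - low) / (high - low)
        # frac == 1.0 would index one past the last bin, so clamp it.
        tokens.append(min(int(frac * num_bins), num_bins - 1))
    return tokens


def undiscretize(tokens: Sequence[int], low: float = -1.0, high: float = 1.0,
                 num_bins: int = NUM_BINS) -> List[float]:
    """Recover the bin-center value for each token."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]


Velocity = Callable[[List[float], float], List[float]]


def integrate_flow(velocity: Velocity, noise: Sequence[float],
                   steps: int = 10) -> List[float]:
    """Euler-integrate da/dt = velocity(a, t) from t=0 (noise) toward t=1."""
    a = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so (1 - t) is never zero
        v = velocity(a, t)
        a = [ai + dt * vi for ai, vi in zip(a, v)]
    return a


def toy_velocity(target: Sequence[float]) -> Velocity:
    """Hypothetical stand-in for a learned model: the velocity of a
    straight-line path from the current point to a fixed target action."""
    def velocity(a: List[float], t: float) -> List[float]:
        return [(g - ai) / (1.0 - t) for ai, g in zip(a, target)]
    return velocity


if __name__ == "__main__":
    tokens = discretize([-1.0, 0.0, 1.0])
    print(tokens)                # [0, 128, 255]
    print(undiscretize(tokens))  # bin centers, each within half a bin width
    print(integrate_flow(toy_velocity([0.3, -0.7]), noise=[1.5, 0.2]))
```

The tokenized route loses at most half a bin width per dimension but lets the policy reuse a language model's output head unchanged. The flow route keeps actions continuous at the cost of several network evaluations per sample, which is one reason latency appears among the open questions.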

Start here

Vision-Language-Action · Physical Intelligence

π0 Explained: A Vision-Language-Action Flow Model for Robots

π0 bolts a flow-matching action expert onto a pretrained VLM and emits action chunks at ~50 Hz, so one policy can fold laundry, bus tables, and assemble boxes across single-arm, dual-arm, and mobile robots.

Vision-Language-Action · Google DeepMind

RT-2 Explained: Vision-Language-Action Models for Robot Control

RT-2 co-fine-tunes a web-pretrained vision-language model on robot trajectories, expresses actions as text tokens, and gets emergent generalization to novel objects, unseen commands, and basic reasoning across 6k trials.

Foundational papers

Vision-Language-Action · Google DeepMind

RT-2 Explained: Vision-Language-Action Models for Robot Control

Vision-Language-Action · Physical Intelligence

π0 Explained: A Vision-Language-Action Flow Model for Robots

Vision-Language-Action · Allen Institute for AI

MolmoAct2: An Open Action Reasoning Stack for Real Robots

MolmoAct2 is an open vision-language-action stack that reasons in 3D before acting. On real-world DROID it hits 87.1% success, +38.7 points over the runner-up, and its Molmo2-ER brain beats GPT-5 and Gemini Robotics ER.

Vision-Language-Action · RLWRLD

RLDX-1: A Multi-Stream Vision-Language-Action Model for Dexterous Robots

RLDX-1, from RLWRLD and KAIST, adds motion, memory and tactile streams to a Qwen3-VL backbone. It catches fast-moving objects 87.5% of the time vs 29.2% for pi0.5, and beats GR00T N1.6 on LIBERO-Plus 86.7% to 72.6%.

Recent papers

Robotics · Peking University

DragMesh-2: Dexterous Articulated Manipulation Through Contact

DragMesh-2 opens doors and drawers with a 51-DoF hand and no actuator on the object joint, so motion comes only from contact. PICA training hits 0.89 success at nominal damping and 0.56 at 4x, with no tactile sensing.

Vision-Language-Action · CASIA

World Pilot: Steering a VLA Policy with World-Action Priors

World Pilot adds two world-model priors to a VLA policy and hits 84.7% total success on LIBERO-Plus zero-shot OOD, up 4.2 points over the ABot-M0 base. The scene prior works even when it comes from a world model that was never trained on actions.

Vision-Language-Action · ACE Robotics

ACE-Ego-0: Unifying Egocentric Human and Robot Data for VLA Pretraining

ACE-Ego-0 pretrains a VLA on 6,000+ hours of data that mixes robot trajectories with human egocentric video converted into pseudo-actions. It averages 78.3% on six real bimanual tasks vs 71.7% for pi0.5 and 35.6% for GR00T-N1.7.

Vision-Language-Action · X Square Robot

WALL-WM: Event-Grounded World Action Modeling for Robots

WALL-WM organizes VLA pretraining around semantic action events, not fixed-length chunks. Its event mode scores 75.86 Task Progress on diverse real-robot manipulation versus 55.64 for pi0.5.

Vision-Language-Action · Zhejiang University

LabVLA: A VLA Model for Scientific Lab Robots

LabVLA trains a Qwen3-VL-4B backbone plus DiT action expert on laboratory workflows and reports 71.1% ID and 70.0% OOD success on LabUtopia.

World Models · Independent Researcher

AnchorWorld: Egocentric World Simulation for Embodied AI

AnchorWorld turns egocentric world simulation into a checkable test, with concrete failure signals, benchmark limits, and takeaways for builders.

Robotics · Independent Researcher

TVRBench: Can Models Move to a Target Viewpoint?

TVRBench turns active 3D viewpoint reproduction into a checkable test, with concrete failure signals, benchmark limits, and takeaways for builders.

Robotics · Tsinghua University

Humanoid-GPT: A GPT-Style Transformer for Humanoid Motion Tracking

Humanoid-GPT treats humanoid control like language modeling: a causal Transformer distilled from ~384 PPO experts on a 2-billion-frame corpus, 200x the size of prior data. It reports 92.58% success in simulation, in under 1.5 ms.

AI Agents · Shanghai Jiao Tong University

MMSkills: Multimodal Skill Packages for General Visual Agents

MMSkills packages textual procedures, runtime state cards, and keyframes into reusable skills for visual agents, lifting Qwen3-VL-235B from 21.34% to 39.17% on OSWorld and a small 8B model from 10.78% to 25.40%.

Vision-Language-Action · Allen Institute for AI

MolmoAct2: An Open Action Reasoning Stack for Real Robots

Vision-Language-Action · Shanghai AI Laboratory

PhysBrain 1.0: Turning Human Video into Physical Priors for Robots

PhysBrain 1.0 compiles human egocentric video into physics QA to pretrain a VLM, then adapts it to robot control, lifting Franka grasping from 47.1% to 63.3% over 50 trials versus a pi0.5 baseline.

Vision-Language-Action · Alibaba Qwen Team

Qwen-VLA: One Model for Manipulation, Navigation, and Trajectories

Qwen-VLA extends Qwen's vision-language stack with a DiT action decoder and embodiment-aware prompts to run manipulation, navigation, and trajectory prediction in one model, scoring 97.9% on LIBERO and 69.0% OSR on R2R.

Vision-Language-Action · RLWRLD

RLDX-1: A Multi-Stream Vision-Language-Action Model for Dexterous Robots

Vision-Language-Action · ETH Zurich

Robots Need More than VLA and World Models: Four Missing Interfaces

A position paper from ETH Zurich, Stanford and TU Darmstadt argues that scaling VLA and world models is not enough: robots need four interfaces to turn unstructured human and video behaviour into grounded supervision.

Vision-Language-Action · Physical Intelligence

π0 Explained: A Vision-Language-Action Flow Model for Robots

Vision-Language-Action · Google DeepMind
