Vision-Language-Action

Models that map perception and language directly to robot actions.

Vision-language-action (VLA) models aim to ground robot control in the same semantic representations that vision and language models already learn. The goal is not only to recognize objects or parse instructions, but to produce actions, either continuous or tokenized, that work on real hardware.

This topic is early but strategically important. RT-2 shows how web-scale vision-language knowledge can transfer into robot control by representing actions in a language-like format. π0 pushes toward a more general robot policy with flow matching over continuous actions. The hard open questions are data collection, safety, embodiment transfer, latency, and whether a single policy can handle the physical diversity of real robots.
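
To make "representing actions in a language-like format" concrete, here is a minimal sketch of uniform action binning: each continuous action dimension is clipped to a range and mapped to an integer token, which a language-model-style decoder could then predict. This is a generic illustration, not RT-2's exact scheme. The function names, the bin count of 256, and the [-1, 1] action range are all assumptions chosen for the example.

```python
from typing import List, Sequence

NUM_BINS = 256  # assumption: number of uniform bins per action dimension


def tokenize_action(
    action: Sequence[float],
    low: Sequence[float],
    high: Sequence[float],
    num_bins: int = NUM_BINS,
) -> List[int]:
    """Map each continuous action dimension to an integer bin index."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        clipped = min(max(a, lo), hi)      # keep the value inside [lo, hi]
        frac = (clipped - lo) / (hi - lo)  # normalize to [0, 1]
        # The upper bound maps to num_bins, so clamp it to the last bin.
        tokens.append(min(int(frac * num_bins), num_bins - 1))
    return tokens


def detokenize_action(
    tokens: Sequence[int],
    low: Sequence[float],
    high: Sequence[float],
    num_bins: int = NUM_BINS,
) -> List[float]:
    """Recover an approximate continuous action from bin centers."""
    return [
        lo + (t + 0.5) / num_bins * (hi - lo)
        for t, lo, hi in zip(tokens, low, high)
    ]


# Hypothetical 3-dimensional action, each dimension in [-1, 1].
low = [-1.0, -1.0, -1.0]
high = [1.0, 1.0, 1.0]
action = [0.0, -1.0, 1.0]

tokens = tokenize_action(action, low, high)
recovered = detokenize_action(tokens, low, high)
```

The round trip loses at most half a bin width per dimension, which is the basic trade-off of tokenized actions: they fit a language model's output vocabulary but quantize the control signal. Approaches that model continuous actions directly, such as the flow matching used by π0, avoid this quantization step.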

Start here

Foundational papers

Recent papers