Humanoid-GPT: A GPT-Style Transformer for Humanoid Motion Tracking

Quick answer

Humanoid-GPT is a GPT-style causal Transformer that controls a full humanoid body to track reference motions zero-shot, trained on a 2-billion-frame retargeted motion corpus the authors call 200x larger than prior tracking datasets. On the simulation benchmark the large variant reaches 92.58 percent tracking success with a mean per-keypoint error of 40.99 mm, and it runs in under 1.5 ms per step on a single RTX 4090. The work comes from Tsinghua University with Galbot, Shanghai Jiao Tong University, Peking University, and Shanghai Qi Zhi Institute, and has been accepted to CVPR 2026.

The core bet: motion tracking has been bottlenecked by tiny datasets and shallow MLP policies, so the fix is to scale both the data and the model the way language modeling did.

The problem: trackers were too small and too narrow

Prior whole-body motion trackers were shallow MLPs trained on scarce data, and they hit a documented agility-versus-generalization trade-off: a policy tuned for dynamic motions overfits them and fails on unseen clips, while a policy trained broadly goes soft on agility. With only thousands to tens of thousands of motion clips available, there was no way to train a single network that both tracks highly dynamic motion and generalizes to motions it never saw.

Humanoid-GPT reframes this as a data-and-capacity problem rather than an algorithm problem. If you can assemble enough motion and a model large enough to absorb it, a single policy should generalize the way large language models generalize across text.

The method: experts, then distill into one Transformer

The pipeline runs in three stages.

First, data curation. The authors unify the major mocap datasets — AMASS, LAFAN1, MotionMillion, PHUMA — with large-scale in-house recordings, then retarget every clip to the humanoid and apply time-warping augmentation, producing a 2-billion-frame corpus.
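
Time-warping is the easiest piece of the curation stage to picture in code. The sketch below resamples a clip to play faster or slower using linear interpolation. The function name `time_warp`, the `speed` parameter, and the choice of linear interpolation are assumptions for illustration; the summary does not say how the authors implement the augmentation.

```python
def time_warp(frames, speed):
    """Resample a clip so it plays `speed` times faster (speed > 0).

    `frames` is a list of poses, each pose a list of floats sampled at a
    fixed frame rate. Linear interpolation is an assumption of this sketch.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    if len(frames) < 2:
        return [list(f) for f in frames]
    last = len(frames) - 1
    # Number of output frames needed to cover the same motion at the new speed.
    n_out = max(2, round(last / speed) + 1)
    warped = []
    for i in range(n_out):
        # Position of output frame i on the original timeline, clamped to the end.
        t = min(i * speed, last)
        lo = int(t)
        hi = min(lo + 1, last)
        w = t - lo
        warped.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return warped


clip = [[0.0], [1.0], [2.0], [3.0], [4.0]]   # one joint, five frames
slow = time_warp(clip, 0.5)                   # half speed: nine frames
fast = time_warp(clip, 2.0)                   # double speed: three frames
```

One source clip yields several training clips at different tempos, which is how an augmentation like this can multiply a corpus without new recordings.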

Second, expert training. They cluster the motions using a Harmonic Motion Embedding and train roughly 384 PPO experts, each specialized on a motion cluster with keypoint-level tracking rewards. Splitting the corpus across many experts sidesteps the agility-generalization trade-off at the expert level — each expert only has to be good at its slice.
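
To make "keypoint-level tracking reward" concrete, here is a minimal sketch of one common shape for such a reward: an exponential of the mean squared keypoint error. The exponential kernel, the function name, and the `sigma` length scale are all assumptions; the summary does not give the paper's actual reward terms or weights.

```python
import math


def keypoint_tracking_reward(robot_kp, ref_kp, sigma=0.1):
    """Return a reward in (0, 1] that decays with mean squared keypoint error.

    robot_kp, ref_kp: lists of (x, y, z) positions in metres, same order.
    sigma: hypothetical length scale in metres; not a value from the paper.
    """
    if len(robot_kp) != len(ref_kp) or not robot_kp:
        raise ValueError("keypoint lists must be non-empty and equal length")
    sq_err = [math.dist(p, q) ** 2 for p, q in zip(robot_kp, ref_kp)]
    mean_sq_err = sum(sq_err) / len(sq_err)
    return math.exp(-mean_sq_err / sigma ** 2)


ref = [(0.0, 0.0, 1.0), (0.2, 0.0, 0.5)]
perfect = keypoint_tracking_reward(ref, ref)                                   # exact match
off = keypoint_tracking_reward([(0.0, 0.0, 1.1), (0.2, 0.0, 0.5)], ref)        # one keypoint 10 cm off
```

A bounded reward of this kind gives each expert a dense signal that is highest when every tracked keypoint sits on its reference.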

Third, distillation. A GPT-style Transformer with causal attention is trained via DAgger to consolidate all the experts into one policy that outputs per-joint PD targets. The Transformer variants span 22.1M to 80.4M parameters. This is the step that turns 384 narrow specialists into one generalist that tracks zero-shot.
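
The DAgger loop itself is short: let the student act, ask the expert what it would have done in the states the student actually visited, add those labels to a growing dataset, and retrain. The toy below shows that loop on a one-dimensional system with a linear student. Everything in it is a stand-in: the "expert" is a fixed linear rule rather than a PPO policy, the student is a single gain rather than a Transformer, and the scalar action stands in for a vector of per-joint PD targets.

```python
def expert_action(state):
    """Stand-in expert: a fixed linear controller (hypothetical, not a PPO policy)."""
    return -2.0 * state


def rollout(policy_gain, start, steps=20, dt=0.05):
    """Run the student in a toy 1-D system and return the states it visits."""
    states, s = [], start
    for _ in range(steps):
        states.append(s)
        s = s + dt * policy_gain * s   # the student's own action drives the system
    return states


def fit_gain(dataset):
    """Least-squares fit of action = gain * state on the aggregated dataset."""
    num = sum(s * a for s, a in dataset)
    den = sum(s * s for s, a in dataset)
    return num / den


def dagger(iterations=3, start=1.0):
    gain, dataset = 0.0, []           # untrained student
    for _ in range(iterations):
        visited = rollout(gain, start)                       # 1. student acts
        dataset += [(s, expert_action(s)) for s in visited]  # 2. expert relabels
        gain = fit_gain(dataset)                             # 3. retrain on all data
    return gain


student_gain = dagger()   # converges to the expert's gain of -2.0
```

The point of collecting states from the student's own rollouts, rather than the expert's, is that the student learns how to recover from its own mistakes.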

Why a causal Transformer, and why now

The causal attention is the structural claim. Motion tracking is a sequence problem: the next control target depends on the history of states and the reference trajectory, so a causal Transformer can condition on context the way an autoregressive language model conditions on previous tokens. The paper pairs that structure with scale and reports clean scaling behavior: success climbs with both model size and data size, the same trend that scaling laws describe in language modeling and that motivated scaling language models up.
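
What "causal" means mechanically: when computing the output for timestep t, the model may only look at timesteps 0 through t. A minimal single-head sketch in plain Python follows. It omits the learned query, key, and value projections, multiple heads, and everything else a real Transformer layer has, and it says nothing about the paper's actual token layout.

```python
import math


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of T vectors (lists of floats). Output step t
    is a weighted average of values[0..t] only.
    """
    dim = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Causal mask: score only positions 0..t, never the future.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
                  for j in range(t + 1)]
        peak = max(scores)                      # subtract max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out = [sum(w * values[j][d] for j, w in enumerate(weights))
               for d in range(len(values[0]))]
        outputs.append(out)
    return outputs


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # three timesteps, two features
y = causal_attention(x, x, x)

# Changing the last timestep must not change earlier outputs.
x_future_changed = x[:2] + [[5.0, -3.0]]
y2 = causal_attention(x_future_changed, x_future_changed, x_future_changed)
```

Because earlier outputs never depend on later inputs, the same network can be run step by step at deployment, one control tick at a time.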

Why now: humanoid hardware and large retargeted mocap corpora have only recently become available at this scale, and the “distill many RL experts into one Transformer” recipe is what lets a single deployable policy inherit the breadth of hundreds of specialists.

Key results

92.58 percent tracking success rate for Humanoid-GPT-L trained on the full 2-billion-frame corpus, in simulation.
40.99 mm mean per-keypoint position error (MPKPE); 0.0735 rad mean per-joint position error (MPJPE); 0.4820 rad/s mean per-joint velocity error (MPJVE); 0.1785 m/s root velocity error, all in simulation.
2 billion retargeted motion frames, described as 200x larger than prior tracking datasets.
~384 PPO motion experts distilled into a single Transformer of 22.1M-80.4M parameters.
Under 1.5 ms inference latency per step on an RTX 4090, fast enough for real-time control.
Real-world dance tracking on hardware: MPJPE 0.0856-0.1180 rad and MPJVE 0.6158-1.2362 rad/s across four dance motions.
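
For readers unfamiliar with the error metrics in this list, here is a sketch of how two of them are typically computed. The definitions are assumptions based on the metric names and units: MPKPE as the mean Euclidean distance between tracked and reference keypoints, and MPJPE as the mean absolute joint-angle error. The paper's exact averaging and keypoint set are not given in this summary.

```python
import math


def mpkpe_mm(pred_kp, ref_kp):
    """Mean Euclidean keypoint distance in millimetres.

    pred_kp, ref_kp: list of frames; each frame is a list of (x, y, z) in metres.
    """
    dists = [math.dist(p, r)
             for pf, rf in zip(pred_kp, ref_kp)
             for p, r in zip(pf, rf)]
    return 1000.0 * sum(dists) / len(dists)


def mpjpe_rad(pred_q, ref_q):
    """Mean absolute joint-angle error in radians over all frames and joints."""
    errs = [abs(p - r)
            for pf, rf in zip(pred_q, ref_q)
            for p, r in zip(pf, rf)]
    return sum(errs) / len(errs)


ref_kp = [[(0.0, 0.0, 1.0), (0.3, 0.0, 0.8)]]          # one frame, two keypoints
pred_kp = [[(0.0, 0.0, 1.04), (0.3, 0.03, 0.8)]]       # 40 mm and 30 mm off
ref_q = [[0.10, -0.50], [0.20, -0.40]]                 # two frames, two joints
pred_q = [[0.15, -0.50], [0.20, -0.30]]

kp_error = mpkpe_mm(pred_kp, ref_kp)      # about 35 mm
joint_error = mpjpe_rad(pred_q, ref_q)    # about 0.0375 rad
```

The velocity metrics follow the same pattern with joint or root velocities in place of positions.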

Limits and open questions

The paper notes that the scaling gains between 200M and 2B frames are marginal, which it reads as the onset of a data-limited regime for the current model capacity. Simply adding more frames at fixed model size may therefore not keep paying off, and the largest variant here is still only ~80M parameters. Real-world errors are visibly higher than simulation errors (MPJPE roughly 0.086-0.118 rad on hardware versus 0.0735 rad in sim), which points to a sim-to-real gap the paper does not fully close. The real-world evaluation shown is limited to four dance motions, so breadth on hardware is far narrower than the simulation benchmark. And the model tracks reference motions rather than generating goals: it is a tracker, not a planner, so where the reference trajectories come from is out of scope.
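
The size of that gap is easy to put in relative terms using only the numbers quoted above. Treat the result as a rough indication: the hardware figures cover four dance motions while the simulation figure covers the full benchmark, so the two are not measured on the same motions.

```python
SIM_MPJPE = 0.0735                      # rad, simulation benchmark
REAL_MPJPE_RANGE = (0.0856, 0.1180)     # rad, four dance motions on hardware


def relative_increase(real, sim):
    """Fractional increase of the hardware error over the simulation error."""
    return (real - sim) / sim


low, high = (relative_increase(r, SIM_MPJPE) for r in REAL_MPJPE_RANGE)
print(f"hardware MPJPE is {low:.1%} to {high:.1%} above the sim figure")
# prints: hardware MPJPE is 16.5% to 60.5% above the sim figure
```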

FAQ

What is Humanoid-GPT?

Humanoid-GPT is a GPT-style causal Transformer for zero-shot whole-body humanoid motion tracking, distilled from roughly 384 PPO experts trained on a 2-billion-frame retargeted motion corpus. It reaches 92.58 percent tracking success in simulation and runs under 1.5 ms per step on an RTX 4090.

How is Humanoid-GPT different from prior motion trackers?

Prior trackers were shallow MLPs trained on scarce data and stuck in an agility-versus-generalization trade-off. Humanoid-GPT scales the data (a corpus described as 200x larger) and replaces the MLP with a causal Transformer, distilling many specialist RL experts into one policy that generalizes zero-shot.

What data does Humanoid-GPT train on?

A 2-billion-frame corpus that unifies AMASS, LAFAN1, MotionMillion, and PHUMA with large-scale in-house mocap, all retargeted to the humanoid and augmented with time-warping.

How well does Humanoid-GPT track motion?

In simulation, the large variant hits 92.58 percent success, 40.99 mm mean per-keypoint error, and 0.0735 rad MPJPE. On real hardware across four dance motions, MPJPE ranges 0.0856-0.1180 rad, a measurable sim-to-real gap.

Can Humanoid-GPT run in real time on a robot?

Yes. Inference latency is under 1.5 ms per step on an RTX 4090, and the paper reports real-world deployment tracking dance motions on humanoid hardware.
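
A quick back-of-the-envelope check shows how much headroom that latency leaves. The 50 Hz control rate below is a hypothetical figure chosen for illustration, not one taken from the paper, and the latency was measured on a desktop RTX 4090, so on-robot compute may behave differently.

```python
INFERENCE_MS = 1.5          # upper bound reported on an RTX 4090
CONTROL_HZ = 50             # hypothetical control rate, not from the paper

max_policy_hz = 1000.0 / INFERENCE_MS          # back-to-back inference ceiling, about 667 Hz
period_ms = 1000.0 / CONTROL_HZ                # time available per control step
budget_used = INFERENCE_MS / period_ms         # share of the step spent in the policy
```

Under that assumption the policy would use well under a tenth of each control period, leaving the rest for sensing, state estimation, and communication.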

One line: scale the data 200x and swap the MLP for a causal Transformer, and one humanoid policy tracks dynamic motion zero-shot. Read the original paper on arXiv.