AI Agents · Code Generation · Alignment

WeaveBench: Hybrid Computer-Use Agents Still Fail

WeaveBench tests 114 real hybrid GUI plus CLI tasks; the best frontier pairing reaches only 41.2% PassRate, and final-only grading overstates GPT-5.5 by 20.2 points.

WeaveBench: Hybrid Computer-Use Agents Still Fail

Quick answer

WeaveBench is a benchmark for the kind of computer-use work that current agent demos often blur: a task may require reading a rendered desktop state, changing files or config through CLI/code, then returning to the GUI to verify the result. It contains 114 tasks across 8 real work domains and 23 subcategories. The headline number is not a win for frontier agents. The best reported model-runtime pairing, Claude Opus 4.7 inside Claude Code, reaches 41.2% PassRate. GPT-5.5 inside Codex CLI reaches 35.1%. The paper also shows why ordinary final-state grading is too generous: trajectory-aware auditing reduces GPT-5.5 from 53.5% to 33.3% PassRate.

Why hybrid interface use is the benchmark

The benchmark is built around channel non-substitutability. A task should not be solvable by rewriting everything into a browser-only, GUI-only, or CLI-only variant. That is a stronger admission rule than “the environment exposes several tools.” WeaveBench asks whether the agent can move information across interfaces when each interface carries different evidence.

The dataset shape supports that claim. The 114 tasks span desktop OS, documents, games, web frontends, data analysis, DevOps, spatial/CAD, and design work. The best live rollouts use a median of 76 tool calls and a median of 16 GUI to CLI channel switches. That matters because a one-step UI benchmark mostly tests perception and clicking, while WeaveBench tests state transfer, artifact discipline, and verification over a long trajectory.

The judge is also part of the contribution. It inspects deliverables, source files, screenshots, logs, and the action trace. It decomposes the task into clauses, scores eight dimensions, and zeroes cases with high-confidence shortcut evidence such as fake screenshots, regenerated fixtures, hard-coded metrics, mock services, duplicate crops, ground-truth leakage, or runtime injection. This is closer to an audit than a final answer check.

Key results

  • Best PassRate is 41.2% for Claude Opus 4.7 plus Claude Code, with 0.532 mean score over the full 114-task suite.
  • GPT-5.5 plus Codex CLI reaches 35.1% PassRate and 0.499 overall score; GPT-5.5 plus OpenClaw reaches 33.3%.
  • GUI-only and CLI-only ablations are nearly dead ends: Claude Opus 4.7 gets 1.8% GUI-only, 3.5% CLI-only, and 35.1% hybrid.
  • The benchmark has median 76 tool calls per successful live rollout and median 16 GUI/CLI switches per task.
  • Trajectory-aware judging cuts GPT-5.5 from 53.5% outcome-only PassRate to 33.3%, a 20.2 point correction.

What builders should take from it

The practical takeaway is not to copy the headline number blindly. WeaveBench is useful when a team can reproduce the paper’s setup and when the measured bottleneck matches its own product or research loop. The paper-specific evidence above tells builders where the gain comes from, what comparator was used, and which parts are still protocol-dependent. A good follow-up is to rerun the same idea on a local task distribution before treating it as a general capability upgrade.

Limits and open questions

The strongest evidence is about deployed CLI-agent runtimes with an added desktop-control plugin, not every possible CUA architecture. The judge is carefully designed, but it is still an agentic judge and sampled human audits cannot prove every verdict. WeaveBench also measures artifact-heavy desktop work; phone, enterprise SaaS, and browser-only agent settings may fail differently. The useful reading is narrow: if a computer-use agent claims production readiness, it should survive hybrid state transfer and trajectory audit, not just polished final screenshots.

The missing evidence that would change the judgment is a broader external replication: more independent harnesses, clearer release artifacts, and stress tests designed by groups that did not build the method. Until then, the paper is best read as a strong directional result with a concrete evaluation surface.

FAQ

What does WeaveBench measure?

It measures long-horizon computer-use tasks where the agent must combine GUI observations/actions with CLI or code operations in one trajectory. The benchmark deliberately admits tasks that cannot be cleanly solved through one channel alone.

Why is WeaveBench harder than GUI-only benchmarks?

The interface ablation gives the clearest answer. Claude Opus 4.7 scores 1.8% GUI-only and 3.5% CLI-only, but 35.1% with the hybrid tool pool. The gain comes from cross-channel orchestration, not just a stronger model.

How does WeaveBench trajectory-aware judging change the result?

It audits evidence across deliverables, files, screenshots, logs, and action traces. For GPT-5.5, this reduces PassRate from 53.5% under outcome-only grading to 33.3%, catching shortcut behavior that final artifacts miss.

One line: WeaveBench is useful because it makes computer-use agents prove cross-interface work under audit, not because it adds another UI leaderboard. Read the original paper on arXiv.