Agent Memory · National University of Singapore
EvoArena turns static agent tasks into evolving chains and finds current agents average only 39.6% accuracy; EvoMem adds patch memory and improves chain-level accuracy by 3.7 points.
AI Agents · Google DeepMind
Google DeepMind's report lays out four non-exclusive paths from AGI to ASI and treats each bottleneck, from data walls to regulation, as an open research question.
AI Agents · NVIDIA
SpatialClaw replaces rigid tool calls with a persistent Python kernel and reaches 59.9% average accuracy across 20 spatial reasoning benchmarks, +11.2 points over the recent spatial-agent baseline.
AI Agents · Renmin University of China
Arbor stores research attempts in a persistent hypothesis tree, then admits changes only through held-out evaluation. It reports best held-out results on six AO tasks and 86.36% Any Medal on MLE-Bench Lite.
AI Agents · TokenRhythm Technologies
Claw-SWE-Bench evaluates OpenClaw-style coding-agent harnesses on 350 GitHub issue tasks. OpenClaw jumps from 19.1% to 73.4% Pass@1 with a full adapter.
AI Agents · Independent Researcher
AdaPlanBench: Testing Adaptive Planning in LLM Agents turns adaptive planning under constraints into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
AI Agents · UC Berkeley
Agents' Last Exam tests AI agents on 1,490 expert-built professional tasks across 55 digital industries; the hardest tier averages only 2.6% full pass.
AI Agents · Independent Researcher
ArcANE: Measuring When Role-Playing Agents Break Character turns role-playing language agent reliability into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Video Generation · Nanjing University
CoVEBench: Can Video Editors Follow Complex Instructions? turns complex instruction following for video editing into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
AI Agents · Nanjing University
TELBench asks models to find the span that broke a 12-step research trajectory. DRIFT audits claims against evidence, lifting macro-F1 to 54.91% with Claude-Sonnet-4.6, up to 30 points over raw inspection.
AI Agents · University of Illinois Urbana-Champaign
Harness-1 is a 20B RL search agent that hands working memory to the environment, hitting 0.730 average curated recall and beating the next open subagent by +11.4 points.
AI Agents · Independent Researcher
K-BrowseComp: Korean Web-Browsing Agent Benchmark turns Korean-context web browsing agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
AI Agents · Shanghai Jiao Tong University
A hypernetwork compiles a textual skill into a LoRA adapter in one forward pass. On ALFWorld, LatentSkill lifts success by 21.4 points (seen) with 64.1% fewer prefill tokens.
AI Agents · Independent Researcher
When Masking Stale Observations Helps Search Agents turns context management for search agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
AI Agents · Lehigh University
OpenSkill lets agents build skills and their own verifiers from the open web, hitting 43.6% on SkillsBench (+8.9 over the best baseline) with zero target-task answers.
AI Agents · Shanghai AI Laboratory
ResearchClawBench: Testing Autonomous Research Agents turns end-to-end scientific research agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Reinforcement Learning · Tsinghua University
CHERRL injects four known judge biases to reliably reproduce reward hacking in rubric RL; an agent reading only training logs pinned the onset with 11-step total interval error and missed none of six runs.
AI Agents · Xiamen University
SAAS uses self-aware RL to cut a Qwen2.5-7B search agent's average queries from 2.19 to 0.97 per question, while keeping accuracy near the best baseline (48.7% vs 49.8%).
Reinforcement Learning · University of Edinburgh
SCOPE co-evolves a task-writing Challenger and a retrieval Solver, judged by a frozen copy of the base model, lifting eight open-ended benchmarks by up to +10.4 points with zero curated prompts.
AI Agents · Ant Group
SkillAdaptor edits an agent's skill library from failed trajectories without touching model weights, lifting WebShop score +2.3 and PinchBench +1.5 over the frozen backbone.
AI Agents · Independent Researcher
SoCRATES: Evaluating Proactive LLM Mediation turns proactive mediation agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
AI Agents · Independent Researcher
SpatialWorld: Interactive Spatial Reasoning for Agents turns interactive spatial reasoning into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
AI Agents · HKUST
StreamMA pipes each reasoning step to the next agent the moment it is written, not after the full chain. Across 8 benchmarks it gains +7.3 pp on average (max +22.4 pp on HMMT 2026) and runs up to 26.9x faster.
AI Agents · Shanghai Jiao Tong University
SWE-Explore isolates the repo-exploration stage of coding agents over 848 issues. Agentic explorers crush BM25 (HitFile 0.65 vs 0.08), but line-level recall stalls at 0.15-0.20, and that gap is what limits repairs.
AI Agents · Independent Researcher
TASTE: Harder Agent Benchmarks from Tool Sequences turns tool-use benchmark generation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
AI Agents · Independent Researcher
TIDE: Proactive Multi-Problem Discovery with Templates turns proactive problem discovery into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
AI Agents · Independent Researcher
ToolMaze: When LLM Agents Must Replan After Tool Failures turns dynamic replanning after tool failures into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Multimodal Models · The Chinese University of Hong Kong
X-Stream is the first benchmark for watching several live video streams at once. The best model, Gemini 3 Pro, hits 49.6% versus a 91.84% human baseline, and proactive ability collapses below 21%.
Agent Memory · UC Berkeley
MemGPT borrows OS virtual memory — it lets the LLM page data in and out of its own context with function calls, lifting deep memory retrieval to 93.4% with GPT-4 vs 35.3% for recursive summarization.
AI Agents · Shanghai AI Laboratory
AgentDoG 1.5 trains 0.8B-8B agent-safety guard models on only ~1k samples, hits 92.2% accuracy on R-Judge with the 4B variant, rivals GPT-5.4, and cuts agentic-RL deployment overhead by two orders of magnitude.
AI Agents · Shanghai Jiao Tong University
ARIS is an open-source autonomous-research harness pairing a Claude-family executor with a GPT-family reviewer to attack the failure it calls 'plausible unsupported success', with 65+ skills and a three-stage audit.
AI Agents · UNC-Chapel Hill
AutoResearchClaw is a 23-stage multi-agent system for autonomous ML research. It scores 0.648 vs AI Scientist v2's 0.419 on its 25-topic ARC-Bench, and rises to 7.27/10 quality with a human in the loop.
AI Agents · Shanghai AI Laboratory
Pi-Bench scores agents on proactivity, not just task completion, across 100 long-horizon tasks. The best model, GPT-5.4, hits only 67.0% proactivity, and removing prior sessions drops it 9.5 points.
AI Agents · University of Waterloo
Direct Corpus Interaction (DCI) lets a search agent grep the raw corpus instead of calling a retriever. On BrowseComp-Plus it lifts accuracy from 69.0% to 80.0% while cutting cost 29.4%.
AI Agents · University of Illinois Urbana-Champaign
This survey reframes code not as a thing agents generate but as the executable substrate they run on, mapping 40-plus systems across three layers — interface, mechanisms, multi-agent scaling — plus seven open problems.
AI Agents · Shanghai AI Laboratory
COLLEAGUE.SKILL distills one person's work traces into a versioned skill package with two tracks — capability and bounded behavior — that any agent can install, correct, and roll back. The open repo reports ~18.5k stars.
Multimodal Models · University of Illinois Urbana-Champaign
Crafter wraps an image model in five cooperating agents and scores 50.34 on PaperBanana-Bench vs 11.13 for the raw backbone — then CraftEditor turns the raster output into editable SVG you can actually fix.
Long Context · University of Illinois Urbana-Champaign
Ctx2Skill is a self-play framework that discovers natural-language skills from a long context with no human labels or external rewards, lifting GPT-4.1 from 11.1% to 16.5% and GPT-5.1 from 21.2% to 25.8% on CL-bench.
World Models · NVIDIA
Gamma-World is NVIDIA's video world model for multiplayer simulation that runs at 24 FPS and generalizes from two to four players with no retraining, cutting Solaris's FVD roughly in half.
AI Agents · University of Illinois Urbana-Champaign
Eywa lets an LLM agent invoke domain models like Chronos and TabPFN through a learned interface instead of serializing data into text. On EywaBench it lifts utility from 0.6154 to 0.6558 while cutting ~30% tokens.
AI Agents · MemTensor
MemPrivacy swaps sensitive spans for type-aware placeholders on-device, processes memory in the cloud over them, then restores them locally — utility loss stays within 1.6% and 0.6B-4B models beat GPT-5.2 at detection.
AI Agents · Shanghai Jiao Tong University
MMSkills packages textual procedures, runtime state cards, and keyframes into reusable skills for visual agents, lifting Qwen3-VL-235B from 21.34% to 39.17% on OSWorld and a small 8B model from 10.78% to 25.40%.
Multimodal Models · Sea AI Lab
OpenSearch-VL open-sources data, code, and weights for vision-language search agents that call real search, OCR, and image tools — its 30B-A3B model lifts seven benchmarks by 13.8 points on average over Qwen3-VL.
AI Agents · Zhejiang University
SDAR adds a gated, token-level self-distillation signal from a skill-augmented teacher on top of GRPO, lifting multi-turn agents by up to +10.2 points on WebShop and +9.4 on ALFWorld for small Qwen models.
AI Agents · University of Science and Technology of China
Skill1 trains a single Qwen2.5-7B policy to retrieve, apply, and create reusable skills under one task-outcome reward — reaching 97.5% on ALFWorld, 6.5 points over the strongest RL-only baseline.
AI Agents · Microsoft Research
SkillOpt trains a single skill document for a frozen LLM agent with bounded add/delete/replace edits and a held-out gate, lifting GPT-5.5 by +23.5 points in direct chat across six benchmarks.
AI Agents · MemTensor
SkillsVote treats agent skills as a governed library — profiling a million-scale corpus, recommending skills before a run, and gating updates after. Offline evolution lifts GPT-5.2 on Terminal-Bench 2.0 by up to 7.9 pp.
AI Agents · Peking University
Video2GUI turns 500M unlabeled tutorial videos into WildGUI — 12M grounded GUI interaction trajectories across 1,500+ apps and sites — and pretraining Qwen2.5-VL and Mimo-VL on it lifts GUI benchmarks by 5-20%.