AI Agents

LLM-driven systems that plan, act, use tools, and carry skills across tasks.

AI agents wrap a language model in a loop of planning, tool use, memory, and action, turning a one-shot responder into a system that can pursue a goal over many steps. The research that matters is less about any single model and more about how agents reason, call tools, recover from errors, and carry reusable skills between tasks.

This topic tracks the shift from clever prompting to durable infrastructure: ReAct interleaved reasoning with actions, Toolformer taught models to call APIs, and skill-packaging systems like COLLEAGUE.SKILL turn expertise into portable, correctable artifacts. The hard open questions are reliability, evaluation, safety bounds, and how to author and maintain skills at scale.
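The loop of planning, tool use, memory, and action described above can be sketched in a few lines. This is a minimal illustration, not any particular paper's method: `run_agent`, `scripted_policy`, and the `add` tool are hypothetical names, and the policy is a hard-coded stand-in for a language-model call.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for this sketch: an action is (tool_name, argument),
# and the special tool name "finish" ends the loop with the final answer.
Action = Tuple[str, str]
Policy = Callable[[str, List[str]], Action]


def run_agent(goal: str, policy: Policy,
              tools: Dict[str, Callable[[str], str]],
              max_steps: int = 8) -> Tuple[str, List[str]]:
    """Plan, act, observe, and remember until the policy finishes or the budget runs out."""
    memory: List[str] = []
    for _ in range(max_steps):
        tool_name, arg = policy(goal, memory)  # planning step
        if tool_name == "finish":
            return arg, memory
        tool = tools.get(tool_name)
        if tool is None:
            # Errors become observations so the policy can recover next step.
            observation = f"error: unknown tool {tool_name!r}"
        else:
            try:
                observation = tool(arg)
            except Exception as exc:
                observation = f"error: {exc}"
        memory.append(f"{tool_name}({arg}) -> {observation}")
    return "gave up: step budget exhausted", memory


def scripted_policy(goal: str, memory: List[str]) -> Action:
    """Stand-in for an LLM call: use the add tool once, then report its result."""
    if not memory:
        return ("add", "2 3")
    return ("finish", memory[-1].split(" -> ")[1])


def add(arg: str) -> str:
    a, b = arg.split()
    return str(int(a) + int(b))


answer, trace = run_agent("add 2 and 3", scripted_policy, {"add": add})
```

Feeding tool errors back as observations, rather than raising, is one simple way to give the policy a chance to recover; real systems add budgets, validation, and safety checks around the same skeleton.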

Foundational papers

Agent Memory · UC Berkeley

MemGPT: Treating the LLM Context Window Like an Operating System

MemGPT borrows OS virtual memory — it lets the LLM page data in and out of its own context with function calls, lifting deep memory retrieval to 93.4% with GPT-4 vs 35.3% for recursive summarization.
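As a rough illustration of paging data in and out of a bounded context, the sketch below keeps a small in-context window and evicts older items to an external archive. The class `PagedContext`, its method names, and the substring search are assumptions made for this example and are not MemGPT's actual interface or retrieval method.

```python
from collections import deque
from typing import Deque, List


class PagedContext:
    """Toy two-tier memory: a bounded in-context window plus an external archive."""

    def __init__(self, max_items: int) -> None:
        self.max_items = max_items
        self.window: Deque[str] = deque()
        self.archive: List[str] = []

    def append(self, message: str) -> None:
        """Add a message; evict the oldest ones to the archive when over budget."""
        self.window.append(message)
        while len(self.window) > self.max_items:
            self.archive.append(self.window.popleft())

    def page_in(self, query: str, limit: int = 1) -> List[str]:
        """A function the model could call: copy matching archived items back into the window."""
        hits = [m for m in self.archive if query.lower() in m.lower()][:limit]
        for hit in hits:
            self.append(f"[recalled] {hit}")
        return hits


ctx = PagedContext(max_items=2)
for message in ["user likes tea", "weather is cold", "meeting at noon"]:
    ctx.append(message)
hits = ctx.page_in("tea")
```

After the call, the recalled fact is back in the window and the oldest remaining item has been paged out to make room.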

Long Context · University of Illinois Urbana-Champaign

From Context to Skills: Ctx2Skill Self-Evolves Context Learning

Ctx2Skill is a self-play framework that discovers natural-language skills from a long context with no human labels or external rewards, lifting GPT-4.1 from 11.1% to 16.5% and GPT-5.1 from 21.2% to 25.8% on CL-bench.

AI Agents · University of Illinois Urbana-Champaign

Eywa: Letting LLM Agents Call Scientific Foundation Models

Eywa lets an LLM agent invoke domain models like Chronos and TabPFN through a learned interface instead of serializing data into text. On EywaBench it lifts utility from 0.6154 to 0.6558 while cutting token use by about 30%.

AI Agents · University of Waterloo

Direct Corpus Interaction: Letting Agents grep Instead of a Retriever

Direct Corpus Interaction (DCI) lets a search agent grep the raw corpus instead of calling a retriever. On BrowseComp-Plus it lifts accuracy from 69.0% to 80.0% while cutting cost by 29.4%.
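To make the idea of grepping a corpus concrete, here is a minimal sketch of a grep-style tool an agent could call. The function `grep_corpus`, the in-memory dictionary corpus, and the hit limit are assumptions for illustration; the paper's actual tooling may differ.

```python
import re
from typing import Dict, List, Tuple


def grep_corpus(corpus: Dict[str, str], pattern: str,
                max_hits: int = 5) -> List[Tuple[str, int, str]]:
    """Return (doc_id, line_number, line) for lines matching a case-insensitive regex."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits: List[Tuple[str, int, str]] = []
    for doc_id, text in corpus.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((doc_id, line_no, line.strip()))
                if len(hits) >= max_hits:
                    return hits  # cap output so results fit in the agent's context
    return hits


# Hypothetical two-document corpus.
corpus = {
    "a.txt": "Paris is the capital of France.\nBerlin is in Germany.",
    "b.txt": "The Seine flows through Paris.",
}
hits = grep_corpus(corpus, r"\bparis\b")
```

Unlike a retriever that returns ranked passages, this kind of tool returns exact matches with their locations, so the agent decides which pattern to try next.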

Recent papers

Multimodal Models · Shanghai AI Laboratory

InternVideo3: An 8B Video Agent with Contextual Reasoning

InternVideo3 is an 8B video model that scores 73.8 on Video-MME single-pass, then adds +2.7 from an agentic reasoning loop on top. M2LA attention holds 768K tokens on one H200 where the base OOMs at 512K.

Robotics · UC Berkeley

Playful Agentic Robot Learning: RATs Learn Skills Before the Task

RATs lets a robot play before it is told what to do: agent teams propose their own goals, write code-as-policy, and bank wins as reusable skills. The library lifts LIBERO-PRO success from 23.2% to 43.8%.

Code Generation · Renmin University of China

DeNovoSWE: 4,818 Auto-Built Repos to Train Whole-Repo Code Agents

DeNovoSWE auto-constructs 4,818 verifiable whole-repository generation tasks. Fine-tuning Qwen3-30B-A3B on them lifts BeyondSWE-Doc2Repo pass rate from 0.058 to 0.472.

AI Agents · The Chinese University of Hong Kong

Orchestra-o1: Omnimodal Agent Orchestration

Orchestra-o1 orchestrates text, image, audio, and video sub-agents and hits 72.8% on OmniGAIA with a GPT-5 brain (+10.3 over Gemini-3-Pro). Its trained 8B orchestrator reaches 30.0%, best among open omnimodal agents.

AI Agents · Tencent

From Chatbot to Digital Colleague: A Survey of Persistent Autonomous AI

A Tencent YouTu Lab survey maps the chatbot-to-agent shift on two axes: cognitive core (Chatbot then Thinking LLM) and task execution (Agent then Workspace plus Skill), arguing persistent state is the real leap.

AI Agents · City University of Hong Kong

RHO: Retrospective Harness Optimization via Self-Preference

RHO tunes an LLM agent harness from past unlabeled trajectories using self-consistency and pairwise self-preference, lifting SWE-Bench Pro from 59% to 78% in one round with no external grading.

AI Agents · Ant Group

SearchSwarm: Delegation Intelligence for Deep Research

SearchSwarm fine-tunes Tongyi DeepResearch-30B-A3B on harness-generated delegation trajectories, lifting BrowseComp from 43.4 to 68.1 and topping every 30B-A3B model on four deep-research benchmarks.

AI Agents · CASIA

Agentic Environment Engineering for LLMs: A Survey of the Field

A CASIA survey maps agentic environments for LLM agents along eight attribute axes and eight domains, unifying synthesis, evaluation, and co-evolution. Sharpest finding: environments barely fit multi-agent settings.

Reinforcement Learning · Alibaba Qwen Team

APPO: Agentic Procedural Policy Optimization for RL Agents

APPO branches RL rollouts at high-uncertainty, high-influence tokens instead of tool-call boundaries, lifting Qwen2.5-7B by 3.9 points over ARPO across 13 math, multi-hop, and deep-search benchmarks.

Agent Memory · National University of Singapore

MRAgent: Graph Memory That Reconstructs Instead of Retrieves

MRAgent gives LLM agents a Cue-Tag-Content memory graph and lets the model reason while it traverses it, lifting LoCoMo LLM-Judge from 68.3 to 84.2 while cutting tokens to 118k per sample.

AI Agents · University of Science and Technology of China

Role-Agent: One LLM Plays Both Agent and Its Own Environment

Role-Agent makes a single LLM act as agent and environment at once, generating its own process reward and curriculum. It beats GiGPO by 4.2% on ALFWorld and 6.9% on WebShop with Qwen2.5-1.5B.

AI Agents · Tsinghua University

EurekAgent: Environment Engineering for AI Science

EurekAgent argues the bottleneck is agent environment design, reporting 2.635999 on 26-circle packing, 2005.03 µs on TriMul, and 85.71% any-medal on a seven-task MLE-Bench subset.

AI Agents · Renmin University of China

FORT-Searcher: Training Search Agents Without Shortcuts

FORT-Searcher trains a 3B-active search agent on shortcut-resistant tasks, scoring 66.2 overall among comparable open agents and delaying answer hit time from 18.7 to 46.9 versus REDSearcher data.

AI Agents · Zhejiang University

WeaveBench: Hybrid Computer-Use Agents Still Fail

WeaveBench tests 114 real hybrid GUI plus CLI tasks; the best frontier pairing reaches only 41.2% PassRate, and final-only grading overstates GPT-5.5 by 20.2 points.

Agent Memory · National University of Singapore

EvoArena: Why Agent Memory Must Track Environment Changes

EvoArena turns static agent tasks into evolving chains and finds current agents average only 39.6% accuracy; EvoMem adds patch memory and improves chain-level accuracy by 3.7 points.

AI Agents · Google DeepMind

From AGI to ASI: DeepMind's Map of Superintelligence Pathways

Google DeepMind's report lays out four non-exclusive paths from AGI to ASI and treats each bottleneck, from data walls to regulation, as an open research question.

AI Agents · NVIDIA

SpatialClaw: Why VLM Spatial Agents Need a Python Workspace

SpatialClaw replaces rigid tool calls with a persistent Python kernel and reaches 59.9% average accuracy across 20 spatial reasoning benchmarks, +11.2 points over the recent spatial-agent baseline.

AI Agents · Renmin University of China

Arbor: Autonomous Research With Hypothesis Trees

Arbor stores research attempts in a persistent hypothesis tree, then admits changes only through held-out evaluation. It reports best held-out results on six AO tasks and 86.36% Any Medal on MLE-Bench Lite.

AI Agents · TokenRhythm Technologies

Claw-SWE-Bench: Why Coding Agent Harnesses Matter

Claw-SWE-Bench evaluates OpenClaw-style coding-agent harnesses on 350 GitHub issue tasks. OpenClaw jumps from 19.1% to 73.4% Pass@1 with a full adapter.

AI Agents · Independent Researcher

AdaPlanBench: Testing Adaptive Planning in LLM Agents

AdaPlanBench turns adaptive planning under constraints into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · UC Berkeley

Agents' Last Exam: Why AI Agents Still Fail at Work

Agents' Last Exam tests AI agents on 1,490 expert-built professional tasks across 55 digital industries; the hardest tier averages only 2.6% full pass.

AI Agents · Independent Researcher

ArcANE: Measuring When Role-Playing Agents Break Character

ArcANE turns the reliability of role-playing language agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Video Generation · Nanjing University

CoVEBench: Can Video Editors Follow Complex Instructions?

CoVEBench turns complex instruction following for video editing into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Nanjing University

DRIFT: Pinpointing Where Deep-Research Agents Go Wrong

TELBench asks models to find the span that broke a 12-step research trajectory. DRIFT audits claims against evidence, lifting macro-F1 to 54.91% with Claude-Sonnet-4.6, up to 30 points over raw inspection.

AI Agents · University of Illinois Urbana-Champaign

Harness-1: Move Search-Agent Bookkeeping Out of the Policy

Harness-1 is a 20B RL search agent that hands working memory to the environment, hitting 0.730 average curated recall and beating the next open subagent by +11.4 points.

AI Agents · Independent Researcher

K-BrowseComp: Korean Web-Browsing Agent Benchmark

K-BrowseComp turns Korean-context web browsing by agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Shanghai Jiao Tong University

LatentSkill: Bake Agent Skills Into LoRA Weights, Not the Prompt

A hypernetwork compiles a textual skill into a LoRA adapter in one forward pass. On ALFWorld, LatentSkill lifts success by 21.4 points (seen) with 64.1% fewer prefill tokens.

AI Agents · Independent Researcher

When Masking Stale Observations Helps Search Agents

This paper turns context management for search agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Lehigh University

OpenSkill: Self-Evolving LLM Agents With No Task Supervision

OpenSkill lets agents build skills and their own verifiers from the open web, hitting 43.6% on SkillsBench (+8.9 over the best baseline) with zero target-task answers.

AI Agents · Shanghai AI Laboratory

ResearchClawBench: Testing Autonomous Research Agents

ResearchClawBench turns end-to-end scientific research by agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Reinforcement Learning · Tsinghua University

CHERRL: A Controlled Sandbox for Reward Hacking in Rubric RL

CHERRL injects four known judge biases to reliably reproduce reward hacking in rubric RL; an agent reading only training logs pinned the onset with 11-step total interval error and missed none of six runs.

AI Agents · Xiamen University

SAAS: Teaching Search Agents When to Stop Searching

SAAS uses self-aware RL to cut a Qwen2.5-7B search agent's average queries from 2.19 to 0.97 per question, while keeping accuracy near the best baseline (48.7% vs 49.8%).

Reinforcement Learning · University of Edinburgh

SCOPE: Self-Play RL That Trains LLMs on Open-Ended Tasks

SCOPE co-evolves a task-writing Challenger and a retrieval Solver, judged by a frozen copy of the base model, lifting eight open-ended benchmarks by up to +10.4 points with zero curated prompts.

AI Agents · Ant Group

SkillAdaptor: How LLM Agents Rewrite Their Own Skills

SkillAdaptor edits an agent's skill library from failed trajectories without touching model weights, lifting WebShop score +2.3 and PinchBench +1.5 over the frozen backbone.

AI Agents · Independent Researcher

SoCRATES: Evaluating Proactive LLM Mediation

SoCRATES turns proactive mediation by LLM agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Independent Researcher

SpatialWorld: Interactive Spatial Reasoning for Agents

SpatialWorld turns interactive spatial reasoning into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · HKUST

StreamMA: Streaming Beats Waiting in Multi-Agent Reasoning

StreamMA pipes each reasoning step to the next agent the moment it is written, not after the full chain. Across 8 benchmarks it gains +7.3 pp on average (max +22.4 pp on HMMT 2026) and runs up to 26.9x faster.

AI Agents · Shanghai Jiao Tong University

SWE-Explore: Can Coding Agents Find the Right Code?

SWE-Explore isolates the repo-exploration stage of coding agents over 848 issues. Agentic explorers crush BM25 (HitFile 0.65 vs 0.08), but line-level recall stalls at 0.15-0.20, and that gap is what limits repairs.

AI Agents · Independent Researcher

TASTE: Harder Agent Benchmarks from Tool Sequences

TASTE turns tool-use benchmark generation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Independent Researcher

TIDE: Proactive Multi-Problem Discovery with Templates

TIDE turns proactive problem discovery into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Independent Researcher

ToolMaze: When LLM Agents Must Replan After Tool Failures

ToolMaze turns dynamic replanning after tool failures into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Multimodal Models · The Chinese University of Hong Kong

X-Stream: Why MLLMs Score ~50% on Multi-Stream Video

X-Stream is the first benchmark for watching several live video streams at once. The best model, Gemini 3 Pro, hits 49.6% versus a 91.84% human baseline, and proactive ability collapses below 21%.

AI Agents · Shanghai AI Laboratory

AgentDoG 1.5: A Lightweight Guardrail for AI Agent Safety

AgentDoG 1.5 trains 0.8B-8B agent-safety guard models on only ~1k samples, hits 92.2% accuracy on R-Judge with the 4B variant, rivals GPT-5.4, and cuts agentic-RL deployment overhead by two orders of magnitude.

AI Agents · Shanghai Jiao Tong University

ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

ARIS is an open-source autonomous-research harness pairing a Claude-family executor with a GPT-family reviewer to attack the failure it calls 'plausible unsupported success', with 65+ skills and a three-stage audit.

AI Agents · UNC-Chapel Hill

AutoResearchClaw: An AI Research Agent That Beats AI Scientist v2

AutoResearchClaw is a 23-stage multi-agent system for autonomous ML research. It scores 0.648 vs AI Scientist v2's 0.419 on its 25-topic ARC-Bench, and rises to 7.27/10 quality with a human in the loop.

AI Agents · Shanghai AI Laboratory

Pi-Bench: Can AI Assistants Anticipate What You Did Not Say?

Pi-Bench scores agents on proactivity, not just task completion, across 100 long-horizon tasks. The best model, GPT-5.4, hits only 67.0% proactivity, and removing prior sessions drops it 9.5 points.

AI Agents · University of Illinois Urbana-Champaign

Code as Agent Harness: Reframing Code as the Runtime of AI Agents

This survey reframes code not as a thing agents generate but as the executable substrate they run on, mapping 40-plus systems across three layers (interface, mechanisms, multi-agent scaling) plus seven open problems.

AI Agents · Shanghai AI Laboratory

COLLEAGUE.SKILL: Turning One Person's Expertise Into a Portable AI Skill

COLLEAGUE.SKILL distills one person's work traces into a versioned skill package with two tracks, capability and bounded behavior, that any agent can install, correct, and roll back. The open repo reports ~18.5k stars.

Multimodal Models · University of Illinois Urbana-Champaign

Crafter: A Multi-Agent Harness for Editable Scientific Figures

Crafter wraps an image model in five cooperating agents and scores 50.34 on PaperBanana-Bench vs 11.13 for the raw backbone; then CraftEditor turns the raster output into editable SVG you can actually fix.

World Models · NVIDIA

Gamma-World: A Multi-Agent World Model That Scales Past Two Players

Gamma-World is NVIDIA's video world model for multiplayer simulation that runs at 24 FPS and generalizes from two to four players with no retraining, cutting Solaris's FVD roughly in half.

AI Agents · MemTensor

MemPrivacy: Private Edge-Cloud Agent Memory via Reversible Placeholders

MemPrivacy swaps sensitive spans for type-aware placeholders on-device, processes memory in the cloud over them, then restores them locally — utility loss stays within 1.6% and 0.6B-4B models beat GPT-5.2 at detection.
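The mask-then-restore flow can be sketched as follows. This is a simplified illustration under stated assumptions: the two regexes stand in for the trained detection models the summary mentions, and `mask`, `restore`, and the `<TYPE_n>` placeholder format are hypothetical names chosen for this example.

```python
import re
from typing import Dict, Tuple

# Assumed detectors: simple regexes standing in for a learned sensitive-span detector.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def mask(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace sensitive spans with typed placeholders; keep the mapping on-device."""
    mapping: Dict[str, str] = {}
    for label, regex in PATTERNS.items():
        def substitute(match: re.Match, label: str = label) -> str:
            placeholder = f"<{label}_{len(mapping) + 1}>"
            mapping[placeholder] = match.group(0)
            return placeholder
        text = regex.sub(substitute, text)
    return text, mapping


def restore(text: str, mapping: Dict[str, str]) -> str:
    """Swap placeholders back to the original values locally."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


masked, mapping = mask("Mail ana@example.com or call +1 555-010-9999 today")
# A cloud service would only ever see placeholder text such as this reply.
cloud_reply = "Stored contact: <EMAIL_1>, <PHONE_2>"
restored = restore(cloud_reply, mapping)
```

Because each placeholder carries its type, the cloud side can still reason about what kind of value is present without seeing the value itself.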

AI Agents · Shanghai Jiao Tong University

MMSkills: Multimodal Skill Packages for General Visual Agents

MMSkills packages textual procedures, runtime state cards, and keyframes into reusable skills for visual agents, lifting Qwen3-VL-235B from 21.34% to 39.17% on OSWorld and a small 8B model from 10.78% to 25.40%.

Multimodal Models · Sea AI Lab

OpenSearch-VL: An Open Recipe for Multimodal Search Agents

OpenSearch-VL open-sources data, code, and weights for vision-language search agents that call real search, OCR, and image tools — its 30B-A3B model lifts seven benchmarks by 13.8 points on average over Qwen3-VL.

AI Agents · Zhejiang University

Self-Distilled Agentic RL: A Privileged Teacher Steering GRPO Per Token

SDAR adds a gated, token-level self-distillation signal from a skill-augmented teacher on top of GRPO, lifting multi-turn agents by up to +10.2 points on WebShop and +9.4 on ALFWorld for small Qwen models.

AI Agents · University of Science and Technology of China

Skill1: One RL Policy That Selects, Uses, and Distills Agent Skills

Skill1 trains a single Qwen2.5-7B policy to retrieve, apply, and create reusable skills under one task-outcome reward — reaching 97.5% on ALFWorld, 6.5 points over the strongest RL-only baseline.

AI Agents · Microsoft Research

SkillOpt: Training a Frozen Agent's Skill Text Like a Model

SkillOpt trains a single skill document for a frozen LLM agent with bounded add/delete/replace edits and a held-out gate, lifting GPT-5.5 by +23.5 points in direct chat across six benchmarks.
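One way to picture bounded edits with a held-out gate is the sketch below. It is not the paper's algorithm: `apply_edit`, `gated_update`, the edit tuple format, the `max_edits` bound, and the toy scoring function are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical edit format: (op, line_index, text) with op in {"add", "delete", "replace"}.
Edit = Tuple[str, int, str]


def apply_edit(doc: List[str], edit: Edit) -> List[str]:
    """Return a copy of the skill document with one edit applied."""
    op, index, text = edit
    new_doc = list(doc)
    if op == "add":
        new_doc.insert(index, text)
    elif op == "delete":
        del new_doc[index]
    elif op == "replace":
        new_doc[index] = text
    else:
        raise ValueError(f"unknown op: {op}")
    return new_doc


def gated_update(doc: List[str], edits: List[Edit],
                 held_out_score: Callable[[List[str]], float],
                 max_edits: int = 2) -> List[str]:
    """Try at most max_edits proposals; keep one only if the held-out score improves."""
    best, best_score = doc, held_out_score(doc)
    for edit in edits[:max_edits]:
        candidate = apply_edit(best, edit)
        score = held_out_score(candidate)
        if score > best_score:  # the gate: reject anything that does not help
            best, best_score = candidate, score
    return best


def toy_score(doc: List[str]) -> float:
    """Stand-in for evaluating a frozen agent on held-out tasks."""
    return sum("verify" in line for line in doc) - sum("guess" in line for line in doc)


proposals = [("replace", 0, "verify the answer"),
             ("add", 1, "guess again"),
             ("add", 1, "verify twice")]
skill = gated_update(["guess the answer"], proposals, toy_score)
```

In this run the first edit is accepted, the second is rejected by the gate, and the third is never tried because of the edit bound.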

AI Agents · MemTensor

SkillsVote: Governing the Lifecycle of Reusable Agent Skills

SkillsVote treats agent skills as a governed library — profiling a million-scale corpus, recommending skills before a run, and gating updates after. Offline evolution lifts GPT-5.2 on Terminal-Bench 2.0 by up to 7.9 pp.

AI Agents · Peking University

Video2GUI: Mining 12M GUI Agent Trajectories From Internet Videos

Video2GUI turns 500M unlabeled tutorial videos into WildGUI — 12M grounded GUI interaction trajectories across 1,500+ apps and sites — and pretraining Qwen2.5-VL and Mimo-VL on it lifts GUI benchmarks by 5-20%.
