AI Agents · Reinforcement Learning · World Models

Agentic Environment Engineering for LLMs: A Survey of the Field

A CASIA survey maps agentic environments for LLM agents along eight attribute axes and eight domains, unifying synthesis, evaluation, and co-evolution. Sharpest finding: environments barely fit multi-agent settings.

Agentic Environment Engineering for LLMs: A Survey of the Field

Quick answer

This CASIA survey is the first attempt to organize agentic environments for LLM agents as one engineering lifecycle: modeling, synthesis, evaluation, and application. The useful part is the taxonomy. It classifies environments along eight attribute axes (such as symbolic vs neural, online vs offline, single-agent vs multi-agent) and across eight domains (GUI, deep research, embodied, game, web/tool, code, domain-specific, cross-domain), then catalogs roughly 145 named environments with their size, modality, and observability in side-by-side tables. If you build or train agents, read it for the map, not for new benchmark numbers. It is a survey, so it reports no experiments of its own.

The strongest claim worth quoting: existing environments are inadequately suited for multi-agent settings, and the field has not yet balanced the engineering reliability of symbolic systems against the open-ended scalability of neural ones. Those two gaps are the real contribution; the rest is structured cataloging.

The eight attributes and eight domains

The modeling section is the spine of the survey. Instead of one flat list, environments are placed on eight paired axes:

  • Symbolic vs neural (hand-built rules and APIs vs LLM-generated worlds)
  • Open-loop vs closed-loop (does the environment react to the agent’s action)
  • Online vs offline (live interaction vs replayed static trajectories)
  • MDP vs POMDP (full vs partial observability of state)
  • Deterministic vs nondeterministic transitions
  • Discrete vs continuous action and state space
  • Unimodal vs multimodal (text only vs text plus images and video)
  • Single-agent vs multi-agent

These axes matter because they predict what an environment can train. A deterministic, discrete, single-agent web task is cheap to build and easy to verify, but it cannot teach negotiation or partial-observability reasoning. The survey then sorts environments into eight domains and tabulates each one with size, supported modality (text, image, video), observability, multi-agent support, continuity, and whether it runs online, with direct GitHub or Hugging Face links. Familiar systems appear in the right cells: WebArena, WebShop, and Mind2Web under web/GUI; SWE-bench and SWE-Gym under code; ALFWorld and ScienceWorld under embodied; Voyager and Crafter under game.

Symbolic synthesis vs neural synthesis

The synthesis section answers a question builders actually face: how do you make more environments without writing each one by hand? The survey splits automated synthesis into two paradigms.

Symbolic synthesis builds environments from explicit structure. Task-driven synthesis wraps existing static tasks behind a standardized interface; real-world-driven synthesis maps real platforms (web, GUI, databases, games) into interactive shells; de novo synthesis composes new tasks from logic, tools, and code. The payoff is verifiability. You can check correctness deterministically, which is what reinforcement-learning training needs.

Neural synthesis uses LLMs or world models to generate scenarios, transitions, and even the reward signal. It scales further and covers messier, open-ended situations, but it inherits the model’s hallucinations and is hard to verify. The survey frames the open problem plainly: symbolic systems are reliable but bounded, neural systems are scalable but unfaithful, and nobody has cleanly combined them. It also lays out four evaluation criteria for any synthesized environment: correctness, diversity, complexity, and fidelity.

Agent-environment co-evolution

The application section is where the survey is most opinionated. It argues environments and agents improve together, and it splits agent evolution into four pathways:

  • Memory-centric: the agent accumulates experience and reuses it (experience evolution).
  • Orchestration-centric: workflows and tool-use patterns improve over runs.
  • Trajectory-centric: offline learning from collected interaction logs.
  • Exploration-centric: online, exploration-driven improvement inside a live environment.

On the other side, environments themselves evolve along three paradigms: neural-driven (the world model generates harder or richer states), difficulty-driven (curriculum that raises task difficulty as the agent improves), and scaling-driven (scenario-level and environment-level scaling to add breadth). The forward-looking part flags three directions: Environment-as-a-Service, multi-agent environments, and neural-symbolic environments that fuse symbolic verifiability with neural generativity.

Key results

This is a survey, so the results are the taxonomy and the field-level claims, not metrics:

  • Scope: four lifecycle stages (modeling, synthesis, evaluation, application) unified under one framework; eight attribute axes plus eight domains.
  • Catalog size: roughly 145 environment entries are tabulated with size, modality, observability, multi-agent support, continuity, and online status, each linked to its GitHub or Hugging Face repo.
  • Synthesis split: two paradigms (symbolic, neural) with three symbolic sub-types (task-driven, real-world-driven, de novo) and four evaluation criteria (correctness, diversity, complexity, fidelity).
  • Co-evolution structure: four agent-evolution pathways plus three environment-evolution paradigms.
  • Sharpest claim: current environments are inadequate for multi-agent settings, and the symbolic-reliability vs neural-scalability tradeoff is unresolved.

Limits and open questions

The survey is a snapshot, and it shows. Most cataloged environments are single-agent, deterministic, and discrete, because those are the ones easy to build and verify, so the taxonomy can overstate how broad the field is. The same selection effect hits the multi-agent column: it is thin not because the survey missed work, but because the field has produced little. That makes the headline “inadequate for multi-agent” partly a finding and partly an artifact of what exists.

A second gap is empirical. The survey asserts that symbolic environments are reliable and neural ones scalable, but it does not measure either side. There is no shared benchmark comparing a symbolic and a neural version of the same task on correctness or transfer, so the central tradeoff stays a qualitative argument. The fidelity criterion in particular is named but not operationalized: how do you score whether a neural-synthesized environment is faithful to the world it imitates?

Finally, a taxonomy is only useful if it predicts something. The strongest test the field could run, and the survey cannot, is whether agents trained in environments high on a given attribute axis transfer to real deployments better than agents trained on the cheap, verifiable end of each axis. Until that exists, the eight axes are a clean filing system rather than a validated theory of what makes an environment train good agents.

FAQ

What does the Agentic Environment Engineering survey actually cover?

It organizes agentic environments for LLM agents into one engineering lifecycle: modeling (eight attribute axes, eight domains), automated synthesis (symbolic and neural), evaluation, and application (agent-environment co-evolution). It catalogs about 145 named environments such as WebArena, SWE-bench, ALFWorld, and Voyager with their attributes and repo links.

What are the eight attributes used to classify agentic environments?

The eight paired axes are symbolic vs neural, open-loop vs closed-loop, online vs offline, MDP vs POMDP, deterministic vs nondeterministic, discrete vs continuous, unimodal vs multimodal, and single-agent vs multi-agent. Each axis predicts what an environment can train; a deterministic single-agent text task is cheap and verifiable but cannot teach partial-observability or negotiation.

How does symbolic synthesis differ from neural synthesis in this survey?

Symbolic synthesis builds environments from explicit structure (wrapping static tasks, mapping real platforms, or composing tasks from logic and code), which makes correctness deterministically checkable. Neural synthesis uses LLMs or world models to generate scenarios and rewards, scaling to open-ended situations but inheriting hallucination and verification problems. The survey calls combining the two an open problem.

Why does the survey say agentic environments are inadequate for multi-agent settings?

Most cataloged environments are single-agent because they are easier to build and verify deterministically. The multi-agent column is thin in the survey’s tables, so it identifies multi-agent environments and Environment-as-a-Service as priority future directions. Read this partly as a real gap and partly as an artifact of what the field has shipped so far.

One line: this CASIA survey is the cleanest current map of agentic environments for LLM agents, strongest as a taxonomy and a gap-finder (multi-agent, neural-symbolic fusion), weakest in that it measures none of the tradeoffs it names. Read the original paper on arXiv.