ReAct: How Interleaving Reasoning and Acting Built the LLM Agent

ReAct interleaves a model's reasoning traces with task actions like search and API calls, cutting chain-of-thought hallucination and beating RL agents on ALFWorld by 34% absolute with one or two examples.

Quick answer

ReAct prompts a language model to alternate between writing a reasoning trace and taking an action — a search query, an API call, a step in an environment — so each thought is grounded by a real observation before the next thought. On the ALFWorld embodied task it beats imitation and reinforcement learning baselines by 34% absolute success rate, and on WebShop by 10% absolute, using only one or two in-context examples. On HotpotQA and Fever it interacts with a simple Wikipedia API to cut the hallucination and error propagation that plague chain-of-thought reasoning on its own.

The problem ReAct attacks

Two LLM abilities had been studied in isolation. Chain-of-thought prompting lets a model reason out loud, but the reasoning is a closed monologue: it never checks anything against the outside world, so a single wrong intermediate fact snowballs into a confident wrong answer. Action-generation methods let a model take steps in an environment, but without explicit reasoning the model loses track of the high-level plan and cannot recover from surprises. The bet behind ReAct is that these two failure modes are each other’s cure.

Interleaving thought and action

The mechanism is almost embarrassingly simple, which is part of why it spread. The model’s action space is augmented with free-form “thoughts” alongside the usual task actions. A trajectory looks like Thought → Action → Observation → Thought → Action → …, where the observation is whatever the environment or tool returns; a thought changes nothing in the environment and only updates the context the model conditions on next. Thoughts do work that actions cannot: they decompose the goal, track what is still unknown, reformulate a failed search, and decide when to stop. Actions do work that thoughts cannot: they pull a real fact from Wikipedia, click a product, or move in a room. Crucially, the core method involves no training: ReAct is a prompting strategy demonstrated with one or two hand-written examples, so a single capable model (PaLM-540B in the paper, later GPT-class models) picks up the pattern few-shot, in context, with no weight updates.
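
The Thought → Action → Observation loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's code: `scripted_model` is a hard-coded stand-in for an LLM call, `KNOWLEDGE` is toy data standing in for a Wikipedia tool, and the `Search[...]` / `Finish[...]` syntax is only loosely modeled on the paper's question-answering setup.

```python
import re
from typing import Callable, Optional, Tuple

# Toy stand-in for a Wikipedia-style tool (hypothetical data).
KNOWLEDGE = {
    "Colorado orogeny": "The Colorado orogeny was an episode of mountain building.",
}

def search(query: str) -> str:
    """Return the stored passage, or a not-found message the model can react to."""
    return KNOWLEDGE.get(query, f"Could not find {query}.")

def scripted_model(prompt: str) -> str:
    """Stub LLM: picks the next Thought/Action pair from the transcript so far."""
    if "Observation 1:" not in prompt:
        return ("Thought 1: I need to look up the Colorado orogeny.\n"
                "Action 1: Search[Colorado orogeny]")
    return ("Thought 2: The observation answers the question.\n"
            "Action 2: Finish[mountain building]")

ACTION_RE = re.compile(r"Action \d+: (\w+)\[(.*)\]")

def react(question: str, model: Callable[[str], str],
          max_steps: int = 5) -> Tuple[Optional[str], str]:
    """Run the loop until the model emits Finish[...] or the step budget ends."""
    transcript = f"Question: {question}\n"
    for step in range(1, max_steps + 1):
        output = model(transcript)           # model writes a thought and an action
        transcript += output + "\n"
        match = ACTION_RE.search(output)
        if match is None:                    # no parsable action: stop early
            break
        name, arg = match.groups()
        if name == "Finish":
            return arg, transcript
        observation = search(arg) if name == "Search" else f"Unknown action {name}."
        transcript += f"Observation {step}: {observation}\n"  # ground the next thought
    return None, transcript

answer, trace = react("What was the Colorado orogeny?", scripted_model)
print(trace)
print("Answer:", answer)
```

The key design point is that the observation is appended to the same transcript the model reads next, so every later thought is conditioned on what the tool actually returned.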

The honest read: ReAct is not a new architecture or a new objective. It is a format. Its contribution is showing that the format itself, given a strong enough model, closes a real reliability gap — and that grounding reasoning in observations matters more than making the reasoning longer.

Key results

  • ALFWorld: ReAct beats imitation- and RL-trained agents by 34% absolute success rate, prompted with only one or two examples versus baselines trained on thousands of trajectories.
  • WebShop: 10% absolute success-rate gain over prior imitation and RL methods on this online-shopping benchmark.
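  • HotpotQA / Fever: acting alone (calling a Wikipedia API with no reasoning trace) and reasoning alone (chain-of-thought with no tool access) each fall short; interleaving them is what reduces the hallucination and error propagation that pure chain-of-thought suffers from. The paper's best scores on these two tasks come from combining ReAct with chain-of-thought.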
  • Interpretability: the thought-action-observation trace is human-readable, so you can see exactly why the model reached an answer — and a human can edit a thought mid-trajectory to steer it back on course.
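
The mid-trajectory edit mentioned in the last bullet can be illustrated with a small sketch. Everything here is an assumption made for the example: the `Step` record, the `edit_thought` helper, and the household-task trajectory are hypothetical, and a real system would hand the edited transcript back to the model to continue from.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    kind: str   # "Thought", "Action", or "Observation"
    text: str

def render(trajectory: List[Step]) -> str:
    """Serialize the trajectory into the text the model would be re-prompted with."""
    return "\n".join(f"{s.kind}: {s.text}" for s in trajectory)

def edit_thought(trajectory: List[Step], index: int, new_text: str) -> List[Step]:
    """Replace the thought at `index` and drop every later step, because those
    steps were conditioned on the old thought and are no longer valid."""
    if trajectory[index].kind != "Thought":
        raise ValueError("can only edit Thought steps")
    return trajectory[:index] + [Step("Thought", new_text)]

# Hypothetical trajectory in which the second thought draws a bad conclusion.
trajectory = [
    Step("Thought", "I should look for the mug in the cabinet."),
    Step("Action", "open cabinet 1"),
    Step("Observation", "The cabinet is empty."),
    Step("Thought", "The mug must not exist."),
    Step("Action", "give up"),
]

fixed = edit_thought(
    trajectory, 3, "The cabinet was empty, so I should check the countertop next."
)
print(render(fixed))
```

Truncating after the edited thought is a deliberate choice in this sketch: the model then regenerates the following action from the corrected reasoning rather than keeping an action chosen for the wrong reason.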

Why it matters now

ReAct is the paper most modern LLM agents quietly descend from. The “reason, call a tool, observe, reason again” loop is the spine of essentially every tool-using agent framework — LangChain’s agents, function-calling loops, and retrieval-augmented agents all reimplement this idea. It reframed the question from “can a model think?” to “can a model think and check?”, and the answer — that grounding beats raw reasoning length — is why tool use and retrieval became the default way to make LLMs reliable rather than just scaling chains of thought. If you build agents today, you are building on ReAct whether the word appears in your code or not.

Limits and open questions

ReAct’s gains depend heavily on the base model’s quality and the tool’s reliability: give it a weak Wikipedia search that returns nothing useful and the reasoning has nothing to ground against, so it stalls or hallucinates anyway. The one-or-two-example framing is elegant but brittle: performance is sensitive to how the exemplars are written, and the paper’s headline numbers come from exemplars the authors wrote by hand. The trajectories can also loop, with the model re-searching without converging, and ReAct itself offers no principled stopping or backtracking guarantee beyond what the prompt encourages. Finally, the benchmarks (ALFWorld, WebShop, HotpotQA, Fever) are narrow, single-tool, mostly-text worlds; whether the same loop holds up across many tools, long horizons, and real production stakes is exactly the problem the agent field has been wrestling with since.
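
One common mitigation for the looping problem, which the paper does not prescribe, is an external guard around the agent loop. The sketch below assumes a hypothetical helper, `first_stop_reason`, with made-up default limits; it only inspects the sequence of action strings and reports why a run should be cut off.

```python
from collections import Counter
from typing import Iterable, Optional

def first_stop_reason(actions: Iterable[str], max_steps: int = 8,
                      max_repeats: int = 2) -> Optional[str]:
    """Return why the run should be halted, or None if no limit was hit.

    Two simple checks: a hard step budget, and a cap on how many times the
    exact same action string may be issued.
    """
    seen: Counter = Counter()
    for step, action in enumerate(actions, start=1):
        if step > max_steps:
            return f"step budget of {max_steps} exhausted"
        seen[action] += 1
        if seen[action] > max_repeats:
            return f"action {action!r} repeated more than {max_repeats} times"
    return None

looping_run = ["Search[A]", "Search[B]", "Search[A]", "Search[A]"]
print(first_stop_reason(looping_run))
print(first_stop_reason(["Search[A]", "Finish[answer]"]))
```

A guard like this does not make the agent converge; it only bounds the cost of a run that is not converging, which is why it sits outside the prompt rather than inside it.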

FAQ

What does ReAct actually do differently from chain-of-thought?

Chain-of-thought reasons in a closed loop with no access to the world. ReAct interleaves each reasoning step with an action — a search, an API call, an environment step — so the next thought is conditioned on a real observation. That grounding is what cuts the hallucination and error propagation that pure chain-of-thought suffers from.

How much does ReAct improve over reinforcement learning agents?

On ALFWorld it beats imitation- and RL-trained agents by 34% absolute success rate, and on WebShop by 10% absolute — while using only one or two in-context examples instead of thousands of training trajectories.
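
Because “absolute” is easy to misread, here is a small worked example of the arithmetic. The 40% baseline is a made-up number chosen only for illustration, not a figure from the paper; the only value taken from the text is the 34-point gap.

```python
def absolute_gain(baseline: float, new: float) -> float:
    """Difference in success rate, in percentage points."""
    return new - baseline

def relative_gain(baseline: float, new: float) -> float:
    """Improvement expressed as a percentage of the baseline."""
    return (new - baseline) / baseline * 100

# Hypothetical baseline success rate; 34 points is added on top of it.
baseline = 40.0
new = baseline + 34.0

print(f"absolute gain: {absolute_gain(baseline, new):.0f} percentage points")
print(f"relative gain: {relative_gain(baseline, new):.0f}% over the baseline")
```

The same 34-point absolute gain corresponds to a much larger relative improvement when the baseline is low, which is why the two figures should not be used interchangeably.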

Is ReAct a trained model or a prompting method?

It is a prompting method. ReAct adds free-form reasoning “thoughts” to the action space and demonstrates the pattern with one or two examples; no fine-tuning is required, so any sufficiently capable base model can run it.
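
To make “prompting method” concrete, the sketch below assembles a few-shot prompt from one worked exemplar. The exemplar text, the `INSTRUCTIONS` string, and the `build_prompt` helper are all hypothetical; the paper's actual prompts differ, and the resulting string would be sent to whatever model is in use.

```python
from typing import List

# Hypothetical hand-written exemplar in Thought/Action/Observation format.
EXEMPLAR = (
    "Question: Which planet is known as the Red Planet?\n"
    "Thought 1: I should search for the Red Planet nickname.\n"
    "Action 1: Search[Red Planet]\n"
    "Observation 1: Mars is often called the Red Planet.\n"
    "Thought 2: The observation names Mars, so I can answer.\n"
    "Action 2: Finish[Mars]"
)

INSTRUCTIONS = "Solve the task by alternating Thought, Action, and Observation steps."

def build_prompt(exemplars: List[str], question: str) -> str:
    """Concatenate instructions, worked exemplars, and the new question.

    The prompt ends with an open 'Thought 1:' so the model continues the pattern.
    """
    parts = [INSTRUCTIONS, *exemplars, f"Question: {question}\nThought 1:"]
    return "\n\n".join(parts)

prompt = build_prompt([EXEMPLAR], "Who wrote the novel Dracula?")
print(prompt)
```

Nothing about the model changes between tasks in this setup; swapping the exemplars is the whole adaptation step.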

Why is ReAct considered the foundation of LLM agents?

The reason-act-observe loop ReAct introduced is the structure behind nearly every modern tool-using agent and function-calling framework. It established that grounding reasoning in tool observations, rather than just lengthening the reasoning, is what makes LLMs reliable at multi-step tasks.

One line: let the model think, then check, then think again — grounding reasoning in real observations beats reasoning longer in a vacuum. Read the original paper on arXiv.