Toolformer: How a Language Model Teaches Itself to Use Tools

Quick answer

Toolformer is a language model that decides on its own which external API to call, when, and with what arguments, choosing from a calculator, a question-answering system, two search engines, a translation system, and a calendar. The training is self-supervised: it needs nothing more than a handful of demonstrations per API. The trick is a filter: keep a candidate call only if inserting its result lowers the loss on the tokens that follow. The resulting model substantially improves zero-shot performance on downstream tasks, often matching much larger models, while keeping the base model’s core language-modeling ability intact.

The problem: big models still can’t add

Large language models are remarkable at few-shot tasks, yet they fail at things a pocket calculator nails — arithmetic, looking up a current fact, or precise date math. Scaling the model does not reliably fix this; the knowledge is frozen at training time and arithmetic is learned statistically rather than computed. Toolformer’s premise is that the model should not try to be a calculator — it should learn to call one, then read the answer back into its own text. The hard part is doing this without a large hand-labeled dataset of when and how to call each tool.

Self-supervised API calls

The data is generated by the model itself, in three steps. First, sampling: with a few in-context examples, the model proposes candidate API calls at many positions in a plain-text language-modeling corpus, for example inserting a call like [Calculator(400 / 1400)] where a percentage is being computed. Second, execution: each candidate call is run to obtain its result. Third, filtering, the step that makes the whole thing work. Each candidate is scored by whether the call together with its returned result, placed before the following text, reduces the model’s loss on those next tokens compared with the better of two baselines: making no call at all, or making the call without its result. Calls that clear a loss-reduction threshold are kept; the rest are thrown away.
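The filtering step can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the Candidate container, its field names, the threshold value, and the loss numbers are all made up for the example. In practice each loss would come from running the model on the text that follows the candidate position.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One sampled API call scored at one text position (hypothetical container)."""
    call: str                # e.g. "Calculator(400 / 1400)"
    loss_with_result: float  # loss on the following tokens, given call + result
    loss_call_only: float    # loss given the call with its result withheld
    loss_no_call: float      # loss with no API call at all


def loss_reduction(c: Candidate) -> float:
    """How much the result helps relative to the better of the two baselines."""
    baseline = min(c.loss_no_call, c.loss_call_only)
    return baseline - c.loss_with_result


def filter_calls(candidates, threshold):
    """Keep only the candidates whose loss reduction reaches the threshold."""
    return [c for c in candidates if loss_reduction(c) >= threshold]


# Made-up loss values, chosen only to show one kept and one discarded call.
candidates = [
    Candidate("Calculator(400 / 1400)", 1.5, 2.75, 3.0),
    Candidate("Calendar()", 2.5, 2.5, 2.25),
]
kept = filter_calls(candidates, threshold=1.0)
print([c.call for c in kept])
```

With these made-up numbers the calculator call is kept, since its result lowers the loss by 1.25 against the better baseline. The calendar call is dropped, because its result makes the continuation slightly harder to predict.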

Keeping calls that lower loss

That filter is the entire supervision signal, and it is worth dwelling on why it is clever. There is no human telling the model “this sentence needs a calculator.” Instead, usefulness is defined operationally: a tool call is good exactly when its output makes the continuation easier to predict. A calculator call is kept where it actually resolves an arithmetic span; a search call survives where the retrieved snippet genuinely helps the next words. The filtered, in-line annotations are then merged back into the corpus, and the model is fine-tuned on this augmented text with a plain language-modeling objective. Because the API calls are interleaved into ordinary text, the model learns to emit them as a natural part of generation, then incorporate the returned result before continuing.
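To make the in-line annotation concrete, the sketch below fills in the result of each calculator call found in a piece of text. It is an illustration under assumptions: the ASCII "->" separator, the two-decimal rounding, the regular expression, the sample sentence, and the helper names calculator and fill_results are choices made for this example, not details confirmed by the text above.

```python
import ast
import operator
import re

# Binary operators supported by this toy calculator.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str) -> float:
    """Evaluate a basic arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expression!r}")
    return walk(ast.parse(expression.strip(), mode="eval").body)


# Matches an in-line call such as [Calculator(400 / 1400)] (assumed format).
CALL_PATTERN = re.compile(r"\[Calculator\(([^)]*)\)\]")


def fill_results(text: str) -> str:
    """Rewrite each [Calculator(expr)] as [Calculator(expr) -> result]."""
    def substitute(match):
        expr = match.group(1)
        result = round(calculator(expr), 2)
        return f"[Calculator({expr}) -> {result}]"
    return CALL_PATTERN.sub(substitute, text)


text = "Out of 1400 participants, 400 (or [Calculator(400 / 1400)] 29%) passed."
print(fill_results(text))
```

The annotated sentence stays readable as ordinary text, which is what allows a plain language-modeling objective to be applied to it. Only the calculator is handled here; other tools would need their own patterns and executors.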

Key results

Zero-shot gains across tasks. Toolformer (built on a 6.7B-parameter GPT-J) substantially outperforms the same base model on a range of downstream benchmarks, and is often competitive with the much larger GPT-3 (175B) despite being roughly 25x smaller.
Math and factual QA improve most. These are the two areas where vanilla LMs are weakest; the calculator and QA/search tools carry the gains, which is the expected pattern if the filter is doing real work.
Core LM ability is preserved. Fine-tuning on tool-augmented text does not worsen the model’s underlying language-modeling perplexity, a real concern for any method that fine-tunes on augmented data.
Tiny supervision footprint. Each API needs only a handful of human-written demonstrations to bootstrap the sampling step; everything after that is self-generated.

Why it matters now

Toolformer (Feb 2023) is the cleanest early statement of an idea the whole field then ran with: a model that calls tools beats a model that memorizes. It predates and conceptually underpins today’s function-calling and agent APIs — the difference being that Toolformer learns when to call from data, rather than relying on a prompt or a fixed schema at inference. If you want to understand why “tool use” became a first-class capability rather than a prompting hack, this is the paper that framed it as something a model can teach itself.

Limits and open questions

The method’s elegance is also its ceiling. The loss-reduction filter assumes a tool’s value shows up locally in next-token prediction; tools whose payoff is diffuse or multi-step won’t survive it cleanly. Toolformer cannot use tools in a chain: it does not call one API on the output of another, so genuinely compositional reasoning is out of scope. Whether it calls an API at all is sensitive to the exact wording of its input, and each tool is treated independently with no shared budget or cost-awareness. Sampling candidate calls across a large corpus is computationally heavy, and the approach inherits whatever the underlying APIs return: a wrong search result is faithfully trusted. None of this undercuts the core demonstration, but it explains why later agent systems layered planning and multi-tool orchestration on top rather than using Toolformer as-is.

FAQ

How does Toolformer decide when to call a tool?

Toolformer samples candidate API calls at many text positions, executes them, and keeps a call only if inserting its result lowers the model’s loss on the following tokens. Beyond the handful of demonstrations per API, that loss-reduction test is the only supervision: no human labels which sentences need a tool.

What tools can Toolformer use?

A calculator, a question-answering system, two search engines, a machine-translation system, and a calendar. Each is bootstrapped from just a handful of demonstrations.

Is Toolformer the same as OpenAI function calling?

No. Both let a model invoke external functions, but Toolformer learns from data when to call and is fine-tuned on the resulting annotated text, whereas function-calling APIs expose a fixed schema the model is prompted to use at inference time. Toolformer is an early research precursor to that capability.

Can Toolformer chain tools together?

No. Toolformer calls each API independently and cannot feed one tool’s output into another, so multi-step compositional tool use is a stated limitation.

One line: define a tool call as useful when it lowers next-token loss, and a language model can teach itself when to reach for a calculator. Read the original paper on arXiv.