Chain-of-Thought Prompting: How Showing the Steps Unlocks LLM Reasoning
Showing a few worked examples with intermediate reasoning steps lets large models solve multi-step problems: PaLM 540B with eight chain-of-thought exemplars reaches about 57% on GSM8K, beating a fine-tuned GPT-3 paired with a verifier.
Quick answer
Chain-of-thought prompting is a small change to a few-shot prompt: instead of showing the model question-and-answer pairs, you show a question, then worked reasoning, then the answer. With that change, a 540B-parameter model (PaLM 540B) given just eight chain-of-thought exemplars reaches about 57% on the GSM8K grade-school math benchmark. That was state of the art at the time, and higher than a fine-tuned GPT-3 paired with a separately trained answer verifier. There are no gradient updates and no extra training data; the gain comes entirely from how the few-shot examples are written.
What a chain-of-thought prompt looks like
Standard few-shot prompting gives the model exemplars of the form Q: ... A: <final answer>, which implicitly asks the model to jump straight to the answer. Chain-of-thought prompting changes the exemplars to Q: ... A: <a few sentences of reasoning, then the final answer>. Each exemplar spells out the intermediate steps a person would write on scratch paper before committing to a number.
That is the entire method. The authors hand-write roughly eight such exemplars, prepend them to the real question, and let the model imitate the format: it generates its own reasoning trace and only then states the answer. There is no fine-tuning, no architecture change, and no verifier. The reasoning the model produces is not guaranteed to be the “real” computation happening inside it, but writing it out demonstrably changes what the model outputs.
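As a minimal sketch of the difference between the two prompt formats, the Python below builds both from the same exemplars. The Exemplar class, the build_prompt helper, the muffin word problem, and the "The answer is N." closing phrase are illustrative assumptions, not the paper's exemplars or code. The resulting string would be sent to whatever text-completion interface you use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    """One hand-written few-shot example (hypothetical structure)."""
    question: str
    reasoning: str  # worked intermediate steps, used only in chain-of-thought mode
    answer: str     # final answer only


def format_exemplar(ex: Exemplar, chain_of_thought: bool) -> str:
    """Render one exemplar in standard or chain-of-thought form."""
    if chain_of_thought:
        completion = f"{ex.reasoning} The answer is {ex.answer}."
    else:
        completion = f"The answer is {ex.answer}."
    return f"Q: {ex.question}\nA: {completion}"


def build_prompt(exemplars: List[Exemplar], question: str,
                 chain_of_thought: bool = True) -> str:
    """Prepend the exemplars to the real question and leave the last 'A:' open."""
    blocks = [format_exemplar(ex, chain_of_thought) for ex in exemplars]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Illustrative exemplar written for this sketch (not taken from the paper).
EXEMPLARS = [
    Exemplar(
        question="A baker has 3 trays with 4 muffins each and bakes 5 more "
                 "muffins. How many muffins are there?",
        reasoning="3 trays of 4 muffins is 12 muffins. 12 + 5 = 17.",
        answer="17",
    ),
]

if __name__ == "__main__":
    new_question = "A shelf holds 6 boxes of 8 pencils. 10 pencils are removed. How many remain?"
    print(build_prompt(EXEMPLARS, new_question, chain_of_thought=True))
```

Switching chain_of_thought to False produces the standard few-shot prompt from the same exemplars, which makes it easy to compare the two formats on an identical question set.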
Why it only works at scale
The most important and most under-appreciated finding is that chain-of-thought is an emergent ability, not a universal trick. On models below roughly 100B parameters, adding reasoning steps does not help and sometimes hurts — small models write fluent-looking but logically broken chains and end up worse than if they had just guessed. The benefit appears sharply once models pass the ~100B mark and grows with scale.
This is the part most summaries get wrong. Chain-of-thought is often pitched as “just ask the model to think step by step and it gets smarter.” That framing is misleading: the same prompt that lifts a 540B model leaves a 10B model flat or worse. The capability has to already be latent in a large enough model; the prompt only surfaces it. If your model is small, this paper’s recipe will not save you.
Key results
- GSM8K (math word problems): PaLM 540B with chain-of-thought prompting reaches ~57% solve rate, up from ~18% with standard prompting — and surpasses fine-tuned GPT-3 augmented with a trained verifier, the prior state of the art.
- The gains scale, not shrink: across model sizes the chain-of-thought advantage over standard prompting widens as the model grows, which is the signature of an emergent ability rather than a constant offset.
- Three reasoning types, not just math: the method improves arithmetic, commonsense (e.g. CSQA, StrategyQA, date understanding), and symbolic reasoning (last-letter concatenation, coin flips) — and it generalizes to longer symbolic sequences than appeared in the exemplars.
- Generality across models: tested on three families (GPT-3, LaMDA, PaLM), the pattern holds. Larger models benefit and smaller ones do not, so the effect tracks scale rather than any one model family.
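The two symbolic tasks named above are simple enough that their ground truth can be computed directly, which is what makes them useful for testing generalization to longer inputs. The sketch below is one plausible formulation, assuming the coin starts heads up and each person either flips it or leaves it alone; the function names and example inputs are hypothetical and not taken from the paper.

```python
from typing import List


def last_letter_concatenation(name: str) -> str:
    """Concatenate the last letter of each word in the input."""
    return "".join(word[-1] for word in name.split())


def coin_is_heads_up(flips: List[bool]) -> bool:
    """Assume the coin starts heads up; each True entry is one actual flip.

    An even number of flips leaves the coin heads up.
    """
    return sum(flips) % 2 == 0


if __name__ == "__main__":
    # Two-word input, the kind of length an exemplar might use.
    print(last_letter_concatenation("Ada Lovelace"))              # ae
    # Four-word input, longer than the exemplar above.
    print(last_letter_concatenation("Ada Lovelace Alan Turing"))  # aeng
    # One person flips, one does not: the coin ends tails up.
    print(coin_is_heads_up([True, False]))                        # False
```

Because the correct answer is a deterministic function of the input, a model's output on a four-word name or a four-person flip sequence can be scored exactly, even when every exemplar in the prompt used only two.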
Why it matters now
This is the paper that turned “prompt engineering” from folklore into a measurable capability, and it is the conceptual seed of every reasoning model that followed. The o1 / DeepSeek-R1 line — models trained to produce long internal reasoning before answering — is the natural next step: if writing steps in the prompt helps, train the model to generate those steps itself. Chain-of-thought also made the multi-step trace a first-class object you can inspect, critique, sample many times (self-consistency), and reward in reinforcement learning. Almost every modern reasoning technique assumes the model emits intermediate steps; this paper is where that assumption was first shown to pay off.
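The "sample many times" idea (self-consistency) reduces to a majority vote over the final answers of several independently sampled traces. The sketch below assumes each trace ends with a sentence of the form "The answer is N." and that the traces have already been sampled from a model; the regular expression, function names, and example traces are illustrative assumptions.

```python
import re
from collections import Counter
from typing import List, Optional

# Assumed answer format: "The answer is <number>", with optional sign,
# thousands separators, and decimal part.
ANSWER_PATTERN = re.compile(r"The answer is\s+(-?[\d,]*\.?\d+)")


def extract_answer(trace: str) -> Optional[str]:
    """Return the last numeric answer stated in a trace, or None if absent."""
    matches = ANSWER_PATTERN.findall(trace)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def majority_answer(traces: List[str]) -> Optional[str]:
    """Pick the most common extracted answer; ties go to the first one seen."""
    answers = [a for a in (extract_answer(t) for t in traces) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    sampled = [
        "3 trays of 4 is 12. 12 + 5 = 17. The answer is 17.",
        "3 + 4 + 5 = 12. The answer is 12.",
        "There are 12 muffins, plus 5 more. The answer is 17.",
        "I am not sure how to solve this.",
    ]
    print(majority_answer(sampled))  # 17
```

Traces with no parseable answer are dropped rather than counted as a vote, so one malformed generation does not outweigh a consistent majority.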
Limits and open questions
The honest caveats are sharp:
- Scale is a hard gate: below ~100B parameters the method gives little or no benefit and can hurt, so it is a poor fit for small or efficiency-constrained deployments.
- The reasoning is not faithful: a correct-looking chain can reach the right answer for the wrong reasons, and a fluent chain can confidently justify a wrong one. The written steps are not a reliable window into the model’s actual computation, which matters for any safety or auditing use.
- It is still few-shot prompting: results depend on hand-written exemplars and their phrasing, and the paper does not claim that the model learned a general algorithm rather than pattern-matching the exemplar style.
- GSM8K is far from solved: the headline 57%, while state of the art in early 2022, still leaves more than four in ten problems unsolved. Chain-of-thought raised the ceiling without making grade-school math reliable.
FAQ
What is chain-of-thought prompting in one sentence?
Chain-of-thought prompting gives a large language model a few exemplars that show the intermediate reasoning steps before the final answer, prompting the model to generate its own step-by-step trace — which sharply raises accuracy on multi-step reasoning tasks.
Does chain-of-thought prompting work on small models?
No. Chain-of-thought is an emergent ability that appears only around 100B parameters and above; on smaller models it produces fluent but flawed reasoning and often performs worse than standard prompting.
How much did chain-of-thought improve GSM8K?
On GSM8K math word problems, PaLM 540B with eight chain-of-thought exemplars reached about 57%, versus roughly 18% with standard prompting, surpassing the prior best of a fine-tuned GPT-3 plus a verifier — with no additional training.
Is the chain of thought the model’s real reasoning?
Not necessarily. The generated steps reliably change the output and improve accuracy, but they are not a guaranteed faithful record of the model’s internal computation; a chain can reach a correct answer through unsound steps or justify a wrong one.
One line: show a big model the steps, and it learns to walk them — the prompt that became the blueprint for every reasoning model since. Read the original paper on arXiv.