Definition: Chain-of-thought (CoT) is a prompting technique that asks a large language model to produce its reasoning as an explicit sequence of intermediate steps before giving a final answer. First formalized by Jason Wei and collaborators at Google Research in the 2022 paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", CoT became the foundation for every modern reasoning technique: ReAct, tree-of-thought, self-consistency, reflection, and the explicit "thinking" modes in frontier models like Claude Opus, OpenAI o1, and Gemini 3.
The core insight is counterintuitive. A language model is not a calculator. It does not compute in a hidden scratchpad. Its "reasoning" lives in the tokens it generates. If you ask for only the answer, the model has no room to think. If you ask for the steps, the steps become the thinking.
Why Chain-of-Thought Transformed AI
Before CoT, scaling a model meant better recall and smoother text, not better reasoning. GPT-3 could write a sonnet but failed grade-school arithmetic like 43 × 67. The Wei et al. paper showed that when the same model was prompted to show its work, accuracy on math word problems jumped from roughly 18 percent to 57 percent. Nothing about the model changed. Only the prompt.
This discovery rewired AI research. It established two principles that dominate 2026:
- Reasoning is compute at inference time. You trade tokens for thought. More tokens spent on intermediate steps means more capable output, up to a point.
- Emergent abilities appear with scale. CoT barely helps small models (below ~60 billion parameters). It helps frontier models dramatically. Reasoning is an emergent capability that shows up only when the substrate is large enough.
Every "thinking" model shipping today (Claude Opus 4.6, OpenAI o1, DeepSeek R1, Gemini 3 Deep Think) is a direct descendant of chain-of-thought prompting, now baked into the model through reinforcement-learning post-training so the user no longer has to ask for it.
CoT vs Direct Answering
The model has no hidden scratchpad. Its reasoning lives in the tokens it generates. Direct answering gives it one forward pass of compute. CoT gives it as many as it needs.
How Chain-of-Thought Works
The mechanism is simple enough to fit in a sentence: ask the model to reason step by step, and include one or more worked examples in the prompt.
Zero-shot CoT (Kojima et al., 2022): Add "Let's think step by step" to the end of the prompt. No examples needed. This is the cheapest form and the one most chat apps default to.
Few-shot CoT (Wei et al., 2022): Include one to eight demonstration problems in the prompt, each with a worked step-by-step solution. The model pattern-matches the demonstration style when answering the new question.
Self-consistency (Wang et al., 2022): Generate many CoT responses at a higher temperature and take the majority-vote answer. Costs more tokens but boosts accuracy on hard problems.
Automatic CoT (Auto-CoT): The model clusters questions, generates exemplars for each cluster, and constructs the CoT prompt automatically. Useful when you do not have hand-written examples.
Tree-of-thought and graph-of-thought: Instead of one linear chain, the model explores multiple branches, evaluates them, and backtracks. Think chess engine for language. Costs 10 to 100 times more tokens but unlocks tasks single-chain reasoning cannot solve.
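The two prompting variants are simple enough to sketch as string construction. This is a minimal illustration, not any particular model's API: the demonstration problem and helper names are invented for the example, and the trigger phrase follows Kojima et al.

```python
# Sketch of zero-shot vs few-shot CoT prompt construction.
# The demo problem and function names are illustrative only.

ZERO_SHOT_TRIGGER = "Let's think step by step."

# One worked demonstration in the few-shot CoT style (Wei et al., 2022).
DEMO = (
    "Q: A jug holds 4 liters and a cup holds 250 ml. "
    "How many cups fill the jug?\n"
    "A: The jug holds 4 * 1000 = 4000 ml. "
    "4000 / 250 = 16. The answer is 16.\n"
)

def zero_shot_cot(question: str) -> str:
    """Append the zero-shot trigger phrase (Kojima et al., 2022)."""
    return f"Q: {question}\nA: {ZERO_SHOT_TRIGGER}"

def few_shot_cot(question: str) -> str:
    """Prefix the question with a worked demonstration (Wei et al., 2022)."""
    return f"{DEMO}\nQ: {question}\nA:"
```

Either string is then sent to the model as-is; the model continues from the trailing "A:" in the demonstrated step-by-step style.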
The Mechanics: Why Tokens Are Thought
A transformer generates one token at a time, attending to all previous tokens. The model has no memory between calls; it sees only the prompt and the output so far. That means every intermediate step the model produces becomes part of the context for the next token.
If the question is "If Alice has 3 times as many apples as Bob, and together they have 24, how many does Alice have?" a direct-answer model has one forward pass to produce "18." A CoT model produces: "Let Bob have x. Alice has 3x. Together: x + 3x = 4x = 24. x = 6. Alice has 3 × 6 = 18." Every intermediate token gives the model additional depth. The math happens in the output stream.
This is why CoT is strictly more powerful than equivalent-parameter direct answering: the model's effective depth for a hard question grows with the number of reasoning tokens it writes.
CoT vs Agentic Reasoning
CoT is a single-turn pattern. The model reasons and answers, all inside one response. Agentic AI extends CoT across many turns, interleaving reasoning with tool use.
| Pattern | Scope | Example |
|---|---|---|
| Chain-of-thought | One turn, no tools | Solve a math problem |
| ReAct | Multi-turn, reasoning + tools | Answer a question by searching the web |
| Tree-of-thought | Single turn, branching | Solve a game state |
| Agentic RAG | Multi-turn retrieval + reasoning | Research report from many sources |
| Multi-agent | Multi-agent, each using CoT | Manager delegates subtasks to specialists |
The line between CoT and agentic reasoning has blurred since 2024. Frontier models now do internal chain-of-thought before every visible response, and agentic systems like Taskade Genesis use CoT at every step of a multi-turn automation.
Chain-of-Thought in Practice
Three patterns appear in almost every production system:
Hidden CoT: The model thinks step-by-step, but only the final answer is shown to the user. This is how Claude, OpenAI o1, and DeepSeek R1 work by default. Reasoning tokens are still billed but not rendered.
Visible CoT: The reasoning steps are shown in the UI, often in a collapsible "thinking" panel. Useful when users need to trust the answer or catch errors. Taskade EVE's build logs follow this pattern during Genesis app generation.
Structured CoT: The reasoning is constrained to a specific schema (hypothesis, evidence, counter-evidence, conclusion). Common in enterprise analytics agents where explainability is regulated.
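One way to enforce such a schema is a typed record the agent must fill in before its answer is surfaced. A sketch only: the field names mirror the schema above, but the class and method are hypothetical, not a standard.

```python
from dataclasses import dataclass, field

@dataclass
class StructuredChain:
    """One constrained reasoning step; the model must fill every field."""
    hypothesis: str
    evidence: list[str] = field(default_factory=list)
    counter_evidence: list[str] = field(default_factory=list)
    conclusion: str = ""

    def is_complete(self) -> bool:
        # A regulated pipeline can refuse to surface an answer until
        # the chain carries both a hypothesis and a conclusion.
        return bool(self.hypothesis and self.conclusion)
```

In practice the model is prompted to emit JSON matching these fields, the JSON is parsed into the record, and incomplete chains are rejected or retried.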
Failure Modes
CoT is not a silver bullet. Known pitfalls:
1. Faithfulness gap. The steps the model shows do not always match the computation it used internally. A model may produce the correct answer despite an incorrect-looking chain, or vice versa. This matters for safety-critical applications.
2. Verbose drift. Without a token budget, CoT can ramble. Set a maximum output length or structure the chain with headers to keep it focused.
3. Confidence inflation. Models that produce long chains sound more certain even when they are wrong. Combine CoT with self-consistency or grounded retrieval to catch this.
4. Emergent only at scale. CoT does little for small models. If you are running a 7B or 13B local model, expect modest gains. Reserve CoT for frontier-class models.
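The self-consistency mitigation for confidence inflation is easy to sketch: sample several chains, extract each chain's final answer, and keep the majority. The sampled answers below are a stand-in for real model calls.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most common final answer across sampled chains."""
    return Counter(answers).most_common(1)[0][0]

# Stand-in for five higher-temperature samples of the same CoT prompt;
# a real system would call the model five times and parse each chain.
sampled = ["18", "18", "17", "18", "16"]
print(majority_vote(sampled))  # prints 18
```

The vote filters out chains that wandered to a wrong answer, at the cost of N times the tokens.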
Chain-of-Thought in Taskade Genesis
Every Taskade AI agent runs on a CoT-native frontier model. You do not have to prompt for reasoning โ the model reasons by default before calling tools, responding to users, or executing automations.
When EVE builds a Genesis app from a prompt, the process is a chain-of-thought at every level:
- Parse the user's intent (CoT reasoning)
- Draft a plan of tasks and components (CoT reasoning)
- Call build tools in sequence, observing each result (CoT + tool use = ReAct)
- Verify the result against the plan (CoT reflection)
- If something is ambiguous, call the Ask Questions tool and wait for the user
The same pattern runs in your automations: the automation triggers, the agent reasons about the incoming payload, the agent calls tools, the agent reasons about the results, the automation completes.
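That trigger, reason, act, reason cycle can be sketched as a simple loop. Everything here is a stub standing in for a real CoT-native model and tool set; it is not Taskade's actual runtime.

```python
# Sketch of an automation loop: reason about the payload, optionally
# call a tool, reason about the observation, then finish.

def run_automation(payload, model, tools, max_steps=5):
    context = [f"payload: {payload}"]
    for _ in range(max_steps):
        thought, action, arg = model(context)   # one reasoning step
        context.append(f"thought: {thought}")
        if action == "finish":
            return arg                          # final answer
        result = tools[action](arg)             # tool call
        context.append(f"observation: {result}")
    return None                                 # step budget exhausted

# Stub model: look up the order once, then finish with its status.
def stub_model(context):
    if not any(c.startswith("observation:") for c in context):
        return "need the order record", "lookup", "order-42"
    return "status found", "finish", context[-1].split(": ", 1)[1]

stub_tools = {"lookup": lambda order_id: f"{order_id} is shipped"}
print(run_automation("order-42", stub_model, stub_tools))
```

Swapping the stub for a real model call and the lambda for real tools gives the ReAct shape described earlier: reasoning tokens and tool observations accumulate in the same context.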
Related Concepts
- ReAct Pattern: CoT interleaved with tool calls
- Prompt Engineering: the discipline CoT belongs to
- Prompt Chaining: multi-prompt workflows
- Agentic AI: why CoT is the foundation
- Tool Use: what CoT enables in agents
- Large Language Models: where CoT works and why
- Emergent Behavior: why CoT only appears at scale
Frequently Asked Questions About Chain-of-Thought
What is chain-of-thought prompting?
Chain-of-thought (CoT) prompting asks a large language model to produce its reasoning as an explicit sequence of intermediate steps before giving a final answer. It turns language generation into a visible reasoning process and dramatically improves accuracy on multi-step problems.
Who invented chain-of-thought?
CoT was formalized by Jason Wei and collaborators at Google Research in the 2022 paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." A related technique, zero-shot CoT ("Let's think step by step"), was introduced by Takeshi Kojima and collaborators the same year.
Does chain-of-thought work on every model?
No. CoT is an emergent capability that shows up only in sufficiently large models (roughly 60+ billion parameters). Small models produce reasoning-shaped text without the underlying competence. Frontier models from OpenAI, Anthropic, and Google all use CoT effectively; most small open-source models do not.
What is the difference between CoT and ReAct?
Chain-of-thought is a single-turn reasoning pattern: the model reasons and answers in one response. The ReAct pattern extends CoT across multiple turns by interleaving reasoning with tool calls, so the model can observe the world between steps.
How do Taskade agents use chain-of-thought?
Every Taskade AI agent runs on a CoT-native frontier model from OpenAI, Anthropic, or Google. The model reasons step-by-step before tool calls, during automations, and while building Genesis apps. You do not prompt for CoT; the models do it automatically.
Further Reading
- What Is Agentic AI? (why CoT is the base layer)
- What Are AI Agents? (how CoT becomes autonomy)
- Multi-Agent Collaboration: Production Lessons
- The Perceptron: the 67-year arc from weights to reasoning
