What Is Grokking in AI? When Models Suddenly Learn to Generalize (2026)
Grokking is when neural networks suddenly transition from memorizing data to truly understanding patterns. Discovered by accident at OpenAI, this phenomenon reveals how AI models learn trigonometric identities to solve math — and what it means for the future of AI. Updated March 2026.
In 2021, a researcher at OpenAI was training a small neural network on a simple math problem — modular arithmetic. The model memorized the training examples quickly, and the results looked unremarkable. So the researcher went on vacation and left the experiment running.
When they came back, something had changed. The model, which had shown zero improvement on unseen data for thousands of training steps, had suddenly achieved perfect generalization. Not gradual improvement. Not a slow climb. A near-instantaneous leap from rote memorization to genuine understanding.
The team called this phenomenon grokking — after the Martian word from Robert Heinlein's 1961 novel Stranger in a Strange Land, meaning to understand something so deeply that you merge with it. And the name stuck, because what this small model did was genuinely alien.
This is one of the most surprising discoveries in modern AI research. It challenges everything we thought we knew about how neural networks learn, when they learn, and what they're really doing beneath the surface. 🧪
TL;DR: Grokking is when a neural network suddenly transitions from memorizing training data to truly understanding the underlying pattern — often thousands of training steps after memorization is complete. Discovered by accident at OpenAI in 2021, grokking reveals that models can build hidden trigonometric solutions while appearing stagnant. Build with AI agents that learn from your data →

🧠 What Is Grokking?
Grokking is a sudden phase transition in neural network training where a model shifts from memorizing its training data to genuinely generalizing — understanding the underlying pattern well enough to solve examples it has never seen. The transition happens after an extended period of apparent stagnation, during which standard training metrics show no improvement whatsoever.
The term was introduced in the January 2022 paper "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets" by researchers at OpenAI. They borrowed it from Heinlein's science fiction novel, where the Martian word means to understand something so profoundly that you become one with it. It is a fitting name — the model does not merely learn a shortcut or heuristic. It discovers the actual mathematical structure of the problem.
Andrej Karpathy, former director of AI at Tesla, has described the strangeness of neural network behavior this way: "Training LLMs is less like building animal intelligence and more like summoning ghosts." Grokking is perhaps the clearest example of why. A model sits there, seemingly stuck, and then — without any change to the training process — it spontaneously reorganizes its internal representations and solves the problem perfectly.
| Aspect | Memorization | Standard Learning | Grokking |
|---|---|---|---|
| Training performance | Perfect | Improves gradually | Perfect early |
| Test performance | Poor | Improves with training | Flat, then sudden jump |
| Internal structure | Lookup table | Gradual feature extraction | Hidden structure building |
| When it happens | Immediately | During training | Long after memorization |
| What the model learns | Input-output pairs | Approximate patterns | Exact mathematical structure |
Understanding grokking matters because it reveals that what a model appears to know and what it actually knows can be completely different things. This has profound implications for AI safety, model evaluation, and our understanding of how large language models work.
🔬 The Accidental Discovery
The story of grokking begins with a simple experiment at OpenAI in 2021. Researchers were training small transformer models on algorithmic tasks — the kind of clean mathematical problems where you can verify whether a model truly understands the pattern or is just memorizing.
The task was modular arithmetic: given two numbers X and Y, compute (X + Y) mod P, where P is a prime number. Think of it as clock math — on a clock with P hours, if you start at hour X and move forward Y hours, where do you land?
For a small prime like P = 5, the complete dataset is a 5 by 5 table:
| + (mod 5) | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4 |
| 1 | 1 | 2 | 3 | 4 | 0 |
| 2 | 2 | 3 | 4 | 0 | 1 |
| 3 | 3 | 4 | 0 | 1 | 2 |
| 4 | 4 | 0 | 1 | 2 | 3 |
The researchers split this table into training and test sets — say, 70% of the cells for training and 30% held out. They trained a small transformer on the training portion and watched what happened.
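This setup is easy to reproduce. Here is a minimal sketch in NumPy (the 70/30 split matches the description above; the seed and the exact split procedure are illustrative choices) that builds the complete modular-addition table for the benchmark prime P = 113 and holds out 30% of it:

```python
import numpy as np

P = 113  # the benchmark prime used in the main grokking experiments
pairs = np.array([(x, y) for x in range(P) for y in range(P)])  # every table cell
labels = (pairs[:, 0] + pairs[:, 1]) % P                        # (x + y) mod P

# Shuffle the table, then hold out 30% of the cells as the test set.
rng = np.random.default_rng(0)
idx = rng.permutation(len(pairs))
split = int(0.7 * len(pairs))
train_X, train_y = pairs[idx[:split]], labels[idx[:split]]
test_X, test_y = pairs[idx[split:]], labels[idx[split:]]

print(train_X.shape, test_X.shape)  # (8938, 2) (3831, 2)
```

Because the full dataset is only P × P = 12,769 examples, the model sees every training cell many times over, which is exactly what makes rapid memorization possible.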
The initial results were unsurprising. The model memorized the training data within a few hundred steps. Training accuracy hit 100%. But test accuracy — performance on the held-out examples — stayed near random chance. The model had memorized which input pairs mapped to which outputs without learning any underlying rule.
At this point, most researchers would stop the experiment. The model had converged. The loss was flat. Standard practice says: your model has overfit, move on.
But the researcher left the training running — inadvertently, while on vacation. And when they returned days later, they checked the logs and found something nobody expected.
Somewhere around step 7,000, the test accuracy had jumped from near-zero to 100%. Not gradually. The curve looked like a step function — flat, flat, flat, then perfect. The model had gone from a pure memorizer to a perfect generalizer, all while the researcher was not even watching.
The OpenAI team published these findings in January 2022, and the paper sent shockwaves through the machine learning community. It raised an uncomfortable question: how many models have we stopped training just before they were about to grok?
📐 The Modular Arithmetic Problem
To understand why grokking is remarkable, you need to understand what the model is actually seeing — because it is not seeing "numbers."
Modular arithmetic is clock math. On a 12-hour clock, 10 + 5 = 3, because you wrap around past 12. The same idea works with any modulus P. When P = 113 (the prime number used in the key grokking experiments), you have a clock with 113 hours.
But here is the critical detail: the neural network does not receive the numbers 0 through 112 as numeric values. Instead, each number is represented as a one-hot encoded vector — a list of 113 zeros with a single 1 in the position corresponding to that number.
Input: 47 + 81 = ? (mod 113)

Token "47": [0,0,...,0,1,0,...,0]  ← 114-dim one-hot, 1 at position 47
Token "81": [0,0,...,0,1,0,...,0]  ← 114-dim one-hot, 1 at position 81
Token "=":  [0,0,...,0,0,0,...,1]  ← 114-dim one-hot, 1 at position 113

Total input: 114 × 3 matrix (113 digits + equals sign, 3 tokens)
The model receives three tokens — the first number, the second number, and an equals sign — each represented as a 114-dimensional one-hot vector (113 possible digits plus the equals token). That is a 114 by 3 input matrix.
From the model's perspective, there is no inherent relationship between "47" and "48." They are just two completely different patterns of zeros and ones. The model has no concept of "numbers" or "addition." It must discover the mathematical structure entirely from the patterns in the data.
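Concretely, the encoding can be sketched in a few lines of NumPy. The placement of the "=" token at index 113 is an assumption for illustration; the key point is that each token is an arbitrary basis vector with no numeric meaning:

```python
import numpy as np

P = 113
VOCAB = P + 1   # the 113 digits plus one "=" token
EQUALS = P      # assumption: "=" sits at the last index, 113

def one_hot(token):
    """Encode a token id as a 114-dimensional one-hot vector."""
    v = np.zeros(VOCAB)
    v[token] = 1.0
    return v

# "47 + 81 = ?" becomes three one-hot columns: a 114 x 3 input matrix.
X = np.stack([one_hot(t) for t in (47, 81, EQUALS)], axis=1)

print(X.shape)                        # (114, 3)
print(np.argmax(X, axis=0).tolist())  # [47, 81, 113]
```

Notice that the vectors for 47 and 48 are orthogonal: nothing in this representation hints that they are neighbors on a number line.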
The architecture is a small transformer: an embedding matrix (114 to 128 dimensions), an attention block, an MLP (multi-layer perceptron), and an unembedding layer that maps back to 113 possible outputs.
One-Hot Input (114×3)
│
▼
┌─────────────────┐
│ Embedding │ 114 → 128 dimensions
│ Matrix │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Attention │ Token interactions
│ Block │
└────────┬────────┘
│
▼
┌─────────────────┐
│ MLP │ Where the magic happens
│ (2 layers) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Unembedding │ 128 → 113 outputs
│ Matrix │
└─────────────────┘
│
▼
Answer: 15
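The diagram above can be traced as a shape-level sketch in NumPy. The weights below are random and untrained, so the outputs are meaningless; the point is only the data flow from 114-dimensional one-hots to 113 answer logits. The single-head attention and the 4x MLP expansion are illustrative assumptions, not the exact configuration from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D_MODEL, P = 114, 128, 113

# Random, untrained weights: this sketch only traces shapes and data flow.
W_embed = rng.normal(0, 0.02, (VOCAB, D_MODEL))       # embedding: 114 -> 128
W_q, W_k, W_v = (rng.normal(0, 0.02, (D_MODEL, D_MODEL)) for _ in range(3))
W_mlp1 = rng.normal(0, 0.02, (D_MODEL, 4 * D_MODEL))  # MLP hidden (4x is a guess)
W_mlp2 = rng.normal(0, 0.02, (4 * D_MODEL, D_MODEL))
W_unembed = rng.normal(0, 0.02, (D_MODEL, P))         # unembedding: 128 -> 113

def forward(token_ids):
    x = W_embed[token_ids]                 # (3, 128): three embedded tokens
    # Single-head self-attention over the three tokens.
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(D_MODEL)
    attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    x = x + attn @ v
    # Two-layer MLP with ReLU (the block "where the magic happens").
    x = x + np.maximum(x @ W_mlp1, 0) @ W_mlp2
    # Read logits over the 113 possible answers off the final ("=") position.
    return x[-1] @ W_unembed               # (113,)

logits = forward(np.array([47, 81, 113]))  # encodes "47 + 81 = ?"
print(logits.shape)                        # (113,)
```

Training would adjust these weights by gradient descent with weight decay; everything interesting in the grokking story is about what structure those weights end up containing.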
The question is: what does this model learn internally? A lookup table of memorized answers? Or something far more elegant?
🌊 The Three Phases of Grokking
Careful analysis of grokking reveals three distinct phases, each with radically different internal dynamics. What makes grokking so striking is that standard training metrics cannot distinguish Phase 2 from convergence — the model appears stuck while secretly building a solution.
Phase 1: Memorization (~0-200 steps)
The model rapidly memorizes the training examples. Within about 200 training steps, training accuracy reaches 100%. The model has essentially built an internal lookup table — for each training input pair (X, Y), it has stored the correct answer.
At this point, the model's internal representations show no discernible mathematical structure. If you visualize the neuron activations in the MLP layer, they look like noise. The model treats each input pair as an independent fact to be stored, with no relationship between (3 + 7) and (4 + 6) even though both equal 10 mod 113.
Test performance is at or near chance level. The model has memorized answers, not learned a rule.
Phase 2: Structure Building (~200-7,000 steps)
This is the phase that makes grokking genuinely mysterious. Both training loss and test loss appear completely flat. Standard metrics suggest the model has converged and nothing is changing.
But something is changing.
In early 2023, Neel Nanda and collaborators published a groundbreaking analysis showing exactly what happens during this "dormant" phase. Using a metric called excluded loss — which strips specific frequency components from the model's output before measuring performance — they proved that the model is steadily building trigonometric representations beneath its memorized solution.
Here is what excluded loss reveals: remove the memorization component from the model's output, and you can see a new signal growing stronger step by step. The model is constructing sine and cosine functions of its inputs in the embedding layer, wiring them together through the MLP, and slowly building the machinery for a fundamentally different solution strategy.
The reason standard metrics miss this is simple: the memorized solution works perfectly on the training data and masks the emerging trigonometric solution. It is like watching someone build a new house inside an old house — from the outside, nothing changes until the old walls come down.
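The core move behind excluded loss can be illustrated with synthetic signals. The sketch below is a toy, not the paper's exact metric: the key frequencies are hypothetical, the "memorized" component is stand-in noise, and real excluded loss ablates components of the model's logits before computing cross-entropy. But it shows the idea: zero out the chosen Fourier bins, and whatever the trigonometric solution contributes vanishes, leaving only the memorization component to measure.

```python
import numpy as np

P = 113
z = np.arange(P)
key_freqs = [14, 35, 41]  # hypothetical key frequencies, for illustration only

# The emerging trigonometric solution: a few sparse cosine components.
trig = sum(np.cos(2 * np.pi * k * z / P) for k in key_freqs)
# A stand-in for the memorized solution: structureless noise.
memorized = np.random.default_rng(0).normal(size=P)

logits = memorized + 0.5 * trig  # the model's output mixes both solutions

# "Exclude" the key frequencies by zeroing them in the Fourier domain.
spec = np.fft.rfft(logits)
spec[key_freqs] = 0.0
excluded = np.fft.irfft(spec, n=P)

# Every trace of the trig solution is gone from the excluded signal,
# which is why tracking it over training isolates the memorization part.
print(abs(excluded @ trig) < 1e-6)  # True: orthogonal to the trig solution
```

Applied at every training step, this kind of ablation is what let researchers watch the trigonometric solution grow while the headline loss curves stayed flat.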
Phase 3: Generalization + Cleanup (~7,000+ steps)
The phase transition. Around step 7,000, the trigonometric solution becomes strong enough to compete with the memorized solution. Test accuracy shoots from near-zero to near-perfect in a span of just a few hundred steps.
But there is a second, equally important process: cleanup. After generalization, the model actively removes its memorized representations. The internal lookup table that served it through Phase 1 is dismantled, and the clean trigonometric solution is all that remains.
Performance
      │
 100% ├────■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■  Training
      │                          ┌──■■■■■■  Test
  50% │                          │
      │                          │
   0% ├──■■■■■■■■■■■■■■■■■■■■■■■■┘
      └──┬─────────┬─────────┬────┬─────▶  Steps
         0       2000      5000  7000
      ◄─Phase 1─►◄────Phase 2────►◄─Phase 3─►
       Memorize   Build Structure  Generalize
This three-phase pattern has been replicated across many modular arithmetic tasks and other algorithmic problems. It appears to be a fundamental property of how small neural networks discover mathematical structure — and it raises deep questions about what might be happening inside large language models during their own training.

🎵 The Trigonometric Solution
This is the most extraordinary part of the grokking story. When researchers cracked open the model and examined what it had learned, they did not find a cleverer lookup table or a brute-force approximation. They found that the neural network had independently discovered trigonometry.
What the Embedding Layer Learns
After grokking, the embedding matrix transforms each one-hot input into a 128-dimensional vector. When researchers applied a sparse linear probe — a technique that finds interpretable directions in high-dimensional space — they discovered that the most important components of these embedding vectors were sine and cosine functions of the input values.
For each input number x, the embedding contains strong representations of sin(2πkx/113) and cos(2πkx/113) for specific frequencies k. The model had discovered that circular functions are the natural way to represent numbers on a modular clock.
What the MLP Neurons Compute
The MLP (multi-layer perceptron) is where the computation happens. When researchers plotted the output of individual MLP neurons as a function of the two inputs x and y, they found sweeping sine wave patterns.
Even more revealing: when they plotted pairs of neurons against each other as scatter plots, the data points traced out circles and loops. This is the geometric signature of sine and cosine — the model had organized its neurons into circular representations.
A discrete Fourier transform of the neuron activations confirmed specific dominant frequencies: 8π/113 and 6π/113 appeared as the strongest components. The model had not learned all possible frequencies — it had selected a sparse set of frequencies that were sufficient to solve the problem.
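This Fourier analysis is easy to reproduce on a synthetic signal. In the 2πkx/113 parameterization, the 8π/113 component mentioned above corresponds to k = 4. The snippet below builds an idealized "grokked neuron" (a pure cosine, not a real trained activation) and confirms that the DFT recovers a single dominant frequency:

```python
import numpy as np

P = 113
x = np.arange(P)
k = 4  # the 8*pi/113 component corresponds to k = 4
activation = np.cos(2 * np.pi * k * x / P)  # idealized "grokked neuron"

# A discrete Fourier transform exposes the sparse frequency content.
spectrum = np.abs(np.fft.rfft(activation))
print(int(np.argmax(spectrum)))  # 4: one dominant frequency, the rest near zero
```

Real grokked neurons are noisier mixtures, but their spectra show the same signature: a handful of sharp peaks instead of a broad spread.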
The Sum-of-Angles Identity
Here is where it all comes together. The model needs to compute (x + y) mod 113 — it needs to find the sum of its two inputs. But the MLP's basic operation is multiplication (matrix multiply followed by nonlinearity). How do you convert products into sums?
The answer is a trigonometric identity that every calculus student has seen:
cos(x + y) = cos(x) · cos(y) - sin(x) · sin(y)
This is the sum-of-angles identity. It converts the sum (x + y) — which is what the model needs — into a combination of products of cos(x), cos(y), sin(x), and sin(y) — which are exactly the representations the embedding layer has built.
The Model's Solution (decoded):
Step 1: Embed → sin(kx), cos(kx), sin(ky), cos(ky)
Step 2: Products → cos(kx)·cos(ky), sin(kx)·sin(ky)
Step 3: Identity → cos(kx)·cos(ky) - sin(kx)·sin(ky) = cos(k(x+y))
Step 4: Decode → "For which answer z does cos(k(x+y)) peak?"
The model computes cos(kx) · cos(ky) as the strongest component of certain MLP neurons. It then combines these with the sin products via the sum-of-angles identity to produce cos(k(x + y)) — a function that depends only on the sum of x and y, which is exactly the quantity it needs to compute.
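The decoded pipeline above can be checked numerically. This sketch uses k = 4 (matching the 8π/113 component; any frequency works) and the article's running example 47 + 81. It verifies that products of sines and cosines collapse into a single cosine of the sum, and that reducing the sum mod 113 leaves that cosine unchanged, since the angle shifts by an exact multiple of 2π:

```python
import math

P, k = 113, 4  # modulus and an illustrative key frequency
x, y = 47, 81  # the running example; (47 + 81) mod 113 = 15

# Products of the embedded sin/cos features...
prod = (math.cos(2 * math.pi * k * x / P) * math.cos(2 * math.pi * k * y / P)
        - math.sin(2 * math.pi * k * x / P) * math.sin(2 * math.pi * k * y / P))

# ...equal a single cosine of the sum (the sum-of-angles identity)...
direct = math.cos(2 * math.pi * k * (x + y) / P)

# ...and the cosine cannot tell x + y = 128 apart from 128 mod 113 = 15,
# because the two angles differ by exactly 2*pi*k.
reduced = math.cos(2 * math.pi * k * ((x + y) % P) / P)

print(abs(prod - direct) < 1e-12, abs(direct - reduced) < 1e-9)  # True True
```

The last line is the crucial trick: the circular representation handles the "mod" for free, which is why trigonometry is such a natural fit for clock math.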
Diagonal Symmetry
The most visually striking evidence comes from plotting neuron activations as a heat map over all possible (x, y) pairs. After grokking, individual neurons show diagonal stripe patterns: a neuron fires strongly for all input pairs where x + y equals the same value (modulo 113).
          y
         0  1  2  3  4
       ┌──┬──┬──┬──┬──┐
     0 │0 │1 │2 │3 │4 │  ← Diagonal lines =
     1 │1 │2 │3 │4 │0 │    same sum (mod 5)
  x  2 │2 │3 │4 │0 │1 │
     3 │3 │4 │0 │1 │2 │  The model learns to
     4 │4 │0 │1 │2 │3 │  fire along these
       └──┴──┴──┴──┴──┘  diagonals!
For the actual experiment with mod 113, a neuron might fire for all (x, y) pairs where x + y = 65 (mod 113). That includes (0, 65), (1, 64), (2, 63), and so on — but also (100, 78), because 100 + 78 = 178, and 178 mod 113 = 65. The neuron fires along a diagonal that wraps around the grid, which is exactly what you would expect from a circular (trigonometric) representation.
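The wrap-around claim is straightforward to verify. Using the same illustrative frequency k = 4, the circular feature a "sum = 65" neuron would respond to takes the same value for (0, 65), (1, 64), and the wrap-around pair (100, 78):

```python
import math

P, k = 113, 4  # modulus and an illustrative key frequency

def feature(x, y):
    """The circular feature a 'sum = const' neuron would respond to."""
    return math.cos(2 * math.pi * k * (x + y) / P)

# (0, 65), (1, 64), and the wrap-around pair (100, 78) all sum to 65 mod 113,
# so the circular feature is identical for each of them.
assert (100 + 78) % P == 65
print(math.isclose(feature(0, 65), feature(1, 64), abs_tol=1e-9))    # True
print(math.isclose(feature(0, 65), feature(100, 78), abs_tol=1e-9))  # True
```

A lookup table would have to store the wrap-around pairs separately; the circular feature gets them for free.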
This is a neural network that was given nothing but random-looking patterns of ones and zeros. It received no hint that the numbers it was working with had any circular structure. And yet it independently discovered:
- That circular functions are the right representation
- That specific frequencies capture the necessary information
- That the sum-of-angles identity converts products into sums
- That diagonal symmetry over the input grid solves the problem
As Nanda's analysis concluded, grokking gives us a transparent box in a world of black boxes — a case where we can fully trace how a neural network solves a problem, from input to output, with no hidden mystery.

🔍 Why Grokking Matters
Grokking is a fundamental phenomenon that reveals hidden dynamics of how neural networks learn, with consequences that stretch from training methodology to AI safety.
Hidden Learning Is Real
The most important lesson of grokking is that models can appear to have stopped learning while secretly developing new capabilities. During Phase 2, every standard metric — training loss, test loss, validation accuracy — shows no improvement. Any reasonable practitioner would conclude the model has converged.
But the model is not converged. It is actively building a fundamentally different solution strategy. The only way to detect this hidden learning is with specialized probes like excluded loss or mechanistic analysis of internal representations.
This has immediate practical implications: we may be stopping training too early on many models. If grokking-like dynamics occur in larger networks (and there is growing evidence they do), we could be leaving significant generalization performance on the table by using standard early stopping criteria.
Implications for AI Safety
For the AI safety community, grokking is a cautionary tale. If a model can develop hidden capabilities that are invisible to standard evaluation during training, then our current safety evaluation methods may be fundamentally insufficient.
Consider the alignment scenario: a model might appear to behave as intended during evaluation — all benchmarks look good, all safety filters pass — while internally developing representations that could lead to unexpected behavior. Grokking proves that this is not a theoretical concern. It happens in practice, in simple models, on simple tasks.
The phenomenon also connects to ongoing work in mechanistic interpretability — the effort to understand what neural networks are computing internally, rather than evaluating them only by their outputs. Grokking models are valuable test cases because we can verify mechanistic explanations against the known trigonometric solution.
Emergent Capabilities in Large Models
Researchers have observed grokking-like phenomena in larger models and more complex tasks. The sudden appearance of new capabilities in large language models — what the field calls "emergent abilities" — may share underlying mechanisms with grokking in small transformers.
When a model suddenly becomes capable of chain-of-thought reasoning, or suddenly learns to follow instructions, or suddenly develops the ability to do arithmetic at scale — these phase transitions echo the sudden generalization in grokking. The connection remains an active area of research, but grokking provides a concrete, interpretable example of how such sudden shifts can occur.
The broader question for AI development is whether we can predict when these transitions will happen, and whether we can design training procedures that encourage them rather than leaving them to chance.
🤖 From Grokking to Workspace Intelligence
Grokking demonstrates that AI systems can develop deep understanding from exposure to data — moving from surface-level pattern matching to discovering the fundamental structure of a problem. This principle resonates with how modern AI agents develop contextual intelligence in practical applications.
Taskade AI agents embody a similar philosophy of progressive understanding. When you train an agent on your workspace data — documents, conversations, project histories, and team workflows — it does not just memorize keywords. Through persistent memory and knowledge training, agents build progressively deeper representations of how your team operates.
Here is the connection:
- Grokking models go from memorizing (X + Y) mod 113 to discovering trigonometric identities
- Taskade agents go from matching keywords to understanding workflow context — knowing that a "sprint review" means different things to your engineering team versus your marketing team
This progression is powered by Workspace DNA: Memory (persistent context from projects and documents), Intelligence (11+ frontier models from OpenAI, Anthropic, and Google), and Execution (100+ integrations and automated workflows that act on insights).
What you can build today:
- Custom AI agents with 22+ built-in tools, persistent memory, and slash commands that understand your domain
- Automated workflows that trigger based on context — not just rules — using Temporal durable execution
- Genesis Apps that turn prompts into live dashboards, portals, and tools your team can use immediately
- Multi-agent teams where specialized agents collaborate on complex tasks, each with their own knowledge base
The same way grokking reveals that small models can discover profound mathematical truths, workspace AI reveals that agents with the right data and architecture can develop genuine operational intelligence. Try building your first AI agent →
🔮 The Bigger Picture
From a word in a 1961 science fiction novel about a Martian who understood things so deeply he could make them disappear, "grokking" has become one of the most important concepts in modern AI research.
The phenomenon reminds us that artificial intelligence is genuinely strange. A neural network with no concept of trigonometry, trained on nothing but patterns of zeros and ones, independently discovers sine waves, Fourier analysis, and the sum-of-angles identity. It finds the same solution that took human mathematicians centuries to develop — and it finds it by accident, while its trainer was on vacation.
Karpathy's observation keeps proving true: "Training LLMs is less like building animal intelligence and more like summoning ghosts." These intelligences are alien. They do not think the way we think. They find solutions we would not consider, using representations we can barely interpret.
But grokking also offers hope. It is one of the few cases in all of AI where we can fully understand what a neural network is doing — a transparent box in a world of black boxes. As the field of mechanistic interpretability grows, grokking provides both inspiration and methodology. If we can understand grokking, maybe we can understand the rest.
The next time you watch a training loss curve flatten out and think "time to stop," remember: the model might be about to grok. 🧪
❓ Frequently Asked Questions
What is grokking in AI?
Grokking is a phenomenon in neural network training where a model suddenly transitions from memorizing its training data to truly generalizing — understanding the underlying mathematical pattern. It typically occurs long after the model has perfectly memorized the training set, during a period when standard training metrics show no improvement. The term was coined by OpenAI researchers in 2022, borrowing from Robert Heinlein's 1961 novel Stranger in a Strange Land.
How is grokking different from normal learning?
In normal machine learning, training and test performance improve together — as the model learns the training data, it simultaneously gets better at unseen examples. In grokking, the model first memorizes the training data (with no test improvement), then appears to stagnate for thousands of steps, and finally achieves sudden perfect generalization. The key difference is the delayed phase transition between memorization and understanding.
Why did the model discover trigonometry?
The model was not taught trigonometry or given any mathematical hints. It discovered trigonometric representations because circular functions are the natural solution to modular arithmetic. Modular arithmetic is inherently circular — numbers "wrap around" after reaching the modulus, just like hours on a clock. Sine and cosine functions are the mathematical tools that describe circular behavior. The model found the most efficient solution through gradient descent, and that solution happened to be trigonometry.
Can grokking happen in large language models?
Evidence suggests that grokking-like dynamics occur in larger models and more complex tasks. The sudden emergence of new capabilities in large language models — such as chain-of-thought reasoning or instruction following — may share underlying mechanisms with grokking. However, the clean three-phase pattern is most clearly demonstrated in small models on algorithmic tasks, where the full internal computation can be analyzed.
What is excluded loss and why does it matter?
Excluded loss is a diagnostic metric that removes specific frequency components from a model's output before measuring performance. It was developed by Neel Nanda and collaborators to reveal hidden learning during Phase 2 of grokking. Standard loss metrics cannot detect the trigonometric solution being built because the memorized solution masks it. Excluded loss strips away the memorization component, revealing steady progress toward generalization even when the model appears stuck.
What does grokking mean for AI training practices?
Grokking suggests that early stopping — halting training when metrics plateau — may cause us to miss significant generalization improvements. It also suggests that weight decay and regularization play important roles in encouraging the transition from memorization to generalization. Stronger regularization tends to speed up grokking, while weaker regularization delays it.
How does grokking connect to mechanistic interpretability?
Grokking is a cornerstone case study in mechanistic interpretability — the field of reverse-engineering neural networks to understand their internal computations. Because the grokked solution (trigonometric identities for modular arithmetic) is mathematically clean and fully understood, researchers can verify their interpretability techniques against a known ground truth. This makes grokking models invaluable testbeds for developing tools that might eventually explain frontier models.
How can I experiment with grokking myself?
You can reproduce grokking with relatively modest compute. Train a small transformer (1-2 layers, 128-dimensional embeddings) on modular addition with a prime modulus (P = 113 is the standard benchmark). Use about 70% of the complete dataset for training, apply weight decay regularization, and train for at least 10,000 steps. Monitor both training and test accuracy — you should see the characteristic flat period followed by sudden generalization.
🚀 Build AI That Understands Your Work
Grokking shows that neural networks can achieve deep understanding — not just memorization. Taskade AI agents bring that same principle to your workspace.
- ✅ Custom AI agents with persistent memory and 22+ built-in tools
- ✅ Knowledge training — agents learn from your docs, projects, and team data
- ✅ Multi-model support — 11+ frontier models from OpenAI, Anthropic, and Google
- ✅ Automated workflows with 100+ integrations
- ✅ Genesis Apps — build live tools from prompts, deploy instantly
👉 Start building with Taskade AI agents →
💡 Before you go... Check out these related articles:
- What Are AI Agents? — How autonomous agents plan, reason, and act
- How Do LLMs Work? — Transformers, training, and inference explained
- What Is Mechanistic Interpretability? — Reverse-engineering neural networks
- What Is Generative AI? — The technology behind modern AI
- What Is OpenAI? — History of ChatGPT and GPT models
- What Is Anthropic? — History of Claude AI
- Agentic Workspaces — AI-powered workspace intelligence
- From Bronx Science to Taskade Genesis — Connecting the dots of AI history
- They Generate Code. We Generate Runtime — The Genesis Manifesto
- The BFF Experiment — From Noise to Life
- What Is Artificial Life? — How intelligence emerges from code
- What Is Intelligence? — From neurons to AI agents
- Explore Taskade Community — Templates, agents, and workflows




