What Is Grokking in AI? When Models Suddenly Learn to Generalize (2026)
Grokking is when neural networks suddenly transition from memorizing data to truly understanding patterns. Discovered by accident at OpenAI, this phenomenon reveals how AI models learn trigonometric identities to solve math — and what it means for the future of AI. Updated March 2026.
In 2021, a researcher at OpenAI was training a small neural network on a simple math problem — modular arithmetic. The model memorized the training examples quickly, and the results looked unremarkable. So the researcher went on vacation and left the experiment running.
When they came back, something had changed. The model, which had shown zero improvement on unseen data for thousands of training steps, had suddenly achieved perfect generalization. Not gradual improvement. Not a slow climb. A near-instantaneous leap from rote memorization to genuine understanding.
The team called this phenomenon grokking — after the Martian word from Robert Heinlein's 1961 novel Stranger in a Strange Land, meaning to understand something so deeply that you merge with it. And the name stuck, because what this small model did was genuinely alien.
This is one of the most surprising discoveries in modern AI research. It challenges everything we thought we knew about how neural networks learn, when they learn, and what they're really doing beneath the surface. 🧪
TL;DR: Grokking is when a neural network suddenly transitions from memorizing training data to truly understanding the underlying pattern — often thousands of training steps after memorization is complete. Discovered by accident at OpenAI in 2021, grokking reveals that models can build hidden trigonometric solutions while appearing stagnant. Build with AI agents that learn from your data →

🧠 What Is Grokking?
Grokking is a sudden phase transition in neural network training where a model shifts from memorizing its training data to genuinely generalizing — understanding the underlying pattern well enough to solve examples it has never seen. The transition happens after an extended period of apparent stagnation, during which standard training metrics show no improvement whatsoever.
The term was introduced in the January 2022 paper "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets" by researchers at OpenAI. They borrowed it from Heinlein's science fiction novel, where the Martian word means to understand something so profoundly that you become one with it. It is a fitting name — the model does not merely learn a shortcut or heuristic. It discovers the actual mathematical structure of the problem.
Andrej Karpathy, former director of AI at Tesla, has described the strangeness of neural network behavior this way: "Training LLMs is less like building animal intelligence and more like summoning ghosts." Grokking is perhaps the clearest example of why. A model sits there, seemingly stuck, and then — without any change to the training process — it spontaneously reorganizes its internal representations and solves the problem perfectly.
| Aspect | Memorization | Standard Learning | Grokking |
|---|---|---|---|
| Training performance | Perfect | Improves gradually | Perfect early |
| Test performance | Poor | Improves with training | Flat, then sudden jump |
| Internal structure | Lookup table | Gradual feature extraction | Hidden structure building |
| When it happens | Immediately | During training | Long after memorization |
| What the model learns | Input-output pairs | Approximate patterns | Exact mathematical structure |
Understanding grokking matters because it reveals that what a model appears to know and what it actually knows can be completely different things. This has profound implications for AI safety, model evaluation, and our understanding of how large language models work.
🔬 The Accidental Discovery
The story of grokking begins with a simple experiment at OpenAI in 2021. Researchers were training small transformer models on algorithmic tasks — the kind of clean mathematical problems where you can verify whether a model truly understands the pattern or is just memorizing.
The task was modular arithmetic: given two numbers X and Y, compute (X + Y) mod P, where P is a prime number. Think of it as clock math — on a clock with P hours, if you start at hour X and move forward Y hours, where do you land?
For a small prime like P = 5, the complete dataset is a 5 by 5 table:
| + (mod 5) | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4 |
| 1 | 1 | 2 | 3 | 4 | 0 |
| 2 | 2 | 3 | 4 | 0 | 1 |
| 3 | 3 | 4 | 0 | 1 | 2 |
| 4 | 4 | 0 | 1 | 2 | 3 |
The researchers split this table into training and test sets — say, 70% of the cells for training and 30% held out. They trained a small transformer on the training portion and watched what happened.
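This setup is easy to reproduce. Here is a minimal sketch in NumPy (the 70/30 split matches the description above; the seed and the exact split procedure are illustrative choices) that builds the complete modular-addition table for the benchmark prime P = 113 and holds out 30% of it:

```python
import numpy as np

P = 113  # the benchmark prime used in the main grokking experiments
pairs = np.array([(x, y) for x in range(P) for y in range(P)])  # every table cell
labels = (pairs[:, 0] + pairs[:, 1]) % P                        # (x + y) mod P

# Shuffle the table, then hold out 30% of the cells as the test set.
rng = np.random.default_rng(0)
idx = rng.permutation(len(pairs))
split = int(0.7 * len(pairs))
train_X, train_y = pairs[idx[:split]], labels[idx[:split]]
test_X, test_y = pairs[idx[split:]], labels[idx[split:]]

print(train_X.shape, test_X.shape)  # (8938, 2) (3831, 2)
```

Because the full dataset is only P × P = 12,769 examples, the model sees every training cell many times over, which is exactly what makes rapid memorization possible.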
The initial results were unsurprising. The model memorized the training data within a few hundred steps. Training accuracy hit 100%. But test accuracy — performance on the held-out examples — stayed near random chance. The model had memorized which input pairs mapped to which outputs without learning any underlying rule.
At this point, most researchers would stop the experiment. The model had converged. The loss was flat. Standard practice says: your model has overfit, move on.
But the researcher left the training running — inadvertently, while on vacation. And when they returned days later, they checked the logs and found something nobody expected.
Somewhere around step 7,000, the test accuracy had jumped from near-zero to 100%. Not gradually. The curve looked like a step function — flat, flat, flat, then perfect. The model had gone from a pure memorizer to a perfect generalizer, all while the researcher was not even watching.
The OpenAI team published these findings in January 2022, and the paper sent shockwaves through the machine learning community. It raised an uncomfortable question: how many models have we stopped training just before they were about to grok?
📐 The Modular Arithmetic Problem
To understand why grokking is remarkable, you need to understand what the model is actually seeing — because it is not seeing "numbers."
Modular arithmetic is clock math. On a 12-hour clock, 10 + 5 = 3, because you wrap around past 12. The same idea works with any modulus P. When P = 113 (the prime number used in the key grokking experiments), you have a clock with 113 hours.
But here is the critical detail: the neural network does not receive the numbers 0 through 112 as numeric values. Instead, each number is represented as a one-hot encoded vector — a list of 113 zeros with a single 1 in the position corresponding to that number.
Input: 47 + 81 = ? (mod 113)

Token "47": [0,0,...,0,1,0,...,0]  ← 114-dim one-hot, 1 at position 47
Token "81": [0,0,...,0,1,0,...,0]  ← 114-dim one-hot, 1 at position 81
Token "=":  [0,0,...,0,0,0,...,1]  ← 114-dim one-hot, 1 at position 113

Total input: 114 × 3 matrix (113 digits + equals sign, 3 tokens)
The model receives three tokens — the first number, the second number, and an equals sign — each represented as a 114-dimensional one-hot vector (113 possible digits plus the equals token). That is a 114 by 3 input matrix.
From the model's perspective, there is no inherent relationship between "47" and "48." They are just two completely different patterns of zeros and ones. The model has no concept of "numbers" or "addition." It must discover the mathematical structure entirely from the patterns in the data.
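Concretely, the encoding can be sketched in a few lines of NumPy. The placement of the "=" token at index 113 is an assumption for illustration; the key point is that each token is an arbitrary basis vector with no numeric meaning:

```python
import numpy as np

P = 113
VOCAB = P + 1   # the 113 digits plus one "=" token
EQUALS = P      # assumption: "=" sits at the last index, 113

def one_hot(token):
    """Encode a token id as a 114-dimensional one-hot vector."""
    v = np.zeros(VOCAB)
    v[token] = 1.0
    return v

# "47 + 81 = ?" becomes three one-hot columns: a 114 x 3 input matrix.
X = np.stack([one_hot(t) for t in (47, 81, EQUALS)], axis=1)

print(X.shape)                        # (114, 3)
print(np.argmax(X, axis=0).tolist())  # [47, 81, 113]
```

Notice that the vectors for 47 and 48 are orthogonal: nothing in this representation hints that they are neighbors on a number line.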
The architecture is a small transformer: an embedding matrix (114 to 128 dimensions), an attention block, an MLP (multi-layer perceptron), and an unembedding layer that maps back to 113 possible outputs.
One-Hot Input (114×3)
│
▼
┌─────────────────┐
│ Embedding │ 114 → 128 dimensions
│ Matrix │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Attention │ Token interactions
│ Block │
└────────┬────────┘
│
▼
┌─────────────────┐
│ MLP │ Where the magic happens
│ (2 layers) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Unembedding │ 128 → 113 outputs
│ Matrix │
└─────────────────┘
│
▼
Answer: 15
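The diagram above can be traced as a shape-level sketch in NumPy. The weights below are random and untrained, so the outputs are meaningless; the point is only the data flow from 114-dimensional one-hots to 113 answer logits. The single-head attention and the 4x MLP expansion are illustrative assumptions, not the exact configuration from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D_MODEL, P = 114, 128, 113

# Random, untrained weights: this sketch only traces shapes and data flow.
W_embed = rng.normal(0, 0.02, (VOCAB, D_MODEL))       # embedding: 114 -> 128
W_q, W_k, W_v = (rng.normal(0, 0.02, (D_MODEL, D_MODEL)) for _ in range(3))
W_mlp1 = rng.normal(0, 0.02, (D_MODEL, 4 * D_MODEL))  # MLP hidden (4x is a guess)
W_mlp2 = rng.normal(0, 0.02, (4 * D_MODEL, D_MODEL))
W_unembed = rng.normal(0, 0.02, (D_MODEL, P))         # unembedding: 128 -> 113

def forward(token_ids):
    x = W_embed[token_ids]                 # (3, 128): three embedded tokens
    # Single-head self-attention over the three tokens.
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(D_MODEL)
    attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    x = x + attn @ v
    # Two-layer MLP with ReLU (the block "where the magic happens").
    x = x + np.maximum(x @ W_mlp1, 0) @ W_mlp2
    # Read logits over the 113 possible answers off the final ("=") position.
    return x[-1] @ W_unembed               # (113,)

logits = forward(np.array([47, 81, 113]))  # encodes "47 + 81 = ?"
print(logits.shape)                        # (113,)
```

Training would adjust these weights by gradient descent with weight decay; everything interesting in the grokking story is about what structure those weights end up containing.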
The question is: what does this model learn internally? A lookup table of memorized answers? Or something far more elegant?
🌊 The Three Phases of Grokking
Careful analysis of grokking reveals three distinct phases, each with radically different internal dynamics. What makes grokking so striking is that standard training metrics cannot distinguish Phase 2 from convergence — the model appears stuck while secretly building a solution.
Phase 1: Memorization (~0-200 steps)
The model rapidly memorizes the training examples. Within about 200 training steps, training accuracy reaches 100%. The model has essentially built an internal lookup table — for each training input pair (X, Y), it has stored the correct answer.
At this point, the model's internal representations show no discernible mathematical structure. If you visualize the neuron activations in the MLP layer, they look like noise. The model treats each input pair as an independent fact to be stored, with no relationship between (3 + 7) and (4 + 6) even though both equal 10 mod 113.
Test performance is at or near chance level. The model has memorized answers, not learned a rule.
Phase 2: Structure Building (~200-7,000 steps)
This is the phase that makes grokking genuinely mysterious. Both training loss and test loss appear completely flat. Standard metrics suggest the model has converged and nothing is changing.
But something is changing.
In early 2023, Neel Nanda and collaborators published a groundbreaking analysis showing exactly what happens during this "dormant" phase. Using a metric called excluded loss — which strips specific frequency components from the model's output before measuring performance — they proved that the model is steadily building trigonometric representations beneath its memorized solution.
Here is what excluded loss reveals: remove the memorization component from the model's output, and you can see a new signal growing stronger step by step. The model is constructing sine and cosine functions of its inputs in the embedding layer, wiring them together through the MLP, and slowly building the machinery for a fundamentally different solution strategy.
The reason standard metrics miss this is simple: the memorized solution works perfectly on the training data and masks the emerging trigonometric solution. It is like watching someone build a new house inside an old house — from the outside, nothing changes until the old walls come down.
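The core move behind excluded loss can be illustrated with synthetic signals. The sketch below is a toy, not the paper's exact metric: the key frequencies are hypothetical, the "memorized" component is stand-in noise, and real excluded loss ablates components of the model's logits before computing cross-entropy. But it shows the idea: zero out the chosen Fourier bins, and whatever the trigonometric solution contributes vanishes, leaving only the memorization component to measure.

```python
import numpy as np

P = 113
z = np.arange(P)
key_freqs = [14, 35, 41]  # hypothetical key frequencies, for illustration only

# The emerging trigonometric solution: a few sparse cosine components.
trig = sum(np.cos(2 * np.pi * k * z / P) for k in key_freqs)
# A stand-in for the memorized solution: structureless noise.
memorized = np.random.default_rng(0).normal(size=P)

logits = memorized + 0.5 * trig  # the model's output mixes both solutions

# "Exclude" the key frequencies by zeroing them in the Fourier domain.
spec = np.fft.rfft(logits)
spec[key_freqs] = 0.0
excluded = np.fft.irfft(spec, n=P)

# Every trace of the trig solution is gone from the excluded signal,
# which is why tracking it over training isolates the memorization part.
print(abs(excluded @ trig) < 1e-6)  # True: orthogonal to the trig solution
```

Applied at every training step, this kind of ablation is what let researchers watch the trigonometric solution grow while the headline loss curves stayed flat.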
Phase 3: Generalization + Cleanup (~7,000+ steps)
The phase transition. Around step 7,000, the trigonometric solution becomes strong enough to compete with the memorized solution. Test accuracy shoots from near-zero to near-perfect in a span of just a few hundred steps.
But there is a second, equally important process: cleanup. After generalization, the model actively removes its memorized representations. The internal lookup table that served it through Phase 1 is dismantled, and the clean trigonometric solution is all that remains.
Performance
      │
 100% ├────■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■  Training
      │                          ┌──■■■■■■  Test
  50% │                          │
      │                          │
   0% ├──■■■■■■■■■■■■■■■■■■■■■■■■┘
      └──┬─────────┬─────────┬────┬─────▶  Steps
         0       2000      5000  7000
      ◄─Phase 1─►◄────Phase 2────►◄─Phase 3─►
       Memorize   Build Structure  Generalize
This three-phase pattern has been replicated across many modular arithmetic tasks and other algorithmic problems. It appears to be a fundamental property of how small neural networks discover mathematical structure — and it raises deep questions about what might be happening inside large language models during their own training.

🎵 The Trigonometric Solution
This is the most extraordinary part of the grokking story. When researchers cracked open the model and examined what it had learned, they did not find a cleverer lookup table or a brute-force approximation. They found that the neural network had independently discovered trigonometry.
What the Embedding Layer Learns
After grokking, the embedding matrix transforms each one-hot input into a 128-dimensional vector. When researchers applied a sparse linear probe — a technique that finds interpretable directions in high-dimensional space — they discovered that the most important components of these embedding vectors were sine and cosine functions of the input values.
For each input number x, the embedding contains strong representations of sin(2πkx/113) and cos(2πkx/113) for specific frequencies k. The model had discovered that circular functions are the natural way to represent numbers on a modular clock.
What the MLP Neurons Compute
The MLP (multi-layer perceptron) is where the computation happens. When researchers plotted the output of individual MLP neurons as a function of the two inputs x and y, they found sweeping sine wave patterns.
Even more revealing: when they plotted pairs of neurons against each other as scatter plots, the data points traced out circles and loops. This is the geometric signature of sine and cosine — the model had organized its neurons into circular representations.
A discrete Fourier transform of the neuron activations confirmed specific dominant frequencies: 8π/113 and 6π/113 appeared as the strongest components. The model had not learned all possible frequencies — it had selected a sparse set of frequencies that were sufficient to solve the problem.
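This Fourier analysis is easy to reproduce on a synthetic signal. In the 2πkx/113 parameterization, the 8π/113 component mentioned above corresponds to k = 4. The snippet below builds an idealized "grokked neuron" (a pure cosine, not a real trained activation) and confirms that the DFT recovers a single dominant frequency:

```python
import numpy as np

P = 113
x = np.arange(P)
k = 4  # the 8*pi/113 component corresponds to k = 4
activation = np.cos(2 * np.pi * k * x / P)  # idealized "grokked neuron"

# A discrete Fourier transform exposes the sparse frequency content.
spectrum = np.abs(np.fft.rfft(activation))
print(int(np.argmax(spectrum)))  # 4: one dominant frequency, the rest near zero
```

Real grokked neurons are noisier mixtures, but their spectra show the same signature: a handful of sharp peaks instead of a broad spread.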
The Sum-of-Angles Identity
Here is where it all comes together. The model needs to compute (x + y) mod 113 — it needs to find the sum of its two inputs. But the MLP's basic operation is multiplication (matrix multiply followed by nonlinearity). How do you convert products into sums?
The answer is a trigonometric identity that every calculus student has seen:
cos(x + y) = cos(x) · cos(y) - sin(x) · sin(y)
This is the sum-of-angles identity. It converts the sum (x + y) — which is what the model needs — into a combination of products of cos(x), cos(y), sin(x), and sin(y) — which are exactly the representations the embedding layer has built.
The Model's Solution (decoded):
Step 1: Embed → sin(kx), cos(kx), sin(ky), cos(ky)
Step 2: Products → cos(kx)·cos(ky), sin(kx)·sin(ky)
Step 3: Identity → cos(kx)·cos(ky) - sin(kx)·sin(ky) = cos(k(x+y))
Step 4: Decode → "For which answer z does cos(k(x+y)) peak?"
The model computes cos(kx) · cos(ky) as the strongest component of certain MLP neurons. It then combines these with the sin products via the sum-of-angles identity to produce cos(k(x + y)) — a function that depends only on the sum of x and y, which is exactly the quantity it needs to compute.
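The decoded pipeline above can be checked numerically. This sketch uses k = 4 (matching the 8π/113 component; any frequency works) and the article's running example 47 + 81. It verifies that products of sines and cosines collapse into a single cosine of the sum, and that reducing the sum mod 113 leaves that cosine unchanged, since the angle shifts by an exact multiple of 2π:

```python
import math

P, k = 113, 4  # modulus and an illustrative key frequency
x, y = 47, 81  # the running example; (47 + 81) mod 113 = 15

# Products of the embedded sin/cos features...
prod = (math.cos(2 * math.pi * k * x / P) * math.cos(2 * math.pi * k * y / P)
        - math.sin(2 * math.pi * k * x / P) * math.sin(2 * math.pi * k * y / P))

# ...equal a single cosine of the sum (the sum-of-angles identity)...
direct = math.cos(2 * math.pi * k * (x + y) / P)

# ...and the cosine cannot tell x + y = 128 apart from 128 mod 113 = 15,
# because the two angles differ by exactly 2*pi*k.
reduced = math.cos(2 * math.pi * k * ((x + y) % P) / P)

print(abs(prod - direct) < 1e-12, abs(direct - reduced) < 1e-9)  # True True
```

The last line is the crucial trick: the circular representation handles the "mod" for free, which is why trigonometry is such a natural fit for clock math.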
Diagonal Symmetry
The most visually striking evidence comes from plotting neuron activations as a heat map over all possible (x, y) pairs. After grokking, individual neurons show diagonal stripe patterns: a neuron fires strongly for all input pairs where x + y equals the same value (modulo 113).
          y
         0  1  2  3  4
       ┌──┬──┬──┬──┬──┐
     0 │0 │1 │2 │3 │4 │  ← Diagonal lines =
     1 │1 │2 │3 │4 │0 │    same sum (mod 5)
  x  2 │2 │3 │4 │0 │1 │
     3 │3 │4 │0 │1 │2 │  The model learns to
     4 │4 │0 │1 │2 │3 │  fire along these
       └──┴──┴──┴──┴──┘  diagonals!
For the actual experiment with mod 113, a neuron might fire for all (x, y) pairs where x + y = 65 (mod 113). That includes (0, 65), (1, 64), (2, 63), and so on — but also (100, 78), because 100 + 78 = 178, and 178 mod 113 = 65. The neuron fires along a diagonal that wraps around the grid, which is exactly what you would expect from a circular (trigonometric) representation.
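The wrap-around claim is straightforward to verify. Using the same illustrative frequency k = 4, the circular feature a "sum = 65" neuron would respond to takes the same value for (0, 65), (1, 64), and the wrap-around pair (100, 78):

```python
import math

P, k = 113, 4  # modulus and an illustrative key frequency

def feature(x, y):
    """The circular feature a 'sum = const' neuron would respond to."""
    return math.cos(2 * math.pi * k * (x + y) / P)

# (0, 65), (1, 64), and the wrap-around pair (100, 78) all sum to 65 mod 113,
# so the circular feature is identical for each of them.
assert (100 + 78) % P == 65
print(math.isclose(feature(0, 65), feature(1, 64), abs_tol=1e-9))    # True
print(math.isclose(feature(0, 65), feature(100, 78), abs_tol=1e-9))  # True
```

A lookup table would have to store the wrap-around pairs separately; the circular feature gets them for free.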
This is a neural network that was given nothing but random-looking patterns of ones and zeros. It received no hint that the numbers it was working with had any circular structure. And yet it independently discovered:
- That circular functions are the right representation
- That specific frequencies capture the necessary information
- That the sum-of-angles identity converts products into sums
- That diagonal symmetry over the input grid solves the problem
As Nanda's analysis concluded, grokking gives us a transparent box in a world of black boxes — a case where we can fully trace how a neural network solves a problem, from input to output, with no hidden mystery.

🔍 Why Grokking Matters
Grokking is a fundamental phenomenon that reveals hidden dynamics of how neural networks learn, with consequences that stretch from training methodology to AI safety.
Hidden Learning Is Real
The most important lesson of grokking is that models can appear to have stopped learning while secretly developing new capabilities. During Phase 2, every standard metric — training loss, test loss, validation accuracy — shows no improvement. Any reasonable practitioner would conclude the model has converged.
But the model is not converged. It is actively building a fundamentally different solution strategy. The only way to detect this hidden learning is with specialized probes like excluded loss or mechanistic analysis of internal representations.
This has immediate practical implications: we may be stopping training too early on many models. If grokking-like dynamics occur in larger networks (and there is growing evidence they do), we could be leaving significant generalization performance on the table by using standard early stopping criteria.
Implications for AI Safety
For the AI safety community, grokking is a cautionary tale. If a model can develop hidden capabilities that are invisible to standard evaluation during training, then our current safety evaluation methods may be fundamentally insufficient.
Consider the alignment scenario: a model might appear to behave as intended during evaluation — all benchmarks look good, all safety filters pass — while internally developing representations that could lead to unexpected behavior. Grokking proves that this is not a theoretical concern. It happens in practice, in simple models, on simple tasks.
The phenomenon also connects to ongoing work in mechanistic interpretability — the effort to understand what neural networks are computing internally, rather than evaluating them only by their outputs. Grokking models are valuable test cases because we can verify mechanistic explanations against the known trigonometric solution.
Emergent Capabilities in Large Models
Researchers have observed grokking-like phenomena in larger models and more complex tasks. The sudden appearance of new capabilities in large language models — what the field calls "emergent abilities" — may share underlying mechanisms with grokking in small transformers.
When a model suddenly becomes capable of chain-of-thought reasoning, or suddenly learns to follow instructions, or suddenly develops the ability to do arithmetic at scale — these phase transitions echo the sudden generalization in grokking. The connection remains an active area of research, but grokking provides a concrete, interpretable example of how such sudden shifts can occur.
The broader question for AI development is whether we can predict when these transitions will happen, and whether we can design training procedures that encourage them rather than leaving them to chance.
🤖 From Grokking to Workspace Intelligence
Grokking demonstrates that AI systems can develop deep understanding from exposure to data — moving from surface-level pattern matching to discovering the fundamental structure of a problem. This principle resonates with how modern AI agents develop contextual intelligence in practical applications.
Taskade AI agents embody a similar philosophy of progressive understanding. When you train an agent on your workspace data — documents, conversations, project histories, and team workflows — it does not just memorize keywords. Through persistent memory and knowledge training, agents build progressively deeper representations of how your team operates.
Here is the connection:
- Grokking models go from memorizing (X + Y) mod 113 to discovering trigonometric identities
- Taskade agents go from matching keywords to understanding workflow context — knowing that a "sprint review" means different things to your engineering team versus your marketing team
This progression is powered by Workspace DNA: Memory (persistent context from projects and documents), Intelligence (11+ frontier models from OpenAI, Anthropic, and Google), and Execution (100+ integrations and automated workflows that act on insights).
What you can build today:
- Custom AI agents with 22+ built-in tools, persistent memory, and slash commands that understand your domain
- Automated workflows that trigger based on context — not just rules — using Temporal durable execution
- Genesis Apps that turn prompts into live dashboards, portals, and tools your team can use immediately
- Multi-agent teams where specialized agents collaborate on complex tasks, each with their own knowledge base
The same way grokking reveals that small models can discover profound mathematical truths, workspace AI reveals that agents with the right data and architecture can develop genuine operational intelligence. Try building your first AI agent →
🔮 The Bigger Picture
From a word in a 1961 science fiction novel about a Martian who understood things so deeply he could make them disappear, "grokking" has become one of the most important concepts in modern AI research.
The phenomenon reminds us that artificial intelligence is genuinely strange. A neural network with no concept of trigonometry, trained on nothing but patterns of zeros and ones, independently discovers sine waves, Fourier analysis, and the sum-of-angles identity. It finds the same solution that took human mathematicians centuries to develop — and it finds it by accident, while its trainer was on vacation.
Karpathy's observation keeps proving true: "Training LLMs is less like building animal intelligence and more like summoning ghosts." These intelligences are alien. They do not think the way we think. They find solutions we would not consider, using representations we can barely interpret.
But grokking also offers hope. It is one of the few cases in all of AI where we can fully understand what a neural network is doing — a transparent box in a world of black boxes. As the field of mechanistic interpretability grows, grokking provides both inspiration and methodology. If we can understand grokking, maybe we can understand the rest.
The next time you watch a training loss curve flatten out and think "time to stop," remember: the model might be about to grok. 🧪
❓ Frequently Asked Questions
What is grokking in AI?
Grokking is a phenomenon in neural network training where a model suddenly transitions from memorizing its training data to truly generalizing — understanding the underlying mathematical pattern. It typically occurs long after the model has perfectly memorized the training set, during a period when standard training metrics show no improvement. The term was coined by OpenAI researchers in 2022, borrowing from Robert Heinlein's 1961 novel Stranger in a Strange Land.
How is grokking different from normal learning?
In normal machine learning, training and test performance improve together — as the model learns the training data, it simultaneously gets better at unseen examples. In grokking, the model first memorizes the training data (with no test improvement), then appears to stagnate for thousands of steps, and finally achieves sudden perfect generalization. The key difference is the delayed phase transition between memorization and understanding.
Why did the model discover trigonometry?
The model was not taught trigonometry or given any mathematical hints. It discovered trigonometric representations because circular functions are the natural solution to modular arithmetic. Modular arithmetic is inherently circular — numbers "wrap around" after reaching the modulus, just like hours on a clock. Sine and cosine functions are the mathematical tools that describe circular behavior. The model found the most efficient solution through gradient descent, and that solution happened to be trigonometry.
Can grokking happen in large language models?
Evidence suggests that grokking-like dynamics occur in larger models and more complex tasks. The sudden emergence of new capabilities in large language models — such as chain-of-thought reasoning or instruction following — may share underlying mechanisms with grokking. However, the clean three-phase pattern is most clearly demonstrated in small models on algorithmic tasks, where the full internal computation can be analyzed.
What is excluded loss and why does it matter?
Excluded loss is a diagnostic metric that removes specific frequency components from a model's output before measuring performance. It was developed by Neel Nanda and collaborators to reveal hidden learning during Phase 2 of grokking. Standard loss metrics cannot detect the trigonometric solution being built because the memorized solution masks it. Excluded loss strips away the memorization component, revealing steady progress toward generalization even when the model appears stuck.
What does grokking mean for AI training practices?
Grokking suggests that early stopping — halting training when metrics plateau — may cause us to miss significant generalization improvements. It also suggests that weight decay and regularization play important roles in encouraging the transition from memorization to generalization. Stronger regularization tends to speed up grokking, while weaker regularization delays it.
How does grokking connect to mechanistic interpretability?
Grokking is a cornerstone case study in mechanistic interpretability — the field of reverse-engineering neural networks to understand their internal computations. Because the grokked solution (trigonometric identities for modular arithmetic) is mathematically clean and fully understood, researchers can verify their interpretability techniques against a known ground truth. This makes grokking models invaluable testbeds for developing tools that might eventually explain frontier models.
How can I experiment with grokking myself?
You can reproduce grokking with relatively modest compute. Train a small transformer (1-2 layers, 128-dimensional embeddings) on modular addition with a prime modulus (P = 113 is the standard benchmark). Use about 70% of the complete dataset for training, apply weight decay regularization, and train for at least 10,000 steps. Monitor both training and test accuracy — you should see the characteristic flat period followed by sudden generalization.
🚀 Build AI That Understands Your Work
Grokking shows that neural networks can achieve deep understanding — not just memorization. Taskade AI agents bring that same principle to your workspace.
- ✅ Custom AI agents with persistent memory and 22+ built-in tools
- ✅ Knowledge training — agents learn from your docs, projects, and team data
- ✅ Multi-model support — 11+ frontier models from OpenAI, Anthropic, and Google
- ✅ Automated workflows with 100+ integrations
- ✅ Genesis Apps — build live tools from prompts, deploy instantly
👉 Start building with Taskade AI agents →
💡 Before you go... Check out these related articles:
- What Are AI Agents? — How autonomous agents plan, reason, and act
- How Do LLMs Work? — Transformers, training, and inference explained
- What Is Mechanistic Interpretability? — Reverse-engineering neural networks
- What Is Generative AI? — The technology behind modern AI
- What Is OpenAI? — History of ChatGPT and GPT models
- What Is Anthropic? — History of Claude AI
- Agentic Workspaces — AI-powered workspace intelligence
- From Bronx Science to Taskade Genesis — Connecting the dots of AI history
- They Generate Code. We Generate Runtime — The Genesis Manifesto
- The BFF Experiment — From Noise to Life
- What Is Artificial Life? — How intelligence emerges from code
- What Is Intelligence? — From neurons to AI agents
- Explore Taskade Community — Templates, agents, and workflows




