Scaling Laws


Definition: Scaling laws are empirical power-law relationships that predict how a neural network's performance improves as you scale up its parameters, training data, and compute. Discovered and formalized by OpenAI's Kaplan et al. in 2020 and refined by DeepMind's Hoffmann et al. in 2022 (the Chinchilla paper), scaling laws are the reason the AI industry spent 2021–2024 pouring billions of dollars into ever-larger models: the math said it would work.

The scaling-laws era has evolved twice since then. The pretraining scaling era (2020–2023) was about parameter and dataset size. The inference-compute scaling era (2024–2026) is about reasoning tokens at test time. Both follow power laws. Both are now exhausted in some directions and wide open in others.

Why Scaling Laws Changed Everything

Before scaling laws, model choices felt like art. Some architectures worked; others did not. Compute budgets were allocated based on intuition. The Kaplan paper showed that for a fixed architecture class, loss decreases predictably as parameters N, dataset size D, and training compute C increase:

Loss ≈ L_∞ + α · N^(−β_N) + γ · D^(−β_D) + δ · C^(−β_C)

Translation: if you double the compute, you can predict the loss improvement. If you 10x the parameters, you can predict the loss improvement. The predictions were so accurate that labs began betting strategy on them: OpenAI's GPT-3 at 175B parameters, PaLM at 540B, and every follow-up were planned from scaling curves extrapolated into the future.
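The additive power-law form above can be sketched in a few lines of Python. The coefficient and exponent values below are illustrative placeholders (loosely echoing published Chinchilla-style fits), not the actual fitted constants from Kaplan et al.:

```python
def predicted_loss(n_params, n_tokens, flops,
                   l_inf=1.69,                 # irreducible loss (assumed)
                   alpha=406.4, beta_n=0.34,   # parameter term (assumed)
                   gamma=410.7, beta_d=0.28,   # data term (assumed)
                   delta=1.0, beta_c=0.05):    # compute term (assumed)
    """Loss ~ L_inf + alpha*N^-beta_N + gamma*D^-beta_D + delta*C^-beta_C."""
    return (l_inf
            + alpha * n_params ** -beta_n
            + gamma * n_tokens ** -beta_d
            + delta * flops ** -beta_c)

# Doubling parameters shrinks the parameter term by a fixed factor of
# 2**-beta_N, which is why loss improvements are predictable from the
# curve alone: the other terms are untouched.
small = predicted_loss(1e9, 1e11, 1e21)
large = predicted_loss(2e9, 1e11, 1e21)
assert large < small
```

The key property is that each resource contributes an independent, smoothly decaying term, so extrapolating any one axis is just evaluating a power law further out.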

The Chinchilla Correction (2022)

DeepMind's Chinchilla paper rewrote the playbook. Kaplan had concluded that parameters mattered more than data. Chinchilla showed the opposite: for a fixed compute budget, you should train a smaller model on more data. The optimal ratio is roughly 20 tokens of training data per model parameter.

| Model | Parameters | Training Tokens | Tokens per Parameter |
| --- | --- | --- | --- |
| GPT-3 | 175B | 300B | 1.7 |
| Gopher | 280B | 300B | 1.1 |
| Chinchilla | 70B | 1.4T | 20.0 |
| LLaMA-2 | 70B | 2.0T | 28.6 |
| LLaMA-3 | 70B | 15T | 214 |

Chinchilla-optimal became the default recipe: fewer parameters, vastly more data. LLaMA-3 went well past Chinchilla-optimal because training is a one-time cost and inference is a per-query cost: if you can squeeze more capability into fewer parameters, every future inference call is cheaper.
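The 20-tokens-per-parameter rule can be turned into a compute allocator. This sketch uses the common approximation that training compute C ≈ 6·N·D FLOPs (an assumption, not stated in this article), then splits a budget accordingly:

```python
import math

def chinchilla_optimal(train_flops, tokens_per_param=20.0):
    """Split a training compute budget into parameters N and tokens D.

    Assumes the common approximation C ~ 6 * N * D FLOPs, plus the
    roughly-20-tokens-per-parameter Chinchilla rule of thumb.
    """
    # C = 6 * N * (ratio * N)  =>  N = sqrt(C / (6 * ratio))
    n_params = math.sqrt(train_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A budget in the ballpark of Chinchilla's own run recovers roughly a
# 70B-parameter model trained on roughly 1.4T tokens.
n, d = chinchilla_optimal(5.76e23)
print(f"{n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")
```

Raising `tokens_per_param` past 20 models the over-Chinchilla regime: you deliberately "overspend" on data to buy a smaller, cheaper-to-serve model.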

Pre-Chinchilla (2020–2022): more params, bigger model. Post-Chinchilla (2022–2024): balanced params + data; smaller, stronger. Over-Chinchilla (2024+): train past the optimum for faster inference at the same quality.

The Three Scaling Regimes

The 2026 picture has three layered scaling regimes, each with its own power law:

1. Pretraining Scale

More parameters + more data + more FLOPs → lower loss. Still improving but subject to diminishing returns. The frontier is now in the tens of trillions of training tokens, and the data wall (there is a finite amount of high-quality text in the world) is biting hard.

2. Post-Training Scale

After pretraining, RLHF, DPO, constitutional AI, and RL on verifiable rewards (math, code) all extract additional capability from the same base. This is where most 2024–2026 progress actually came from. DeepSeek R1 and OpenAI o1 showed that a small amount of post-training RL on reasoning traces can match much larger base models.

3. Inference-Time Compute

The newest regime. Instead of making the model bigger, let it think longer at inference time. Every additional chain-of-thought token is a unit of compute spent on the current problem. OpenAI's o1 demonstrated that inference compute scales capability on reasoning benchmarks at a predictable power-law rate, with the same slope as pretraining compute scaling.

Capability gain per 10× inference compute ≈ capability gain per 10× pretraining compute

This is the single biggest structural shift in AI since transformers. It moved the scaling story from "we need bigger training clusters" to "we need faster inference with longer thinking budgets."
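A log-linear capability curve of this kind is easy to sketch. The baseline accuracy and gain-per-10× values below are made-up illustration numbers, not o1's published results:

```python
import math

def reasoning_accuracy(thinking_tokens, base=0.30, gain_per_10x=0.12):
    """Toy inference-compute power law: each 10x of thinking tokens adds
    a fixed accuracy increment, capped at 1.0 (the log-linear regime
    eventually saturates)."""
    return min(1.0, base + gain_per_10x * math.log10(thinking_tokens))

# Accuracy climbs by the same step for every order of magnitude of
# thinking budget -- a straight line on a log-x plot.
for budget in (1e2, 1e3, 1e4, 1e5):
    print(f"{budget:>8.0f} tokens -> {reasoning_accuracy(budget):.2f}")
```

The practical consequence is the one the article names: a serving stack that can afford 10× more thinking tokens buys the same kind of gain that 10× more pretraining compute used to buy.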

What Scaling Laws Predict

| Resource | Scaling Effect | Diminishing Returns? |
| --- | --- | --- |
| Parameters | Loss decreases as a power law | Substantial past 10–30B for text |
| Training tokens | Loss decreases as a power law | Subtle past ~200× parameters |
| Training compute | Sets the optimal mix of N and D | Bounded by wall-clock time |
| Inference compute | Reasoning quality scales | New regime, not exhausted |
| RL post-training | Unlocks latent capability | Saturates for each ability |
| Multimodal data | Cross-modal transfer | Early regime |

The practical upshot: bigger is not always better, longer-thinking-at-inference is the new frontier, and post-training remains underexploited. A 70B model trained on Chinchilla-compliant data, given a generous inference-time thinking budget, often beats a 500B Chinchilla-violating model giving one-shot answers.

Scaling Laws and Product Strategy

Scaling laws decide what products are possible:

  • If inference-time scaling holds, long-thinking agents are economical for hard tasks.
  • If pretraining is nearing saturation, the leading labs' advantage compresses.
  • If post-training RL dominates, customization becomes the moat.
  • If data is the binding constraint, synthetic-data generation is the unlock.

Taskade's platform is built on the assumption that inference economics will continue improving faster than training capabilities, which is why Genesis auto-routes across 11+ frontier models, pays per credit rather than per model, and uses longer-thinking reasoning modes for complex builds. The models underneath keep improving. The credits stay the same.

Frequently Asked Questions About Scaling Laws

What are scaling laws in AI?

Scaling laws are empirical power-law relationships that predict how a neural network's loss decreases as parameters, training data, and compute increase. They let labs plan multi-year roadmaps from extrapolated curves.

What is Chinchilla-optimal?

Chinchilla-optimal is the DeepMind finding that for a fixed compute budget, you should train a smaller model on more data, roughly 20 tokens per parameter. It replaced the Kaplan-era belief that more parameters were better.

Are scaling laws still holding in 2026?

Pretraining scaling is slowing as data quality becomes the bottleneck. Post-training and inference-time scaling are still improving rapidly. The frontier has shifted from "make the model bigger" to "let the model think longer."

What is inference-time scaling?

Inference-time scaling is the power law that relates reasoning tokens spent at inference to capability gains. OpenAI o1 and DeepSeek R1 demonstrated that long chain-of-thought reasoning scales capability at roughly the same slope as pretraining compute.

How do scaling laws affect Taskade?

Taskade auto-routes between frontier models and charges in credits, so as the underlying models improve via scaling-law advances, users get the benefit automatically. Business and Max plans trade off request burst vs credit burst, giving users control over how much inference-time thinking they want for Genesis builds.

Further Reading