In 2020 and 2021, the AI world had a simple mantra: bigger is better. GPT-3 had 175 billion parameters. Google's PaLM would push to 540 billion. Labs were racing to train the largest model they could afford, often spending tens of millions of dollars on a single training run.
There was just one problem. They were all doing it wrong.
In March 2022, a team at DeepMind published a paper that sent shockwaves through the AI industry. Their finding was as simple as it was devastating: virtually every large language model in existence had been significantly undertrained. These models had too many parameters and were trained on far too little data.
The proof? A model called Chinchilla — with just 70 billion parameters, a quarter the size of DeepMind's own Gopher (280B) — outperformed Gopher on nearly every benchmark. It used the exact same compute budget. The only difference was how that compute was allocated: fewer parameters, more data.
This wasn't a marginal improvement. It was a paradigm shift. Let's start with a question to see what your intuition says.
🎯 Interactive — Prediction Challenge
You have a fixed compute budget of 5.76 × 10²³ FLOPs. How would you allocate it? Move the slider to choose your split between model size and training data.
Training Data: 550B tokens
Move the slider to see the estimated loss
The compute-optimal point is not where most labs put it in 2021. Can you find the sweet spot?
II. Before Chinchilla: The Scaling Gold Rush
To understand why the Chinchilla paper mattered, we need to rewind. The year is 2020. OpenAI has just released GPT-3, and the world is stunned. The model can write poetry, code, and even do basic math — all from a single prompt. The message was clear: scale up the parameters, and intelligence follows.
This belief wasn't unfounded. In January 2020, Kaplan et al. at OpenAI published "Scaling Laws for Neural Language Models" — a landmark paper that showed smooth, predictable improvements in model performance as you increased three things: parameters (N), data (D), and compute (C).
But Kaplan's paper had a particular conclusion that shaped the entire industry's strategy: it suggested that model performance was more sensitive to the number of parameters than to the amount of training data. The implication? If you had a bigger compute budget, you should mostly spend it on a bigger model.
And so began the great parameter arms race.
📅 Interactive — The Parameter Arms Race
Click on each milestone to see how the race unfolded. Notice how parameters grew much faster than training data.
Feb 2019
GPT-2 — 1.5B parameters
Trained on ~40B tokens (WebText). OpenAI famously withheld release, fearing misuse. The data-to-params ratio: ~27 tokens per parameter.
Jan 2020
Kaplan Scaling Laws published
Suggested scaling parameters faster than data. This single paper influenced compute allocation at every major lab for the next two years.
Jun 2020
GPT-3 — 175B parameters
Trained on ~300B tokens. Ratio: ~1.7 tokens per parameter. A 100× jump in parameters, but only ~7.5× more data. The pattern was set.
Dec 2021
Gopher — 280B parameters
DeepMind's flagship. Trained on 300B tokens. Ratio: ~1.1 tokens per parameter. Strong performance, but was it optimal?
Jan 2022
Megatron-Turing NLG — 530B parameters
NVIDIA + Microsoft's entry. Trained on ~270B tokens. Ratio: ~0.5 tokens per parameter. Even more lopsided toward parameters.
Mar 2022
Chinchilla — 70B parameters ✨
Trained on 1.4 TRILLION tokens. Ratio: ~20 tokens per parameter. Used the same compute as Gopher but allocated it radically differently. Outperformed everything.
Notice the tokens-per-parameter ratio. Pre-Chinchilla models were heavily skewed toward parameters. Chinchilla inverted the strategy.
The prevailing wisdom was essentially: "We have X FLOPs. Let's make the model as big as possible, train it for one epoch on whatever data we have, and call it done." The Chinchilla paper showed this was leaving enormous performance on the table.
III. The Core Insight: Equal Scaling
Here is the central claim of the Chinchilla paper, stated as plainly as possible:
Key Finding
For compute-optimal training, the number of model parameters and the number of training tokens should be scaled in equal proportion as the compute budget increases.
Let's unpack what this means. If you double your compute budget, you shouldn't just double the model size. You should increase both the model size and the training data by roughly the same factor — about 1.4× each (since 1.4 × 1.4 ≈ 2).
The paper proposes that the optimal number of training tokens should be approximately 20× the number of parameters. A 10B parameter model should see ~200B tokens. A 70B model should see ~1.4T tokens. A 500B model would need ~10T tokens.
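The 20× rule pairs neatly with the compute approximation C ≈ 6·N·D (discussed later in the article): substituting D = 20·N gives C ≈ 120·N². Here's a minimal Python sketch of that rule of thumb, which recovers almost exactly Chinchilla's own 70B/1.4T split from the shared compute budget:

```python
import math

def chinchilla_optimal(compute_flops):
    """Split a compute budget into model size and token count using the
    ~20 tokens-per-parameter rule of thumb from the Chinchilla paper.

    Combining C ≈ 6·N·D with D ≈ 20·N gives C ≈ 120·N², so N = sqrt(C/120).
    """
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# The Gopher/Chinchilla budget quoted in the article
n, d = chinchilla_optimal(5.76e23)
print(f"params: {n / 1e9:.0f}B, tokens: {d / 1e12:.2f}T")  # ≈ 69B / 1.39T
```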
This was a radical departure from Kaplan's earlier findings, which suggested parameters should grow faster than data. Let's visualize what this means in practice.
⚖️ Interactive — The Equal Scaling Rule
Use the slider to increase compute budget and watch how the optimal allocation of parameters and data should grow together.
Optimal Parameters
400M
Optimal Tokens
8.0B
Tokens-per-parameter ratio: ~20×
Both bars should grow at roughly the same rate. The optimal tokens-per-parameter ratio stays near 20× regardless of scale.
Think of it like cooking. Kaplan's earlier work was like saying: "If you want a better meal, just get a bigger oven." The Chinchilla paper says: "Actually, a slightly smaller oven with more ingredients makes a much better dish." The oven is your model; the ingredients are your data.
IV. Three Approaches, One Answer
One of the most compelling aspects of the Chinchilla paper is that the authors didn't rely on a single methodology. They used three completely independent approaches to derive the optimal scaling law — and all three converged on the same answer. That's the kind of evidence that makes a paper hard to argue with.
🔬 Interactive — Three Paths to One Truth
Step through each approach to see how they derived the optimal scaling.
Approach 1: Fix Compute, Vary Allocation
For each of several fixed compute budgets (from 10¹⁸ to 10²¹ FLOPs), they trained over 400 models ranging from 70M to 16B parameters.
For each budget level, they varied how much was allocated to model size vs. training data. Then they found which allocation produced the lowest loss — the "valley" of the loss curve.
Each compute level produces a U-shaped curve. Too few parameters and the model is too small to absorb all that data; too many and each parameter is left undertrained.
Approach 2: Parametric Loss Function
They fitted a parametric model of the loss function as a function of both N (parameters) and D (data):
Parametric Loss Model
L(N, D) = E + A/N^α + B/D^β
Where E is the irreducible loss (entropy of natural language), and the other terms capture how loss decreases with more parameters and data. By fitting α and β from training runs, they could predict the optimal allocation for any compute budget.
Result: α ≈ 0.34, β ≈ 0.28 — remarkably similar exponents, confirming that parameters and data matter roughly equally.
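To make the parametric fit concrete, here is a small Python sketch using the fitted constants reported in the paper (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7); treat the exact values as approximate. Evaluating it at the Gopher-style and Chinchilla-style allocations predicts the data-heavy split wins:

```python
# Fitted constants reported by Hoffmann et al. (2022); treat as approximate.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def predicted_loss(n_params, n_tokens):
    """Predicted pre-training loss for a model with N parameters
    trained on D tokens, per the paper's parametric fit."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Same ~5.76e23 FLOP budget, two different allocations:
gopher_style = predicted_loss(280e9, 300e9)      # parameter-heavy
chinchilla_style = predicted_loss(70e9, 1.4e12)  # data-heavy
print(f"Gopher-style: {gopher_style:.3f}  Chinchilla-style: {chinchilla_style:.3f}")
```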
Approach 3: Direct Fit of Optimal N and D
The simplest approach: take all the optimal (N, D) pairs found from Approach 1, plot them against compute, and fit a power law directly.
They found:
Optimal Scaling
Nₒₚₜ ∝ C⁰·⁵⁰ and Dₒₚₜ ∝ C⁰·⁵⁰
Both scale as the square root of compute — meaning they grow at exactly the same rate. Double compute → ~1.41× more parameters AND ~1.41× more data.
✓ All three approaches agree: scale parameters and data equally.
Three independent methodologies. One consistent answer. That's how you move an entire industry.
The convergence of all three approaches is what gave the paper its weight. Any one method might have had flaws or biases, but when three fundamentally different analytical strategies agree, you can be pretty confident in the result.
V. The Showdown: Chinchilla vs. Gopher
Theory is nice. But nothing convinces like a head-to-head battle. The Chinchilla paper didn't just propose a new scaling law — it built a model to prove it.
Gopher was DeepMind's state-of-the-art LLM: 280 billion parameters, trained on 300 billion tokens, using approximately 5.76 × 10²³ FLOPs of compute.
Chinchilla used the exact same compute budget. But instead of 280B parameters, it had just 70B — a quarter the size. The saved compute was redirected into training on 4.7× more data: 1.4 trillion tokens.
The result? Chinchilla uniformly outperformed Gopher. Not on one benchmark. On virtually all of them.
🏆 Interactive — Benchmark Battle
Click each benchmark to see how Chinchilla (70B) compared to Gopher (280B). Green = Chinchilla wins.
Click a benchmark above to see details
Same compute budget. 4× fewer parameters. Better results on every task. That's the power of compute-optimal training.
This wasn't just a theoretical exercise. It had immediate practical implications: a model 4× smaller is 4× cheaper to serve. Inference costs — the cost of actually running the model for users — scale roughly linearly with parameter count. Chinchilla delivered better performance and cost less to deploy.
Why This Matters for Deployment
A 70B model requires roughly 140 GB of memory in fp16, while 280B requires ~560 GB. That's the difference between running on a single server vs. needing a multi-node setup. Chinchilla wasn't just smarter — it was dramatically more practical.
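The arithmetic behind that comparison is just parameter count times bytes per parameter; a quick sketch, counting weights only (activations, KV cache, and optimizer state are ignored):

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Memory for model weights alone (fp16/bf16 = 2 bytes per parameter).
    Ignores KV cache, activations, and any optimizer state."""
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(70e9))   # 140.0 (GB)
print(weight_memory_gb(280e9))  # 560.0 (GB)
```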
VI. The Math Behind Compute-Optimal Scaling
Let's get into the equations — but gently. The core math is surprisingly elegant.
The total compute C for training a transformer is approximately:
Compute Approximation
C ≈ 6 × N × D
Where N is the number of parameters and D is the number of training tokens. The factor of 6 comes from the forward and backward passes through the network (roughly 2 FLOPs per parameter per token for the forward pass, and 4 for backward).
The Chinchilla paper's key insight is that the loss function has a specific shape. If you fix compute C, there's a unique optimal pair (N*, D*) that minimizes loss, and that pair follows:
Optimal Allocation
N* ∝ C⁰·⁵⁰ and D* ∝ C⁰·⁵⁰
Both scale with the same exponent (0.50), confirming that parameters and data should grow at the same rate. The ratio D/N works out to roughly 20, meaning about 20 tokens per parameter.
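Plugging the two models into the C ≈ 6ND formula shows both allocations sitting near the ~5.76 × 10²³ FLOP budget quoted earlier (the 6ND rule ignores attention and embedding FLOPs, so the match is approximate):

```python
def train_flops(n_params, n_tokens):
    """Standard approximation: ~6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

print(f"Gopher:     {train_flops(280e9, 300e9):.2e} FLOPs")   # ~5.04e+23
print(f"Chinchilla: {train_flops(70e9, 1.4e12):.2e} FLOPs")   # ~5.88e+23
```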
🧮 Interactive — Chinchilla Compute Calculator
Enter your compute budget (in FLOPs) and see the Chinchilla-optimal allocation. Or enter a model size to see how much data and compute you need.
FLOPs
billion
Optimal Parameters
67B
Optimal Tokens
1.4T
Total Compute
10²³
Try plugging in famous models: GPT-3 had 175B params but only 300B tokens. According to Chinchilla, it should have had ~3.5T tokens!
Here's a fun exercise: plug in GPT-3's 175B parameters. According to the Chinchilla scaling law, it should have been trained on roughly 3.5 trillion tokens. It was actually trained on 300 billion — about 12× too few. That's a lot of wasted potential.
VII. Kaplan vs. Hoffmann: Dueling Scaling Laws
The Chinchilla paper didn't just present new results — it directly contradicted the most influential scaling paper in AI. The Kaplan et al. (2020) paper from OpenAI had argued for very different optimal scaling ratios.
Kaplan found that parameters should scale roughly 3× faster than data (N ∝ C⁰·⁷³ vs. D ∝ C⁰·²⁷). Double your compute? Make the model about 1.7× bigger but train on only ~1.2× more tokens. This led to the "bigger model, same data" approach that dominated the field.
Hoffmann et al. found the opposite: parameters and data should scale equally. So what went wrong with Kaplan's analysis?
⚔️ Interactive — Dueling Scaling Laws
Use the slider to increase compute budget and see how the two scaling laws prescribe very different allocations.
Kaplan (2020) says:
Parameters: 175B
Tokens: 300B
Chinchilla (2022) says:
Parameters: 175B
Tokens: 3.5T
At 1× GPT-3 compute, Kaplan would build GPT-3 (175B parameters, 300B tokens). Chinchilla would build a ~50B model trained on ~1T tokens.
As compute grows, the gap between the two prescriptions becomes enormous. Kaplan would build ever-larger undertrained giants.
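A small sketch makes the divergence concrete. The exponents come from the two papers; the reference point where the laws agree (3B parameters at 10²¹ FLOPs) is an assumption chosen purely for illustration:

```python
# Exponents come from the two papers; the reference point where the laws
# agree (3B parameters at 1e21 FLOPs) is an assumption for illustration.
C_REF, N_REF = 1e21, 3e9

def kaplan_params(c):
    return N_REF * (c / C_REF) ** 0.73   # N ∝ C^0.73

def hoffmann_params(c):
    return N_REF * (c / C_REF) ** 0.50   # N ∝ C^0.50

# By 1e25 FLOPs, Kaplan prescribes a trillion-scale model while
# Hoffmann prescribes a few hundred billion parameters.
for c in (1e21, 1e23, 1e25):
    print(f"C = {c:.0e}: Kaplan {kaplan_params(c) / 1e9:,.0f}B   "
          f"Hoffmann {hoffmann_params(c) / 1e9:,.0f}B")
```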
The key methodological differences that led Kaplan astray:
1. Learning rate schedules. Kaplan's experiments used a fixed learning rate schedule that wasn't adjusted to match the number of training tokens. When training for longer, you need to decay the learning rate more gradually. Without this, longer training runs look artificially worse — biasing the results toward bigger models with shorter training.
2. Not training to convergence. Many of Kaplan's smaller models may not have been trained long enough, making small models appear less capable than they actually are.
3. Narrow parameter range. Kaplan's experiments covered a narrower range of model sizes, making the extrapolation to very large scales less reliable.
These sound like minor technical details, but they compounded into a conclusion that misdirected billions of dollars of compute investment. Methodological rigor matters — especially when entire industries follow your prescriptions.
VIII. The Chinchilla Tax
After the paper dropped, a delightful term entered the AI lexicon: the "Chinchilla tax." It refers to the extra inference cost you pay forever because you trained a model that's bigger than it needed to be.
Here's the logic: Suppose you trained a 280B parameter model (like Gopher) when a compute-optimal 70B model (like Chinchilla) would have been just as good or better. Every single API call, every chat response, every inference step now costs you 4× more than it should. Multiply that by millions of users and months of deployment, and you're talking about hundreds of millions of dollars in unnecessary inference costs.
The training run is a one-time cost. Inference runs forever.
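A toy static version of such an estimate, assuming inference cost scales linearly with parameter count. The per-parameter-per-token price is a made-up illustrative constant, not a real figure; only the 4× ratio is meaningful:

```python
# Hypothetical unit cost: dollars per parameter per generated token.
# This number is made up for illustration; only the 4x ratio is meaningful.
COST_PER_PARAM_TOKEN = 2e-16

def serving_cost_usd(n_params, tokens_served):
    """Inference cost, assuming cost scales linearly with parameter count."""
    return n_params * tokens_served * COST_PER_PARAM_TOKEN

TOKENS_SERVED = 1e12  # assumed yearly traffic

oversized = serving_cost_usd(280e9, TOKENS_SERVED)
optimal = serving_cost_usd(70e9, TOKENS_SERVED)
print(f"oversized: ${oversized:,.0f}")                 # $56,000,000
print(f"optimal:   ${optimal:,.0f}")                   # $14,000,000
print(f"Chinchilla tax: ${oversized - optimal:,.0f}")  # $42,000,000
```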
💰 Interactive — The Chinchilla Tax Calculator
Estimate how much extra you'd spend serving an oversized model vs. the compute-optimal one.
This is a simplified estimate, but the directional insight is clear: an oversized model is a gift that keeps on taking.
The "Chinchilla tax" is why this paper had such outsized impact on the industry. It wasn't just about training better models — it was about the economics of deployment. In a world where LLMs are served to millions of users, inference costs dominate. A smaller model with the same (or better!) capabilities is worth enormously more than a bigger one.
Some wags noted the irony: the entire industry had been voluntarily paying a massive Chinchilla tax for years — because they were following scaling laws that told them bigger was always better.
IX. How This Changed Every AI Lab
The Chinchilla paper didn't just start an academic debate. It fundamentally changed how every major AI lab allocates compute. The evidence is visible in every major model released after March 2022.
🏭 Interactive — The Post-Chinchilla Era
Hover over each model to see how the Chinchilla paper influenced its design. Notice how the tokens-per-parameter ratio increased dramatically.
LLaMA
Meta, Feb 2023
65B / 1.4T
~21 tok/param ✅
Explicitly cited Chinchilla. Trained the 7B model on 1T tokens — far more than Kaplan would prescribe. Open-sourced, catalyzing the entire open-source LLM movement.
PaLM 2
Google, May 2023
340B / 3.6T
~10.6 tok/param
Google dramatically increased training data. PaLM 2 was smaller than PaLM 1 (540B) but significantly better — a direct validation of Chinchilla's thesis.
LLaMA 2
Meta, Jul 2023
70B / 2T
~28 tok/param ✅
Meta went even beyond Chinchilla-optimal, training on more data than the scaling law would suggest. Turns out, even more data than 20× helps — you can "over-train" for inference savings.
Mistral 7B
Mistral, Sep 2023
7B / ?T
Heavily data-fed
Mistral built a tiny 7B model that rivaled much larger ones. The Chinchilla philosophy taken to an extreme: maximum data, minimum parameters, maximum efficiency.
LLaMA 3
Meta, Apr 2024
70B / 15T
~214 tok/param 🔥
Meta pushed far beyond Chinchilla-optimal, deliberately "over-training" the model so it would be smaller and cheaper at inference time. A strategy that only makes sense once you accept Chinchilla's framework.
GPT-4
OpenAI, Mar 2023
~1.8T / ~13T
MoE, ~7 tok/param
While details are sparse, leaks suggest GPT-4 is a Mixture-of-Experts model. The MoE architecture itself is partly a response to Chinchilla: get more parameters but only activate a fraction per token.
Notice the trend: post-Chinchilla models all dramatically increased their training data budgets. The parameter arms race was over.
The pattern is unmistakable. Before Chinchilla, labs scaled parameters 100× while barely increasing data 10×. After Chinchilla, the data budgets exploded: 1T, 2T, 15T tokens. Some teams deliberately "over-trained" their models on more data than Chinchilla-optimal, accepting slightly worse loss to get a smaller, cheaper model for deployment.
This strategy — training beyond the compute-optimal point for inference efficiency — has become known as inference-optimal training, and it's a direct intellectual descendant of the Chinchilla paper. You can't think about inference optimization until you first accept the framework that says "bigger isn't always better."
X. Limitations and the Data Wall
The Chinchilla paper is brilliant, but it's not without problems. The most immediate practical challenge is what some call the "data wall."
If a 1 trillion parameter model needs ~20 trillion tokens of high-quality training data, where exactly does all that data come from? The entire internet (as of 2022) was estimated to contain roughly 5–10 trillion tokens of high-quality text. That's a hard ceiling.
Several strategies have emerged to address this:
🧱 Interactive — The Data Wall
Select a model size to see how much data Chinchilla prescribes vs. how much quality data exists. Click each solution to learn more.
Chinchilla-optimal data needed:
10.0T tokens
Estimated high-quality internet text:
~7T tokens
⚠️ A 500B model needs ~10T tokens. That's more high-quality text than exists on the public internet!
Solutions being explored (click to expand):
🤖 Synthetic Data Generation
🔄 Multi-Epoch Training
🖼️ Multimodal Data
🧹 Better Data Curation
The Chinchilla scaling law assumes infinite data. In practice, data is the bottleneck.
Other limitations worth noting:
The 20× ratio may shift with architecture changes. The Chinchilla experiments used standard dense Transformers. Mixture-of-Experts models, which activate only a fraction of parameters per token, have different compute profiles. The exact ratio might be different for different architectures.
Quality ≠ Quantity. The paper treats all tokens as equal. But a token from a carefully curated textbook isn't the same as a token from a random Reddit thread. Data quality modifies the effective scaling law in ways that are still being researched.
Downstream performance vs. loss. Chinchilla optimizes pre-training loss (perplexity). But what matters in practice is downstream task performance after fine-tuning. The mapping from loss to usefulness isn't always straightforward.
XI. Legacy and What Came Next
The Chinchilla paper is one of those rare works that instantly divides AI history into "before" and "after." Its influence extends far beyond the specific numbers it proposed.
It made data a first-class citizen. Before Chinchilla, the AI discourse was all about model size. After Chinchilla, labs started investing heavily in data pipelines, data quality, and data curation. Common Crawl refinement became as important as architecture innovation.
It enabled the open-source revolution. If bigger was always better, only the richest labs could compete. But Chinchilla showed that a well-trained 7B or 13B model could punch far above its weight. This opened the door for Meta's LLaMA, Mistral, and the entire open-source ecosystem.
It shifted the cost conversation from training to inference. Once you accept that a smaller model can match a larger one, the economics of deployment become paramount. This led to quantization, distillation, and inference-optimization becoming major research areas.
🧠 Interactive — Test Your Understanding
Five questions to check if the Chinchilla insight has really sunk in.
Perhaps the deepest legacy of the Chinchilla paper is philosophical. It taught the AI community that intuition about scaling can be dangerously wrong. The field's most prominent researchers, at the best-funded labs, had been following a suboptimal strategy for years — because a single paper's methodology was slightly off, and nobody questioned it seriously until Hoffmann et al. did the rigorous experiments.
If the smartest people in AI can be wrong about something this fundamental, what else might the field be getting wrong?
XII. Summary: The Scaling Cheat Sheet
Let's compress everything into a quick reference.
📋 Interactive — Chinchilla Cheat Sheet
Click on any row for a quick refresher on each key concept.
Concept
Pre-Chinchilla
Post-Chinchilla
Scaling priority
Parameters first
Equal: params & data
Tokens per parameter
~1-5×
~20× (or more)
Key bottleneck
Compute & model size
Data quality & quantity
Inference cost
Ignored during training
Central design concern
Compute formula
C ≈ 6ND
C ≈ 6ND (same, better used)
Optimal scaling exponent
N ∝ C⁰·⁷³ (Kaplan)
N ∝ C⁰·⁵⁰ (Hoffmann)
This table might be the most expensive correction in AI history. Billions of dollars of compute were allocated based on the left column.
The One-Sentence Summary
If you're training a language model, scale model size and training data in equal proportion — roughly 20 tokens per parameter — and you'll beat any model that puts its money primarily into more parameters.
The Chinchilla paper is ultimately a story about humility in the face of data. The most expensive, prestigious models in the world were undertrained. The fix wasn't a better architecture or a clever trick — it was simply reading more books. There's a lesson in that for all of us.