Imagine you're about to spend $10 million training a language model. You have a choice: a 1-billion parameter model trained for a long time, or a 10-billion parameter model trained briefly. Which gives you better bang for your buck? Before January 2020, the answer was essentially: "Run the experiment and find out."
Then Jared Kaplan and his collaborators at OpenAI published a paper that changed the game. They discovered that language model performance follows clean, predictable power laws — and that you could forecast a model's loss before spending a single GPU-hour. This paper turned the art of training large models into something closer to engineering.
Let's dive into the most influential paper you've maybe never read — one that quietly sits behind every decision to train GPT-4, Claude, Gemini, and every other frontier model.
Section I The Punchline: Loss is Predictable
Here's the core revelation of the paper, stated plainly: if you plot language model cross-entropy loss against scale — whether that's the number of parameters, the amount of training data, or the total compute — you get a straight line on a log-log plot.
A straight line on a log-log plot means the relationship is a power law: every time you multiply your scale by some factor, you get a predictable, consistent relative improvement in performance. The absolute gains shrink, but the proportional gains never do. Not a cliff. Not chaos. Just… a smooth, beautiful line stretching across many orders of magnitude.
This was shocking. Neural networks are famously finicky. Training them involves millions of interacting parameters, random initialization, stochastic gradient descent, and all sorts of potential instabilities. And yet, zoom out far enough, and the final loss is governed by something as simple as y = a·x^b.
Demo: Log-Log Plot Explorer
Toggle between linear and log-log scales to see how a power law transforms into a straight line. Drag the slider to change the exponent.
On log-log axes, a power law L ∝ N^−α becomes a straight line with slope −α.
Key insight: The specific values of the exponents tell you how fast loss decreases with scale. Kaplan et al. found α ≈ 0.076 for parameters (N), α ≈ 0.095 for data (D), and α ≈ 0.050 for compute (C). Small exponents — but they compound over orders of magnitude.
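You can verify the straight-line claim numerically: generate synthetic losses from a power law using the paper's parameter exponent, take logs, and a linear fit recovers the exponent as the slope. A minimal sketch (the prefactor a = 10 is an arbitrary choice, not from the paper):

```python
import numpy as np

# Synthetic power law L = a * N^(-alpha), using the paper's parameter
# exponent; the prefactor a = 10 is arbitrary.
a, alpha = 10.0, 0.076
N = np.logspace(6, 12, num=7)   # model sizes from 1M to 1T parameters
L = a * N ** -alpha

# In log space the power law is linear: log L = log a - alpha * log N,
# so a straight-line fit recovers the exponent as (minus) the slope.
slope, intercept = np.polyfit(np.log10(N), np.log10(L), 1)
print(round(-slope, 3))  # 0.076
```

Swap in real (noisy) measurements for `L` and the same fit gives you an empirical scaling exponent.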
Section II Three Knobs of Scale
When you train a language model, there are exactly three dials you can turn to make it more powerful. Kaplan et al. studied each one in isolation and in combination:
N — Number of Parameters. This is the model's "size." A model with 100 million parameters has 100 million learnable weights. GPT-3 has 175 billion. More parameters means more capacity to store and generalize patterns.
D — Dataset Size (in tokens). This is how much text the model trains on. A token is roughly ¾ of a word. More data means more examples for the model to learn from. WebText had ~8 billion tokens; modern datasets have trillions.
C — Compute Budget (in FLOPs). This is the total computational work: roughly C ≈ 6ND for a Transformer (6 floating-point operations per parameter per token). It's the product of how big the model is and how long you train it.
Interactive: The Three Knobs of Scale
Adjust each knob and see how they relate. The compute budget C ≈ 6·N·D links all three.
The fundamental equation: C ≈ 6ND links parameters, data, and compute.
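To see the C ≈ 6ND rule in action, plug in rough public figures for GPT-3 (about 175 billion parameters, about 300 billion training tokens) and you land near the ~3 × 10²³ FLOPs commonly quoted for that run. A quick sketch:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate Transformer training compute: ~6 FLOPs per parameter
    per token (roughly 2 for the forward pass, 4 for the backward pass)."""
    return 6.0 * n_params * n_tokens

# GPT-3: ~175B parameters trained on ~300B tokens.
c = training_flops(175e9, 300e9)
print(f"{c:.2e}")  # 3.15e+23
```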
The beauty of Kaplan et al.'s framework is that it treats each of these knobs both independently and jointly. Given a fixed compute budget C, how should you split it between N and D? That's the central allocation question — and the answer surprised everyone.
Section III Power Laws — Nature's Favorite Pattern
Before we unpack the specific scaling laws, let's build intuition for power laws themselves. A power law has the form y = a·x^b: one quantity varies as a fixed power of another.
Power laws are everywhere in nature: earthquake magnitudes (Gutenberg-Richter law), city population distributions (Zipf's law), even word frequencies in language. The hallmark is scale invariance — the same relationship holds whether you're looking at small scales or large ones.
Why do they appear in neural network scaling? The honest answer is: we don't fully know. There are theoretical arguments about feature learning, information-theoretic limits, and the structure of natural language. But the empirical fact is undeniable — the fits are remarkably clean across 7+ orders of magnitude.
Prediction Challenge: Spot the Power Law
Below are four curves on a log-log plot. Which one represents a true power law? (Hint: power laws are perfectly straight on log-log axes.)
The critical thing about power laws versus, say, exponential decay is that power laws never "bottom out" or "plateau" — they keep improving, just more and more slowly. This is both encouraging (more scale always helps) and humbling (you need a lot more scale for each incremental gain).
Section IV Scaling in Parameters: L(N)
The first scaling law says: when you make a model bigger (more parameters) while giving it plenty of data, loss decreases as a power law in N.
What this means in practice: if you 10× the number of parameters, you divide the loss by a factor of 10^0.076 ≈ 1.19 — roughly a 16% reduction in the "reducible" loss component. That might not sound like much, but compound it over several orders of magnitude and you go from a model that babbles to one that writes essays.
Crucially, this relationship is independent of model architecture details. Whether you use 12 layers or 48, wider or narrower, the loss mostly depends on the total parameter count. This was a surprising finding — it suggests there's something fundamental about capacity, not architecture, that drives performance.
Interactive: Scale Your Model
Drag the slider to increase model parameters and watch the predicted loss drop along the L(N) power law curve.
The smooth curve shows L(N) ∝ N^−0.076. Each point represents a model of that size trained to convergence.
Why nats? The loss is measured in "nats" — natural units of information (using the natural log instead of log base 2). 1 nat ≈ 1.44 bits. A loss of 3.0 nats means the model's per-token perplexity is e^3 ≈ 20 — it considers roughly 20 tokens equally likely at each position.
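The unit conversions in that callout are one-liners; a quick sketch:

```python
import math

def nats_to_bits(nats: float) -> float:
    """Convert information measured in nats to bits: 1 nat = 1/ln(2) bits."""
    return nats / math.log(2)

def perplexity(loss_nats: float) -> float:
    """Per-token perplexity implied by a cross-entropy loss in nats."""
    return math.exp(loss_nats)

print(round(nats_to_bits(1.0), 2))  # 1.44
print(round(perplexity(3.0), 1))    # 20.1
```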
Section V Scaling in Data: L(D)
The second law: when you increase the amount of training data while using a large enough model, loss decreases as a power law in D.
Data scales slightly faster than parameters (0.095 > 0.076). That means, unit-for-unit, more data gives you a bit more bang than more parameters. But there's a catch: high-quality text data is finite. At some point, you start running into data walls — a problem that has indeed become relevant by 2025.
The "large enough model" caveat is important. If your model is too small for your dataset, it bottlenecks — it can't absorb all the patterns in the data. The scaling law for D assumes you've removed this bottleneck.
Interactive: Data Scaling Explorer
Set a model size, then drag the data slider to see how loss decreases. Notice how small models "plateau" while large models keep improving.
Smaller models hit a "wall" — they can't absorb more data. Larger models keep improving.
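The bottleneck behavior is captured by the paper's joint fit, L(N, D) = [(N_c/N)^(α_N/α_D) + D_c/D]^α_D. Here is a sketch using the paper's fitted constants (N_c ≈ 8.8 × 10^13, D_c ≈ 5.4 × 10^13); treat the absolute loss values as illustrative, since they depend on the tokenizer and dataset:

```python
def loss_nd(n: float, d: float,
            n_c: float = 8.8e13, d_c: float = 5.4e13,
            alpha_n: float = 0.076, alpha_d: float = 0.095) -> float:
    """Kaplan et al.'s joint fit L(N, D), in nats per token. The first term
    dominates when capacity (N) is the bottleneck, the second when data (D) is."""
    return ((n_c / n) ** (alpha_n / alpha_d) + d_c / d) ** alpha_d

small, large = 1e7, 1e9   # 10M-parameter vs 1B-parameter model
for d in (1e9, 1e10, 1e11):
    print(f"D={d:.0e}  small: {loss_nd(small, d):.3f}  large: {loss_nd(large, d):.3f}")
# The small model's loss flattens as D grows; the large model keeps improving.
```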
Section VI Scaling in Compute: L(C)
The third and perhaps most practically important law: if you optimally allocate a fixed compute budget C between model size and data, loss decreases as a power law in C.
The exponent is smaller (0.050 vs 0.076 or 0.095), which makes sense — compute is the "joint" variable that combines N and D. But the key point is that this law tells you: "If I have 10× more compute, here's how much better my model will be." That's incredibly valuable for planning.
Think about what this means for a lab like OpenAI, Google, or Anthropic. Before this paper, deciding to spend $100M on training was a leap of faith. After this paper, it's a calculable investment: you can estimate the loss you'll achieve before you even begin training. That's a seismic shift.
Interactive: Compute Budget Calculator
Set your compute budget and see the predicted loss. Compare it against known models.
| Model (approx) | Compute (FLOPs) | Loss (nats) |
|---|---|---|
| GPT-2 Small | ~2 × 10¹⁹ | ~3.3 |
| GPT-2 XL | ~3 × 10²⁰ | ~2.9 |
| GPT-3 13B | ~5 × 10²² | ~2.4 |
| GPT-3 175B | ~3 × 10²³ | ~2.1 |
Each 10× increase in compute divides the reducible loss by 10^0.050 ≈ 1.12 — roughly an 11% reduction.
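That per-decade figure follows directly from the exponent; a two-line sketch:

```python
alpha_c = 0.050  # Kaplan et al.'s compute-scaling exponent

def loss_ratio(compute_multiplier: float) -> float:
    """Factor by which the reducible loss shrinks when compute grows by
    `compute_multiplier`, under L(C) proportional to C^-alpha_c."""
    return compute_multiplier ** -alpha_c

print(round(loss_ratio(10), 3))   # 0.891 -> ~11% lower loss per decade of compute
print(round(loss_ratio(100), 3))  # 0.794 -> ~21% lower over two decades
```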
Section VII Larger Models Are More Sample-Efficient
Here's one of the paper's most surprising and important findings: bigger models reach the same loss with less data. In other words, larger models are more sample-efficient.
Think of it like this: a student with a bigger brain (more parameters) can learn from fewer examples. A small model might need to see a billion sentences to learn subject-verb agreement; a large model grasps it after seeing far fewer. This isn't just convenient — it has deep implications for data requirements.
Specifically, to reach a given loss level, the amount of training data needed scales as D ∝ N^0.74. So if you 10× the model size, you only need about 10^0.74 ≈ 5.5× the data to reach the same loss level with optimal training. You don't need 10× the data — big models squeeze more information from each example.
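A quick numeric check of the D ∝ N^0.74 relation (the helper function is shorthand for this article, not from the paper):

```python
def extra_data_needed(model_size_multiplier: float, exponent: float = 0.74) -> float:
    """Under D proportional to N^0.74, the factor by which the token budget
    must grow when the model grows by `model_size_multiplier`."""
    return model_size_multiplier ** exponent

print(round(extra_data_needed(10), 1))   # 5.5  (10x model -> ~5.5x data)
print(round(extra_data_needed(100), 1))  # 30.2 (100x model -> ~30x data)
```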
Interactive: Sample Efficiency Comparison
Watch how a larger model achieves a target loss using fewer training tokens. Click "Step" to advance through the training process.
Larger models learn faster — they extract more information per token.
The data wall problem: This finding is why modern labs obsess over data. Even with sample efficiency advantages, models at the frontier need trillions of tokens. High-quality text data is genuinely running out — which is why synthetic data, multimodal data, and careful data curation have become critical research areas.
Section VIII The Optimal Allocation of Compute
Now the big question: given a fixed compute budget, how should you split it between model size (N) and training data (D)?
Kaplan et al.'s answer was striking: most of the budget should go to making the model bigger, not training longer. Specifically, they found that with optimal allocation, as compute C increases, model size should grow as N ∝ C^0.73 while data grows only as D ∝ C^0.27.
Read that again: the exponent for model size (0.73) is nearly three times the exponent for data (0.27). This means that when you get 10× more compute, you should increase your model size by ~5.4× but only increase your data by ~1.9×.
This was the recommendation that directly influenced the design of GPT-3. It's also the recommendation that was later challenged by the Chinchilla paper (Hoffmann et al., 2022), which found a more balanced split. But we'll get to that. First, let's explore the allocation interactively.
Interactive: Compute Allocation Optimizer
Given a compute budget, drag the slider to explore different N/D allocations. The loss curve shows which split is optimal. Compare the Kaplan and Chinchilla prescriptions.
The loss valley shows the optimal split. Kaplan favors larger models; Chinchilla favors more data.
Kaplan vs. Chinchilla: Kaplan et al. (2020) recommended scaling parameters as N ∝ C^0.73. Hoffmann et al. (2022, "Chinchilla") later found the optimal split is closer to 50-50 (N ∝ C^0.5, D ∝ C^0.5). The difference came from fixing a methodology issue in how models were compared. Both are landmark results — together they bracket the optimal strategy.
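The two prescriptions can be compared side by side. Since the proportionality constants aren't pinned down here, the sketch below compares only multiplicative growth under each exponent pair, not absolute model sizes:

```python
def scale_up(c_multiplier: float, n_exp: float, d_exp: float) -> tuple:
    """Growth in model size and data when compute grows by `c_multiplier`,
    under N proportional to C^n_exp and D proportional to C^d_exp."""
    return c_multiplier ** n_exp, c_multiplier ** d_exp

# 10x more compute under each prescription:
kaplan = scale_up(10, 0.73, 0.27)      # ~(5.4x model, 1.9x data)
chinchilla = scale_up(10, 0.50, 0.50)  # ~(3.2x model, 3.2x data)
print(kaplan, chinchilla)
```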
Section IX Early Stopping & the Convergence Tradeoff
Here's a practical consequence of the scaling laws that matters enormously for training efficiency: you don't need to train a model to full convergence to benefit from its size advantage.
Kaplan et al. found that the loss curves for different model sizes are roughly parallel on a log scale. A larger model starts at a lower loss and improves at roughly the same rate per step. This means you can early-stop a large model and still beat a smaller model trained for much longer.
This insight is critical for budget-constrained training. If you can't afford to train a big model for a long time, it might still be worth starting a big model and stopping early — rather than training a small model exhaustively. You get a better model per FLOP spent.
Interactive: Early Stopping Simulator
Watch two models train: a small model trained to convergence vs. a large model early-stopped at the same compute budget. Which achieves lower loss?
The large model wins even with early stopping — because bigger models are more compute-efficient.
This result has a delightful analogy: picture two cars sharing the same fuel budget. The one that covers more distance per gallon ends up farther down the road, even if it stops driving sooner. Bigger models have better "loss per FLOP", so they travel further down the loss curve on the same compute.
Section X Planning a Training Run
Let's put it all together. Suppose you're the head of a training team and you have a specific compute budget. How do you use the scaling laws to plan your run?
Step 1: Determine your compute budget C (based on your GPU cluster and time).
Step 2: Use the optimal allocation formula to compute the best model size N and token count D.
Step 3: Predict your expected loss.
Step 4: Validate with a small-scale pilot.
The scaling laws also let you extrapolate from small runs. Train a few small models, fit the power-law curve, and predict what a much larger model will achieve. This "scaling curve" approach has become standard practice at every major lab.
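Here's what that looks like in code, with synthetic pilot-run losses standing in for real experiments (the true curve and noise level are invented for illustration):

```python
import numpy as np

# Hypothetical pilot runs: losses for four small models, drawn from a
# synthetic power law L = 12 * N^-0.076 with mild noise (illustration only).
rng = np.random.default_rng(0)
n_pilot = np.array([1e6, 1e7, 1e8, 1e9])
l_pilot = 12.0 * n_pilot ** -0.076 * np.exp(rng.normal(0, 0.01, size=4))

# Fit log L = log a + b * log N on the pilot points, then extrapolate.
b, log_a = np.polyfit(np.log(n_pilot), np.log(l_pilot), 1)

def predict(n: float) -> float:
    """Extrapolate the fitted power law to a larger model size."""
    return float(np.exp(log_a) * n ** b)

print(round(predict(1e11), 3))  # forecast loss for a 100B-parameter model
```

The same fit-then-extrapolate loop, run on real pilot losses, is the "scaling curve" workflow described above.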
Interactive: Training Run Planner
Enter your constraints and get a recommended training plan based on the scaling laws.
Adjust your hardware constraints and see the recommended training configuration.
MFU (Model FLOPs Utilization) is the fraction of theoretical peak compute you actually use. In practice, communication overhead, memory bottlenecks, and pipeline bubbles reduce this. 30–45% is typical for large training runs. PaLM achieved ~46%, which was considered impressive.
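Putting budget and MFU together: your total useful FLOPs are the hardware's peak throughput scaled by MFU and wall-clock time. A sketch with hypothetical cluster numbers:

```python
def compute_budget(n_gpus: int, peak_flops_per_gpu: float,
                   days: float, mfu: float) -> float:
    """Total useful training FLOPs: hardware peak, scaled by MFU and wall time."""
    return n_gpus * peak_flops_per_gpu * mfu * days * 86_400  # seconds per day

# Hypothetical cluster: 1,000 GPUs at 1e15 peak FLOP/s, 30 days, 40% MFU.
c = compute_budget(1_000, 1e15, 30, 0.40)
print(f"{c:.2e}")  # 1.04e+24
```

Feed the result into C ≈ 6ND and the allocation exponents to back out a recommended N and D.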
Section XI How This Paper Changed Everything
Before the scaling laws paper, training large language models was part science, part alchemy. Researchers would try various architectures, dataset mixtures, and training schedules, often relying on intuition and small-scale experiments that didn't necessarily predict large-scale behavior.
After this paper, the field underwent a philosophical shift. The central question changed from "What's the right architecture?" to "How do we get more compute?" If performance is a smooth function of scale, and you know the function, the path to better AI is brutally simple: scale up.
This paper is a direct ancestor of the "scaling hypothesis" — the bet that scaling alone (with known architectures and data) is sufficient to reach increasingly general AI capabilities. It's the intellectual foundation that justified billion-dollar GPU clusters and multi-month training runs.
It also spawned a rich follow-up literature: Chinchilla refined the optimal allocation formula. Scaling laws for downstream tasks (like coding or math) showed similar patterns. Emergent abilities research (Wei et al., 2022) explored where scaling laws "break" — where sudden capabilities appear at specific scale thresholds.
Interactive: Timeline of Scaling
Click on each milestone to see how scaling laws influenced the trajectory of AI development.
The scaling laws paper set the stage for the modern era of large language models.
Of course, scaling laws have limitations. They say nothing about what capabilities emerge at what scale. They don't capture the effect of data quality, RLHF, or instruction tuning. They assume a fixed architecture family. And the exponents may shift as we approach data ceilings or discover new training techniques.
But the core insight remains: more scale, more predictably. And that insight alone was worth billions.
Section XII Summary & Further Reading
Key Takeaways
- Power-law scaling: Loss decreases as a smooth power law in parameters (N), data (D), and compute (C), holding across 7+ orders of magnitude.
- Three exponents: αN ≈ 0.076, αD ≈ 0.095, αC ≈ 0.050. Small numbers with enormous consequences at scale.
- Architecture-independence: Total parameter count matters more than specific architectural choices (depth vs. width).
- Sample efficiency: Larger models learn more from each training example. They're not just bigger — they're better learners.
- Optimal allocation: Kaplan recommended N ∝ C^0.73 (later revised by Chinchilla to ~C^0.5).
- Early stopping: A large, early-stopped model beats a small, fully-converged model at the same compute budget.
- Predictability: You can forecast model performance from small pilot runs, making multi-million-dollar training decisions calculable.
- Paradigm shift: This paper turned "how to make AI better" from an architecture question into a scaling question.
Further Resources
- Scaling Laws for Neural Language Models (Kaplan et al., 2020): The original paper. Dense but readable. The plots alone are worth the read.
- Training Compute-Optimal Large Language Models, "Chinchilla" (Hoffmann et al., 2022): The key follow-up that revised the optimal N/D allocation to a more balanced split.
- Emergent Abilities of Large Language Models (Wei et al., 2022): Explores where smooth scaling laws meet sudden capability jumps.
- Scaling Data-Constrained Language Models (Muennighoff et al., 2023): What happens when you run out of unique data? Explores repeated data and multi-epoch training.
- The Scaling Hypothesis (Gwern Branwen): An excellent long-form essay putting scaling laws in broader context.