Direct Preference Optimization

Your Language Model is Secretly a Reward Model — how a single equation eliminated the need for reinforcement learning in RLHF.

What if you could align a language model to human preferences without ever training a reward model or running reinforcement learning? DPO shows that a simple reparameterization of the RLHF objective lets you directly optimize a language model on human preference data — turning what was a complex three-stage pipeline into a single supervised learning step.


Contents

  1. The Alignment Problem
  2. The RLHF Pipeline
  3. Reward Modeling
  4. The PPO Problem
  5. The Key DPO Insight
  6. The DPO Loss Function
  7. The β Parameter
  8. DPO vs RLHF
  9. Experimental Results
  10. The Bigger Picture
I. The Alignment Problem

Large language models are trained on the internet. That means they've seen Shakespeare and spam, scientific papers and conspiracy theories, helpful tutorials and toxic rants. When you ask a pretrained LLM a question, it doesn't try to be helpful — it tries to predict what text would most likely come next on the internet.

This is a problem. A raw pretrained model asked "How do I pick a lock?" will happily provide detailed instructions rather than noting that this might not be the best idea. It's not malicious — it's just doing what it was trained to do: predict likely text.

Alignment is the process of taking these raw, capable-but-undirected models and making them behave the way humans actually want: helpful, harmless, and honest. But how do you teach a model something as fuzzy and subjective as "what humans prefer"?

The answer, until recently, involved a complex dance of reward models and reinforcement learning. Let's see why — and then see how DPO makes it dramatically simpler.

Interactive: Preference Pair Explorer

Click each prompt to see an example of a preferred vs rejected response. This is the data that drives alignment.

Human evaluators compare two model responses and pick the better one. This preference data is the foundation of alignment.
II. The RLHF Pipeline

Before DPO, the standard approach to alignment was Reinforcement Learning from Human Feedback (RLHF). It works, but it's a three-headed beast:

Stage 1: Supervised Fine-Tuning (SFT). You take a pretrained model and further train it on high-quality demonstration data — examples of the kind of responses you want. This gets the model into the right ballpark, but it's still imprecise.

Stage 2: Reward Model Training. You collect human comparisons — pairs of responses where humans say which is better. Then you train a separate neural network (the reward model) to score responses in a way that agrees with human judgements.

Stage 3: RL Optimization (PPO). You use Proximal Policy Optimization to fine-tune the SFT model to maximize the reward model's scores, while staying close to the SFT model (to prevent the model from "hacking" the reward function).

Each stage introduces its own complexity, hyperparameters, and failure modes. But wait — is all this complexity actually necessary?

Interactive: The RLHF Pipeline

Click each stage to learn what happens inside it. Notice how many separate models and training runs are required.

The standard RLHF pipeline requires three separate training stages, two models (policy + reward), and four copies of models in memory during PPO.
Why not just use SFT? SFT alone is limited because it only learns from demonstrations (what good responses look like), not comparisons (which of two responses is better). Humans find it much easier to compare than to demonstrate — I may not be able to write a perfect poem, but I can tell you which of two poems I prefer.
III. Reward Modeling

The reward model is at the heart of RLHF. Its job is to take a prompt and a response and output a single number — a reward score — that reflects how good the response is according to human preferences.

But how do you train such a model? You use the Bradley-Terry model, a classic choice model from the 1950s. The idea is elegant: given two responses yw (the human-preferred "winner") and yl (the "loser"), the probability that humans prefer yw is:

p(yw ≻ yl | x) = σ( r(x, yw) − r(x, yl) )

where σ is the logistic sigmoid. In other words, the bigger the gap between the reward scores, the more confident we are that humans prefer the winner. The reward model is trained to maximize this probability, i.e., to minimize the negative log-likelihood, across all human comparison data.

This loss function should look familiar — it's essentially binary cross-entropy. We're training a classifier that says "yes, the human preferred this one" based on the difference in reward scores.
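In code, the Bradley-Terry probability and the resulting training loss are only a few lines. This is a minimal sketch with scalar reward scores; a real reward model would produce these scores from a prompt-response pair:

```python
import math

def bradley_terry_prob(r_w: float, r_l: float) -> float:
    """P(human prefers the winner), given scalar reward scores."""
    # sigma(r_w - r_l): the bigger the reward gap, the higher the confidence
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

def reward_model_loss(r_w: float, r_l: float) -> float:
    """Negative log-likelihood of the observed preference (binary cross-entropy)."""
    return -math.log(bradley_terry_prob(r_w, r_l))
```

With a reward gap of 3.0 (say scores 2.0 and −1.0), the probability is about 0.953 and the loss about 0.049.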

Interactive: Reward Score Separation

Drag the sliders to set reward scores for the chosen and rejected responses. Watch how the Bradley-Terry probability and loss change.

Example readout: with a reward gap of 3.0 between chosen and rejected, P(human prefers chosen) = 95.3% and the loss = 0.049.
The Bradley-Terry model converts the difference in reward scores into a probability of human preference. Larger gaps → higher confidence.
IV. The PPO Problem

So we have a reward model that can score responses. Now we need to actually use it to improve the language model. RLHF does this with Proximal Policy Optimization (PPO), a reinforcement learning algorithm.

The objective is to maximize the expected reward while staying close to the reference policy (the SFT model). Mathematically:

max over πθ of  E[x ∼ D, y ∼ πθ(y|x)] [ r(x, y) ]  −  β · D_KL( πθ(y|x) ‖ πref(y|x) )

That KL divergence term is crucial: without it, the model would quickly learn to exploit quirks in the reward model rather than actually producing good responses. This is called reward hacking.
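The per-sample objective PPO sees can be sketched as the reward-model score minus a KL penalty estimated from the log-probabilities of the sampled response. This is illustrative only; real implementations typically apply the penalty per token:

```python
def kl_penalized_reward(reward_score: float,
                        logp_policy: float,
                        logp_ref: float,
                        beta: float = 0.1) -> float:
    # (logp_policy - logp_ref) is a single-sample estimate of the KL
    # divergence; subtracting it penalizes drift from the reference model.
    return reward_score - beta * (logp_policy - logp_ref)
```

If the policy assigns a response log-probability −10 where the reference assigns −12, a raw reward of 1.0 shrinks to 0.8 at β = 0.1.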

But PPO brings serious practical headaches:

Memory: PPO keeps four models around at once: the policy being trained, a frozen reference copy, the reward model, and a value function (critic).

Instability: RL fine-tuning is notoriously sensitive to hyperparameters (KL coefficient, clipping range, batch size, learning rate) and can collapse or diverge mid-run.

Cost: every optimization step requires sampling fresh responses from the policy, and autoregressive generation is slow and expensive.

What if there were a way to skip all of this? What if we could go directly from preference data to an aligned model?

Interactive: PPO Training Dynamics

Watch a simulated PPO training run. Notice how reward and KL divergence interact. Click "Run" to start, "Reset" to try again. Each run is different!

PPO often increases reward model scores faster than actual response quality — the gap between the blue and gold lines is "reward hacking."
A word on scale: Training GPT-4 with RLHF reportedly required a dedicated infrastructure team and months of iteration. The complexity isn't just theoretical — it's a real bottleneck for teams trying to align open-source models.
V. The Key DPO Insight

Here's where the magic happens. Rafailov et al. noticed something remarkable about the KL-constrained reward maximization objective we saw in Section IV.

It turns out that for any reward function r(x, y), the KL-constrained objective has a closed-form solution for the optimal policy:

π*(y|x) = (1 / Z(x)) · πref(y|x) · exp( r(x, y) / β )

where Z(x) is a partition function (normalizing constant) that depends only on the prompt. Now here's the clever part: we can invert this to express the reward in terms of the policy:

r(x, y) = β log [ π*(y|x) / πref(y|x) ] + β log Z(x)

This is the reparameterization at the heart of DPO. It says: the reward of any response is just β times the log-ratio of the aligned policy over the reference policy, plus a prompt-dependent constant.

Now, remember the Bradley-Terry model from Section III? It only cares about the difference in rewards between two responses to the same prompt. And when we subtract two rewards, the Z(x) terms cancel out!

This is the punchline: we can express human preference probability entirely in terms of the policy we're training and the reference policy, without ever needing an explicit reward model.
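The cancellation is easy to verify numerically. The toy numbers below (reference probabilities and rewards for three candidate responses) are made up for illustration; the point is that the preference probability computed from raw rewards and the one computed from policy log-ratios agree exactly, because Z(x) drops out of the difference:

```python
import math

beta = 0.5
ref = {"a": 0.5, "b": 0.3, "c": 0.2}       # reference policy pi_ref(y|x)
reward = {"a": 1.0, "b": 0.2, "c": -0.5}   # arbitrary reward scores r(x, y)

# Closed-form optimal policy: pi*(y|x) = pi_ref(y|x) * exp(r(x,y)/beta) / Z(x)
Z = sum(ref[y] * math.exp(reward[y] / beta) for y in ref)
pi = {y: ref[y] * math.exp(reward[y] / beta) / Z for y in ref}

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Bradley-Terry preference probability, computed two ways:
p_from_rewards = sigmoid(reward["a"] - reward["b"])
p_from_policy = sigmoid(beta * (math.log(pi["a"] / ref["a"])
                                - math.log(pi["b"] / ref["b"])))
# Identical: the beta * log Z(x) terms cancel in the subtraction.
```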

Interactive: The DPO Derivation Step by Step

Step through the key derivation. Click the arrows to see each algebraic step.

The partition function Z(x) cancels when we take the difference — this is what makes DPO possible.
VI. The DPO Loss Function

Putting it all together, the DPO loss is:

L_DPO(πθ; πref) = −E[(x, yw, yl) ∼ D] [ log σ( β log [πθ(yw|x) / πref(yw|x)] − β log [πθ(yl|x) / πref(yl|x)] ) ]
Let's unpack what each piece means intuitively:

The term β log [πθ(yw|x) / πref(yw|x)] is the implicit reward of the chosen response. It measures how much more likely the aligned model makes the preferred response compared to the reference model. If the aligned model strongly favors the chosen response, this is a large positive number.

Similarly, β log [πθ(yl|x) / πref(yl|x)] is the implicit reward of the rejected response.

The loss pushes the model to increase the implicit reward of chosen responses relative to rejected ones. It's a beautifully simple objective: make your model more likely to generate preferred responses and less likely to generate rejected ones, relative to where it started.

And because it's just a classification loss (binary cross-entropy on preference pairs), you can train it with standard supervised learning. No RL. No reward model. No value function. Just gradient descent on preference data.
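As a sketch, the per-pair loss in plain Python. The inputs are assumed to be summed log-probabilities of each full response under the two models; a real implementation would batch this over a dataset in a framework like PyTorch:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair (w = chosen, l = rejected)."""
    # Implicit rewards: beta times the policy/reference log-ratio
    r_w = beta * (logp_w - ref_logp_w)
    r_l = beta * (logp_l - ref_logp_l)
    # Binary cross-entropy on the implicit reward margin
    return -math.log(sigmoid(r_w - r_l))
```

With an implicit reward margin of 2.0 the loss is about 0.127; flip the margin to −2.0 and it jumps to about 2.127.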

Interactive: DPO Loss Calculator

Adjust the implicit rewards (log-probability ratios) for chosen and rejected responses. See how the DPO loss and its gradient respond.

Example readout: β · (r̂w − r̂l) = 2.0, so σ(margin) = 88.1% and loss = 0.127.
When the implicit reward margin is large and positive (chosen >> rejected), the loss is near zero. When it's negative, the loss is high — pushing the model to fix its preferences.
An elegant property of DPO: The gradient of the DPO loss automatically weights examples by how "wrong" the model currently is. If the model already strongly prefers the chosen response, the gradient is small. If it prefers the rejected one, the gradient is large. This is implicit curriculum learning — for free.
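This weighting falls straight out of the calculus: differentiating −log σ(r̂w − r̂l) gives a per-example scale factor of σ(r̂l − r̂w). A small sketch:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dpo_gradient_weight(r_w: float, r_l: float) -> float:
    """Scale factor the DPO gradient applies to one preference pair."""
    # Near 0 when the model already ranks chosen above rejected,
    # near 1 when it gets the pair wrong.
    return sigmoid(r_l - r_w)
```

For implicit rewards (2.0, −2.0) the weight is about 0.018 (already correct, tiny update); for (−2.0, 2.0) it is about 0.982 (wrong, large update).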
VII. The β Parameter

The hyperparameter β controls how far the optimized policy can drift from the reference policy. It plays the same role as the KL penalty coefficient in the RLHF objective, but its effect in DPO is more intuitive.

Low β (e.g., 0.1): The model is free to deviate significantly from the reference. It can make large changes to match preferences, but risks overfitting to noise in the preference data or generating degenerate text.

High β (e.g., 0.5): The model is conservative. It only makes small adjustments from the reference. This is safer but may underfit — not fully capturing human preferences.

In practice, β = 0.1 to 0.5 works well for most tasks. The paper uses 0.1 for summarization and 0.5 for dialogue.
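A quick way to see β's effect: hold the policy's deviation from the reference fixed and vary β. Higher β turns the same deviation into a larger implicit-reward margin, so the loss saturates sooner and the optimizer has less incentive to drift further. The log-ratios below are illustrative numbers only:

```python
import math

def pair_loss(log_ratio_w: float, log_ratio_l: float, beta: float) -> float:
    """DPO loss given the policy/reference log-ratios of one pair."""
    margin = beta * (log_ratio_w - log_ratio_l)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Same deviation from the reference, two values of beta:
low = pair_loss(4.0, -4.0, beta=0.1)   # margin 0.8 -> loss ~0.371
high = pair_loss(4.0, -4.0, beta=0.5)  # margin 4.0 -> loss ~0.018
```

At β = 0.5 this pair is already nearly "solved," while at β = 0.1 the optimizer still has a strong incentive to push the policy further from the reference.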

Interactive: Effect of β on Policy Divergence

Drag the β slider to see how it affects the implicit reward landscape. Low β allows large deviations; high β keeps the policy close to the reference.

Lower β → the optimized policy (green) can shift further from the reference (blue). Higher β → the two stay close together.
VIII. DPO vs RLHF

Let's put them side by side. The theoretical result says that DPO and RLHF converge to the same optimal policy — but the path to get there couldn't be more different.

Interactive: Pipeline Comparison

Drag the slider to compare the two approaches.

RLHF Pipeline
1. Train SFT model on demonstrations
2. Collect preference comparisons
3. Train separate reward model
4. Run PPO with reward model
5. KL penalty tuning
Models in memory: 4
Training stages: 3
Hyperparameters: ~12+
DPO Pipeline
1. Train SFT model on demonstrations
2. Collect preference comparisons
3. Run DPO (supervised learning!)
No reward model needed
No RL needed
Models in memory: 2
Training stages: 2
Hyperparameters: ~3
Same theoretical optimum, dramatically simpler path. DPO replaces stages 3-5 with a single supervised learning step.

The key advantages of DPO are practical:

Simplicity: one supervised training run, with no reward model, no RL loop, and no value function.

Stability: a standard classification loss with a handful of hyperparameters replaces a fragile RL setup.

Efficiency: only two models (the policy and a frozen reference) are needed in memory, and no sampling is required during training.

IX. Experimental Results

The paper evaluates DPO on three tasks: controlled sentiment generation, summarization (TL;DR dataset), and single-turn dialogue (Anthropic HH dataset). The results are striking.

On summarization, DPO achieves a higher win rate against human-written references than PPO while being far simpler to train. On dialogue, DPO achieves the highest win rate against the preferred responses in the test set of all methods evaluated.

Perhaps most importantly, DPO achieves a better frontier on the reward-KL trade-off: for a given amount of KL divergence from the reference policy, DPO extracts more reward. In other words, it's more efficient at using its "deviation budget."

Interactive: Win Rate Comparison

Click each benchmark to see how DPO compares against other methods.

Win rates (%) against human reference or test set. Higher is better. DPO matches or exceeds PPO across all benchmarks while being far simpler to implement.
X. The Bigger Picture

DPO didn't just provide a simpler training recipe. It catalyzed a paradigm shift in how the field thinks about alignment.

Within months of its publication, DPO became the default alignment method for open-source models. Zephyr-7B (HuggingFace), Intel Neural Chat, Starling-7B (Berkeley), and many others used DPO to achieve strong alignment without the complexity of PPO. It democratized alignment — suddenly, any team with preference data and standard training infrastructure could align models.

DPO also spawned a family of variants, each addressing different aspects: IPO replaces the log-sigmoid with a squared loss to curb overfitting to the preference data; KTO drops the need for paired comparisons and learns from individual "good" or "bad" labels; ORPO removes the reference model entirely, folding preference optimization into supervised fine-tuning.

But perhaps the deepest insight is philosophical. DPO shows that your language model is secretly a reward model. The log-probability ratios already encode implicit preferences. We don't need a separate model to judge quality — the policy itself contains that information. We just need the right lens to extract it.

This is a recurring theme in machine learning: sometimes the complex solution and the simple solution are mathematically equivalent, but the simple one was hiding in plain sight. DPO found it by asking: "What if we just... rearranged the equation?"

Quiz: Test Your Understanding

Further Resources