The Alignment Problem
Large language models are trained on the internet. That means they've seen Shakespeare and spam, scientific papers and conspiracy theories, helpful tutorials and toxic rants. When you ask a pretrained LLM a question, it doesn't try to be helpful — it tries to predict what text would most likely come next on the internet.
This is a problem. A raw pretrained model asked "How do I pick a lock?" will happily provide detailed instructions rather than noting that this might not be the best idea. It's not malicious — it's just doing what it was trained to do: predict likely text.
Alignment is the process of taking these raw, capable-but-undirected models and making them behave the way humans actually want: helpful, harmless, and honest. But how do you teach a model something as fuzzy and subjective as "what humans prefer"?
The answer, until recently, involved a complex dance of reward models and reinforcement learning. Let's see why — and then see how DPO makes it dramatically simpler.
Click each prompt to see an example of a preferred vs rejected response. This is the data that drives alignment.
The RLHF Pipeline
Before DPO, the standard approach to alignment was Reinforcement Learning from Human Feedback (RLHF). It works, but it's a three-headed beast:
Stage 1: Supervised Fine-Tuning (SFT). You take a pretrained model and further train it on high-quality demonstration data — examples of the kind of responses you want. This gets the model into the right ballpark, but it's still imprecise.
Stage 2: Reward Model Training. You collect human comparisons — pairs of responses where humans say which is better. Then you train a separate neural network (the reward model) to score responses in a way that agrees with human judgements.
Stage 3: RL Optimization (PPO). You use Proximal Policy Optimization to fine-tune the SFT model to maximize the reward model's scores, while staying close to the SFT model (to prevent the model from "hacking" the reward function).
Each stage introduces its own complexity, hyperparameters, and failure modes. But wait — is all this complexity actually necessary?
Click each stage to learn what happens inside it. Notice how many separate models and training runs are required.
Reward Modeling
The reward model is at the heart of RLHF. Its job is to take a prompt and a response and output a single number — a reward score — that reflects how good the response is according to human preferences.
But how do you train such a model? You use the Bradley-Terry model, a classic choice model from the 1950s. The idea is elegant: given two responses y_w (the human-preferred "winner") and y_l (the "loser"), the probability that humans prefer y_w is:
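In standard Bradley-Terry form (with σ the logistic sigmoid and r the reward model):

```latex
p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)
  = \frac{\exp\big(r(x, y_w)\big)}{\exp\big(r(x, y_w)\big) + \exp\big(r(x, y_l)\big)}
```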
In other words, the bigger the gap between the reward scores, the more confident we are that humans prefer the winner. The reward model is trained to maximize this probability across all human comparison data.
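Concretely, maximizing this probability means minimizing the negative log-likelihood over the comparison dataset D:

```latex
\mathcal{L}_R(r_\phi) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]
```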
This loss function should look familiar — it's essentially binary cross-entropy. We're training a classifier that says "yes, the human preferred this one" based on the difference in reward scores.
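As a minimal sketch of that loss for a single comparison (plain Python for self-containment; in practice the two scores would be scalar outputs of the reward network):

```python
import math

def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one comparison:
    -log sigmoid(r_chosen - r_rejected), in a numerically stable form."""
    margin = r_chosen - r_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Equal scores give -log(0.5); a bigger gap in the chosen
# response's favour drives the loss toward zero.
```

Note that only the score *difference* matters — shifting both rewards by a constant leaves the loss unchanged, which is exactly the property DPO will exploit later.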
Drag the sliders to set reward scores for the chosen and rejected responses. Watch how the Bradley-Terry probability and loss change.
The PPO Problem
So we have a reward model that can score responses. Now we need to actually use it to improve the language model. RLHF does this with Proximal Policy Optimization (PPO), a reinforcement learning algorithm.
The objective is to maximize the expected reward while staying close to the reference policy (the SFT model). Mathematically:
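In the notation used throughout (π_θ the policy being trained, π_ref the SFT reference, r_φ the learned reward model):

```latex
\max_{\pi_\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[r_\phi(x, y)\big]
  \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x) \,\big\|\, \pi_{\mathrm{ref}}(y \mid x)\big]
```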
That KL divergence term is crucial — without it, the model would quickly learn to exploit quirks in the reward model rather than actually producing good responses. This is called reward hacking.
But PPO brings serious practical headaches:
- Memory: You need four models in GPU memory simultaneously — the policy being trained, the reference policy, the reward model, and a value function.
- Instability: PPO hyperparameters (clipping ratio, learning rate, batch size, number of epochs) are notoriously hard to tune for language models.
- Reward hacking: Despite the KL penalty, the model can still find ways to game the reward model, producing high-scoring but low-quality outputs.
- Sample inefficiency: RL generates samples, scores them, and updates — it's much slower than supervised learning.
What if there were a way to skip all of this? What if we could go directly from preference data to an aligned model?
Watch a simulated PPO training run. Notice how reward and KL divergence interact. Click "Run" to start, "Reset" to try again. Each run is different!
The Key DPO Insight
Here's where the magic happens. Rafailov et al. noticed something remarkable about the KL-constrained reward maximization objective we saw in Section IV.
It turns out that for any reward function r(x, y), there's a closed-form solution for the optimal policy:
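As derived in the DPO paper:

```latex
\pi_r(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big)
```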
Where Z(x) is a partition function (normalizing constant) that depends only on the prompt. Now here's the clever part — we can rearrange this to express the reward in terms of the policy:
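Taking logarithms and solving for r gives:

```latex
r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
```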
This is the reparameterization at the heart of DPO. It says: the reward of any response is just β times the log-ratio of the aligned policy over the reference policy, plus a prompt-dependent constant.
Now, remember the Bradley-Terry model from Section III? It only cares about the difference in rewards between two responses to the same prompt. And when we subtract two rewards, the Z(x) terms cancel out!
This is the punchline: we can express human preference probability entirely in terms of the policy we're training and the reference policy, without ever needing an explicit reward model.
Step through the key derivation. Click the arrows to see each algebraic step.
The DPO Loss Function
Putting it all together, the DPO loss is:
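As stated in the DPO paper (σ is again the logistic sigmoid):

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```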
Let's unpack what each piece means intuitively:
The term β log π_θ(y_w|x) / π_ref(y_w|x) is the implicit reward of the chosen response (up to a prompt-dependent constant that cancels in the loss). It measures how much more likely the aligned model makes the preferred response compared to the reference model. If the aligned model strongly favors the chosen response, this is a large positive number.
Similarly, β log π_θ(y_l|x) / π_ref(y_l|x) is the implicit reward of the rejected response.
The loss pushes the model to increase the implicit reward of chosen responses relative to rejected ones. It's a beautifully simple objective: make your model more likely to generate preferred responses and less likely to generate rejected ones, relative to where it started.
And because it's just a classification loss (binary cross-entropy on preference pairs), you can train it with standard supervised learning. No RL. No reward model. No value function. Just gradient descent on preference data.
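A per-pair sketch of this loss (plain Python; in practice the four log-probabilities come from summing token log-probs of each response under the policy and the frozen reference):

```python
import math

def dpo_pair_loss(policy_logp_w: float, policy_logp_l: float,
                  ref_logp_w: float, ref_logp_l: float,
                  beta: float = 0.1) -> float:
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where margin is the difference of the two log-probability ratios."""
    ratio_w = policy_logp_w - ref_logp_w  # implicit reward of chosen (/ beta)
    ratio_l = policy_logp_l - ref_logp_l  # implicit reward of rejected (/ beta)
    margin = beta * (ratio_w - ratio_l)
    # Numerically stable -log sigmoid(margin):
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Starting at the reference (all ratios zero) the loss is log 2; raising
# the chosen response's likelihood relative to the rejected one lowers it.
```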
Adjust the implicit rewards (log-probability ratios) for chosen and rejected responses. See how the DPO loss and its gradient respond.
The β Parameter
The hyperparameter β controls how far the optimized policy can drift from the reference policy. It plays the same role as the KL penalty coefficient in the RLHF objective, but its effect in DPO is more intuitive.
Low β (e.g., 0.1): The model is free to deviate significantly from the reference. It can make large changes to match preferences, but risks overfitting to noise in the preference data or generating degenerate text.
High β (e.g., 0.5): The model is conservative. It only makes small adjustments from the reference. This is safer but may underfit — not fully capturing human preferences.
In practice, β = 0.1 to 0.5 works well for most tasks. The paper uses 0.5 for summarization and 0.1 for dialogue.
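A quick self-contained illustration of this trade-off (the margin values are arbitrary, chosen only for demonstration): for a fixed gap in log-probability ratios, a higher β saturates the loss sooner, so the optimization stops pushing the policy away from the reference earlier.

```python
import math

def loss_at(beta: float, margin: float) -> float:
    """-log sigmoid(beta * margin) for a pair whose chosen-minus-rejected
    log-ratio gap is `margin` (illustrative values, not real data)."""
    return math.log1p(math.exp(-beta * margin))

for beta in (0.1, 0.5):
    # At a large margin, high beta has already driven the loss (and its
    # gradient) near zero; low beta keeps applying pressure.
    print(f"beta={beta}: loss at margin 1 = {loss_at(beta, 1.0):.3f}, "
          f"at margin 10 = {loss_at(beta, 10.0):.4f}")
```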
Drag the β slider to see how it affects the implicit reward landscape. Low β allows large deviations; high β keeps the policy close to the reference.
DPO vs RLHF
Let's put them side by side. The theoretical result says that DPO and RLHF converge to the same optimal policy — but the path to get there couldn't be more different.
Drag the slider to compare the two approaches.
The key advantages of DPO are practical:
- Simplicity: It's just supervised learning. You can use your existing training infrastructure.
- Stability: No RL means no reward hacking, no PPO hyperparameter tuning, no value function estimation.
- Memory: You only need two model copies (policy + reference), not four.
- Speed: Supervised learning is faster than RL's generate-score-update loop.
Experimental Results
The paper evaluates DPO on three tasks: controlled sentiment generation, summarization (TL;DR dataset), and single-turn dialogue (Anthropic HH dataset). The results are striking.
On summarization, DPO achieves a higher win rate against human-written reference summaries than PPO, while being simpler to train. On dialogue, DPO achieves the highest win rate against the preferred completions in the test set of all methods tested.
Perhaps most importantly, DPO achieves a better frontier on the reward-KL trade-off: for a given amount of KL divergence from the reference policy, DPO extracts more reward. In other words, it's more efficient at using its "deviation budget."
Click each benchmark to see how DPO compares against other methods.
The Bigger Picture
DPO didn't just provide a simpler training recipe. It catalyzed a paradigm shift in how the field thinks about alignment.
Within months of its publication, DPO became the default alignment method for open-source models. Zephyr-7B (HuggingFace), Intel Neural Chat, Starling-7B (Berkeley), and many others used DPO to achieve strong alignment without the complexity of PPO. It democratized alignment — suddenly, any team with preference data and standard training infrastructure could align models.
DPO also spawned a family of variants, each addressing different aspects:
- IPO (Identity Preference Optimization): Addresses DPO's potential overfitting by regularizing differently.
- KTO (Kahneman-Tversky Optimization): Works with unpaired data — you only need to know if individual responses are good or bad, not pairwise comparisons.
- ORPO (Odds Ratio Preference Optimization): Eliminates the need for a separate SFT stage entirely.
- SimPO: Uses the length-normalized average log-likelihood of a response as the implicit reward, eliminating the need for a reference model.
But perhaps the deepest insight is philosophical. DPO shows that your language model is secretly a reward model. The log-probability ratios already encode implicit preferences. We don't need a separate model to judge quality — the policy itself contains that information. We just need the right lens to extract it.
This is a recurring theme in machine learning: sometimes the complex solution and the simple solution are mathematically equivalent, but the simple one was hiding in plain sight. DPO found it by asking: "What if we just... rearranged the equation?"
Further Resources
- DPO Paper — Rafailov, Sharma, Mitchell, et al. "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (2023)
- Zephyr Paper — Tunstall et al. "Zephyr: Direct Distillation of LM Alignment" (2023)
- KTO Paper — Ethayarajh et al. "KTO: Model Alignment as Prospect Theoretic Optimization" (2024)
- IPO Paper — Azar et al. "A General Theoretical Paradigm to Understand Learning from Human Feedback" (2023)
- HuggingFace DPO Tutorial — Practical implementation guide using TRL