Direct Preference Optimization (DPO) is an algorithm introduced for fine-tuning LLMs to align with human preferences without the complexity and instability associated with RLHF. DPO uses a new parameterization of the reward model in RLHF that allows the corresponding optimal policy to be extracted in closed form. This approach makes it possible to solve the standard RLHF problem with a simple classification loss, making DPO stable, performant, and computationally lightweight. It eliminates the need to sample from the LM during fine-tuning or to perform significant hyperparameter tuning.
The RLHF pipeline consists of three main phases:
Supervised Fine-Tuning (SFT): This phase begins with fine-tuning a pre-trained language model (LM) on high-quality data relevant to the downstream tasks. The goal is to obtain a model, denoted πSFT, that performs well on the specific tasks of interest.
Preference Sampling and Reward Learning: In this phase, the SFT model is used to generate pairs of answers (y1, y2) for given prompts x. These pairs are presented to human labelers, who express a preference for one answer over the other, denoted yw ≻ yl | x, where yw is the preferred and yl the dispreferred completion. The preferences are assumed to be generated by a latent reward model r*(x, y), which is not directly accessible. The Bradley-Terry (BT) model is a popular choice for modeling these preferences, where the human preference distribution p* is defined as:
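$$p^*(y_1 \succ y_2 \mid x) = \frac{\exp\big(r^*(x, y_1)\big)}{\exp\big(r^*(x, y_1)\big) + \exp\big(r^*(x, y_2)\big)}$$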
A reward model rϕ(x, y) is then parameterized and estimated via maximum likelihood on a dataset of comparisons D, with the negative log-likelihood loss given by:
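$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]$$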
where σ is the logistic (sigmoid) function.
RL Fine-Tuning Phase: The learned reward function is used to provide feedback to the language model in this phase. The optimization problem is formulated as:
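$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[r_\phi(x, y)\big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\big]$$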
where β is a parameter controlling the deviation from the base reference policy πref, which is the initial SFT model. This constraint keeps the model from drifting too far from the distribution on which the reward model is accurate, maintaining generation diversity and preventing mode collapse. The optimization is typically performed with reinforcement learning, specifically Proximal Policy Optimization (PPO).
DPO works by mapping preferences directly to policy optimization without the need for an explicit reward model. This is achieved through a clever mathematical trick that reparameterizes the reward function in terms of the policy itself. Here is a breakdown of the mathematics involved:
The approach begins with the general reinforcement learning objective, which aims to find an optimal policy that maximizes reward subject to a constraint on the divergence from a reference policy. This is represented mathematically as:
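$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(y \mid x)}\big[r(x, y)\big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\big]$$

This KL-constrained objective has a known closed-form solution:

$$\pi_r(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left(\frac{1}{\beta}\, r(x, y)\right)$$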
where Z(x) is a normalization factor known as the partition function, πref is the reference policy, and r(x, y) is the reward function.
Reparameterization: Instead of directly optimizing this reward function, DPO reparameterizes the reward in terms of the policy. This is done by taking the logarithm of both sides of the solution above and rearranging, which gives:
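$$r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$$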
This reparameterization allows the optimization to be performed directly on the policy rather than on the reward function.
Preference Modeling: DPO uses models of human preferences, such as the Bradley-Terry model, to express the probability of one completion being preferred over another in terms of the optimal policy and the reference policy. This effectively transforms the problem into optimizing the policy to match human preferences.
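Substituting the reparameterized reward into the Bradley-Terry model causes the partition function Z(x) to cancel, leaving the preference probability purely in terms of policies:

$$p^*(y_1 \succ y_2 \mid x) = \sigma\!\left(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)} - \beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)}\right)$$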
Optimization Objective: The optimization objective in DPO is to maximize the likelihood of the human preferences under the policy, which amounts to minimizing the following negative log-likelihood loss:
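$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$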
Gradient Update: The gradient of the DPO loss increases the likelihood of preferred completions and decreases the likelihood of dispreferred ones. Each example is weighted by how much higher the implicit reward model rates the dispreferred completion, i.e., by how incorrectly the implicit reward model orders the pair:
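$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)\big(\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\big)\Big]$$

where $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ is the implicit reward implied by the policy and the reference policy.

To make the loss concrete, here is a minimal PyTorch sketch. It assumes the summed per-sequence log-probabilities of each completion under the trainable policy and the frozen reference model have already been computed; the function name, argument names, and the default β value are illustrative, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss sketch: each argument holds one summed log-probability per sequence."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for chosen and rejected completions.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Negative log-sigmoid of the reward margin: a simple binary-classification-style loss.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with dummy log-probabilities for a batch of two preference pairs.
policy_w = torch.tensor([-12.0, -15.0], requires_grad=True)
policy_l = torch.tensor([-14.0, -15.5], requires_grad=True)
ref_w = torch.tensor([-13.0, -15.2])
ref_l = torch.tensor([-13.5, -15.1])
loss = dpo_loss(policy_w, policy_l, ref_w, ref_l)
loss.backward()  # gradients flow only through the policy log-probabilities
print(loss.item())
```

The backward pass matches the gradient above: the reference log-probabilities carry no gradient, so only the policy is updated, with each pair weighted by how wrongly the implicit reward orders it.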
Direct Preference Optimization: Your Language Model is Secretly a Reward Model (arXiv:2305.18290)