From r to Q∗: Your Language Model is Secretly a Q-Function
Showing that the two alignment methods are, in a certain sense, one and the same.
How do we align LLMs? There appear to be two disparate alignment methods: the first is RLHF, which first learns a reward model from human preferences and then optimizes the policy against it; the second is DPO, which learns from human preferences directly. The two are commonly regarded as different methods, but in this paper the authors argue the opposite: DPO implicitly learns a reward, and the language model's log-probabilities play the role of a Q-function.
Here is the link: Your Language Model is Secretly a Q-Function
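To make the connection concrete, here is a minimal sketch of the DPO objective (not code from the paper; the function name, tensor names, and the beta value are illustrative, and it assumes per-sequence log-probabilities under the trained policy and a frozen reference model have already been computed). The quantity beta * (log π_θ − log π_ref) acts as an implicit reward, which is exactly what an RLHF pipeline would otherwise train a separate reward model to estimate; this is the sense in which the two methods coincide.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Each input is the summed log-probability log pi(y|x) of a response
    under either the trained policy or the frozen reference model.
    beta * (log pi - log pi_ref) serves as an implicit reward, so no
    separate reward model is fit.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference likelihood applied to the implicit rewards
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In practice the log-probabilities come from summing token log-probs of each response under the two models; everything else is a plain supervised loss, which is why DPO needs no reward-model training loop or on-policy sampling.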