From r to Q∗: Your Language Model is Secretly a Q-Function
Showing that the two alignment methods are, in a certain sense, one and the same.
How do we align LLMs? There appear to be two disparate alignment methods: the first is RLHF, which first learns a reward model from human preferences and then optimizes the policy against it; the second is DPO, which learns from human preferences directly. The two are commonly regarded as different methods, but in this paper the authors argue the opposite: DPO implicitly learns a reward, and the language model's log-probabilities play the role of a Q-function.
Here is the link: Your Language Model is Secretly a Q-Function
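To make the connection concrete, here is a minimal sketch of the DPO objective (not code from the paper; the function name, tensor names, and the beta value are illustrative, and it assumes per-sequence log-probabilities under the trained policy and a frozen reference model have already been computed). The quantity beta * (log π_θ − log π_ref) acts as an implicit reward, which is exactly what an RLHF pipeline would otherwise train a separate reward model to estimate; this is the sense in which the two methods coincide.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Each input is the summed log-probability log pi(y|x) of a response
    under either the trained policy or the frozen reference model.
    beta * (log pi - log pi_ref) serves as an implicit reward, so no
    separate reward model is fit.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference likelihood applied to the implicit rewards
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In practice the log-probabilities come from summing token log-probs of each response under the two models; everything else is a plain supervised loss, which is why DPO needs no reward-model training loop or on-policy sampling.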