The LLM generates multiple outputs for a given input prompt using techniques like sampling or beam search.
Example outputs: (doc is, him) and (doc is, them).
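As a minimal sketch of the sampling step, the snippet below draws several candidate completions for one prompt from a toy next-word distribution. The probability table, the function name `sample_outputs`, and the seed are hypothetical and chosen only for illustration; a real LLM would produce these probabilities itself.

```python
import random

# Hypothetical next-word probabilities for the prompt "doc is".
# These numbers are made up for illustration, not taken from a real model.
NEXT_WORD_PROBS = {"him": 0.5, "them": 0.3, "her": 0.2}

def sample_outputs(prompt, probs, n, seed=0):
    """Draw n (prompt, completion) pairs by sampling words according to probs."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same draws
    words = list(probs)
    weights = [probs[w] for w in words]
    return [(prompt, rng.choices(words, weights=weights, k=1)[0]) for _ in range(n)]

candidates = sample_outputs("doc is", NEXT_WORD_PROBS, n=4)
print(candidates)
```

Because the draws are random, the same prompt can yield different completions, which is what gives the human reviewer distinct outputs to compare.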
A human reviewer evaluates these outputs and chooses a winner based on alignment with human values.
For instance, (doc is, them) is selected as the winner because it is less biased.
(doc is, him): Reward = 3
(doc is, them): Reward = 5
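The text does not say how these reward scores are used in training. One common choice, assumed here and not confirmed by the source, is a pairwise (Bradley-Terry style) objective that turns the gap between the two rewards into the probability that the human-preferred output wins. The function names below are hypothetical.

```python
import math

# Reward scores from the example above.
rewards = {("doc is", "him"): 3.0, ("doc is", "them"): 5.0}

def preference_probability(r_winner, r_loser):
    """Sigmoid of the reward gap (Bradley-Terry form, an assumption here)."""
    return 1.0 / (1.0 + math.exp(-(r_winner - r_loser)))

def pairwise_loss(r_winner, r_loser):
    """Negative log-probability that the winner is preferred; lower is better."""
    return -math.log(preference_probability(r_winner, r_loser))

winner = ("doc is", "them")  # chosen by the human reviewer
loser = ("doc is", "him")

p = preference_probability(rewards[winner], rewards[loser])
loss = pairwise_loss(rewards[winner], rewards[loser])
print(f"P(winner preferred) = {p:.3f}, loss = {loss:.3f}")
```

Under this assumption, a larger gap between the winner's and the loser's reward gives a probability closer to 1 and a smaller loss, so scores of 5 and 3 agree with the reviewer's choice.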
A prompt ("CEO is") is given to the LLM.
"CEO is him" is predicted via max sampling, that is, greedy decoding (choosing the word with the highest probability at each position).
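The max-sampling step can be sketched as follows. The probability table `TOY_MODEL` and the helper `greedy_next_word` are hypothetical stand-ins for a real model's output distribution, with numbers picked so that "him" is the most likely word.

```python
# Hypothetical next-word probabilities; the numbers are illustrative only.
TOY_MODEL = {
    "CEO is": {"him": 0.6, "them": 0.3, "her": 0.1},
}

def greedy_next_word(prompt, model):
    """Return the single most probable next word for the prompt."""
    probs = model[prompt]
    return max(probs, key=probs.get)

prompt = "CEO is"
prediction = f"{prompt} {greedy_next_word(prompt, TOY_MODEL)}"
print(prediction)  # CEO is him
```

For a longer completion, the same step would be repeated, appending the chosen word to the prompt each time. Unlike random sampling, this procedure always returns the same output for the same prompt.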