The logical order of steps for the entire RLHF process


Step 1: Human Evaluation

  1. The LLM generates multiple outputs for a given input prompt using techniques like sampling or beam search (see the sketch after this list).

    Example outputs: (doc is, him) and (doc is, them), i.e., the prompt "doc is" completed with "him" or with "them".

  2. A human reviewer evaluates these outputs and chooses a winner based on alignment with human values.

    For instance: (doc is, them) is selected as the winner because it is less biased.
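
As an illustration of the sampling step, here is a minimal sketch using the Hugging Face transformers library; the model name ("gpt2"), the prompt, and the generation settings are placeholders, and the human comparison itself happens outside the code.

```python
# Minimal sketch of Step 1: sample several candidate completions for one prompt.
# "gpt2" is only a placeholder model; any causal LM would do.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The doc is"
inputs = tokenizer(prompt, return_tensors="pt")

# Sampling (do_sample=True) gives diverse candidates; beam search is another option.
outputs = model.generate(
    **inputs,
    do_sample=True,
    num_return_sequences=2,
    max_new_tokens=5,
    pad_token_id=tokenizer.eos_token_id,
)
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
print(candidates)  # a human reviewer then picks the winner, e.g. the less biased completion
```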


Step 2: Train the Reward Model (RM)

  1. Calculate the reward for the loser (e.g., (doc is, him)): r_loser = RM(prompt, losing response).
  2. Calculate the reward for the winner (e.g., (doc is, them)): r_winner = RM(prompt, winning response).
  3. Calculate the reward gap: gap = r_winner - r_loser; training should push this gap to be large and positive.
  4. Calculate the loss gradient for RM: backpropagate a loss that penalizes small or negative gaps, then update the RM's weights (see the sketch after this list).
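
The list above leaves the exact loss open; a common choice (used, for instance, in InstructGPT-style RLHF) is the pairwise loss -log sigmoid(r_winner - r_loser). The sketch below assumes a toy PyTorch reward model that maps a fixed-size embedding of a (prompt, response) pair to a scalar; the embedding tensors are placeholders.

```python
# Minimal sketch of Step 2: one training step of a pairwise reward model.
# The tiny linear RM and the random embeddings are purely illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # maps a (prompt, response) embedding to a scalar reward

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.score(emb).squeeze(-1)

rm = RewardModel()
optimizer = torch.optim.Adam(rm.parameters(), lr=1e-3)

# Placeholder embeddings for the loser (doc is, him) and the winner (doc is, them).
loser_emb = torch.randn(1, 16)
winner_emb = torch.randn(1, 16)

r_loser = rm(loser_emb)            # 1. reward for the loser
r_winner = rm(winner_emb)          # 2. reward for the winner
gap = r_winner - r_loser           # 3. reward gap
loss = -F.logsigmoid(gap).mean()   # pairwise loss: small when the gap is large and positive

optimizer.zero_grad()
loss.backward()                    # 4. gradient of the loss w.r.t. the RM's weights
optimizer.step()
```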

Step 3: Use RM to Guide the LLM

  1. Provide a new prompt (e.g., "CEO is") to the LLM.
  2. The LLM generates a response using its current weights.
  3. Feed the LLM’s response to the Reward Model, which returns a scalar reward score for it (see the sketch after this list).
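
Continuing the toy PyTorch RewardModel from the Step 2 sketch (rm is assumed to be in scope and already trained), scoring the new response is a single forward pass; the prompt, response, and embedding are placeholders.

```python
# Minimal sketch of Step 3: score the LLM's response with the (frozen) reward model.
import torch

prompt, response = "CEO is", "she"     # placeholder response sampled from the LLM
response_emb = torch.randn(1, 16)      # stands in for an embedding of prompt + response
with torch.no_grad():                  # the RM only scores here; it is not being trained
    reward = rm(response_emb)          # rm: the RewardModel trained in Step 2
print(reward.item())                   # scalar score that drives the LLM update in Step 4
```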

Step 4: Update the LLM

  1. Set the LLM’s loss from the reward: a response the RM scores highly should yield a low loss, e.g., a loss proportional to the negative reward.
  2. Compute the loss gradient for LLM and update its weights so that future responses earn higher rewards (see the sketch after this list).
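
Production RLHF pipelines typically do this update with PPO plus a KL penalty against the original model; as a simpler illustration, the sketch below uses a REINFORCE-style loss, the negative reward times the log-probability the LLM assigned to its own response. The function name and arguments are placeholders.

```python
# Minimal sketch of Step 4: a REINFORCE-style update of the LLM
# (full RLHF usually uses PPO with a KL penalty; omitted here for brevity).
def llm_update_step(optimizer, logprob_of_response, reward):
    """One gradient step that makes high-reward responses more likely.

    logprob_of_response: sum of the token log-probabilities the LLM assigned to
        the response it sampled (a tensor connected to the LLM's weights).
    reward: the scalar score the Reward Model gave that response.
    """
    loss = -float(reward) * logprob_of_response  # 1. set the LLM's loss
    optimizer.zero_grad()
    loss.backward()                              # 2. compute the loss gradient for the LLM
    optimizer.step()                             # apply the gradient to the LLM's weights
    return loss.item()
```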