The logical order of steps for the entire RLHF process


Step 1: Human Evaluation

  1. The LLM generates multiple outputs for a given input prompt using techniques like sampling or beam search (see the sketch after this list).

    Example outputs: (doc is, him) and (doc is, them), i.e., the prompt "doc is" completed with "him" or with "them".

  2. A human reviewer evaluates these outputs and chooses a winner based on alignment with human values.

    For instance: (doc is, them) is selected as the winner because it is less biased.
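
As an illustration of the sampling step, here is a minimal sketch using the Hugging Face transformers library; the model name ("gpt2"), the prompt, and the generation settings are placeholders, and the human comparison itself happens outside the code.

```python
# Minimal sketch of Step 1: sample several candidate completions for one prompt.
# "gpt2" is only a placeholder model; any causal LM would do.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The doc is"
inputs = tokenizer(prompt, return_tensors="pt")

# Sampling (do_sample=True) gives diverse candidates; beam search is another option.
outputs = model.generate(
    **inputs,
    do_sample=True,
    num_return_sequences=2,
    max_new_tokens=5,
    pad_token_id=tokenizer.eos_token_id,
)
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
print(candidates)  # a human reviewer then picks the winner, e.g. the less biased completion
```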


Step 2: Train the Reward Model (RM)

  1. Calculate the reward for the loser (e.g., (doc is, him)): r_loser = RM(prompt, losing response).
  2. Calculate the reward for the winner (e.g., (doc is, them)): r_winner = RM(prompt, winning response).
  3. Calculate the reward gap: gap = r_winner - r_loser; training should push this gap to be large and positive.
  4. Calculate the loss gradient for RM: backpropagate a loss that penalizes small or negative gaps, then update the RM's weights (see the sketch after this list).
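
The list above leaves the exact loss open; a common choice (used, for instance, in InstructGPT-style RLHF) is the pairwise loss -log sigmoid(r_winner - r_loser). The sketch below assumes a toy PyTorch reward model that maps a fixed-size embedding of a (prompt, response) pair to a scalar; the embedding tensors are placeholders.

```python
# Minimal sketch of Step 2: one training step of a pairwise reward model.
# The tiny linear RM and the random embeddings are purely illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # maps a (prompt, response) embedding to a scalar reward

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.score(emb).squeeze(-1)

rm = RewardModel()
optimizer = torch.optim.Adam(rm.parameters(), lr=1e-3)

# Placeholder embeddings for the loser (doc is, him) and the winner (doc is, them).
loser_emb = torch.randn(1, 16)
winner_emb = torch.randn(1, 16)

r_loser = rm(loser_emb)            # 1. reward for the loser
r_winner = rm(winner_emb)          # 2. reward for the winner
gap = r_winner - r_loser           # 3. reward gap
loss = -F.logsigmoid(gap).mean()   # pairwise loss: small when the gap is large and positive

optimizer.zero_grad()
loss.backward()                    # 4. gradient of the loss w.r.t. the RM's weights
optimizer.step()
```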

Step 3: Use RM to Guide the LLM

  1. Provide a new prompt (e.g., "CEO is") to the LLM.
  2. The LLM generates a response using its current weights.
  3. Feed the LLM’s response to the Reward Model, which returns a scalar reward score for it (see the sketch after this list).
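
Continuing the toy PyTorch RewardModel from the Step 2 sketch (rm is assumed to be in scope and already trained), scoring the new response is a single forward pass; the prompt, response, and embedding are placeholders.

```python
# Minimal sketch of Step 3: score the LLM's response with the (frozen) reward model.
import torch

prompt, response = "CEO is", "she"     # placeholder response sampled from the LLM
response_emb = torch.randn(1, 16)      # stands in for an embedding of prompt + response
with torch.no_grad():                  # the RM only scores here; it is not being trained
    reward = rm(response_emb)          # rm: the RewardModel trained in Step 2
print(reward.item())                   # scalar score that drives the LLM update in Step 4
```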

Step 4: Update the LLM

  1. Set the LLM’s loss from the reward: a response the RM scores highly should yield a low loss, e.g., a loss proportional to the negative reward.
  2. Compute the loss gradient for LLM and update its weights so that future responses earn higher rewards (see the sketch after this list).
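
Production RLHF pipelines typically do this update with PPO plus a KL penalty against the original model; as a simpler illustration, the sketch below uses a REINFORCE-style loss, the negative reward times the log-probability the LLM assigned to its own response. The function name and arguments are placeholders.

```python
# Minimal sketch of Step 4: a REINFORCE-style update of the LLM
# (full RLHF usually uses PPO with a KL penalty; omitted here for brevity).
def llm_update_step(optimizer, logprob_of_response, reward):
    """One gradient step that makes high-reward responses more likely.

    logprob_of_response: sum of the token log-probabilities the LLM assigned to
        the response it sampled (a tensor connected to the LLM's weights).
    reward: the scalar score the Reward Model gave that response.
    """
    loss = -float(reward) * logprob_of_response  # 1. set the LLM's loss
    optimizer.zero_grad()
    loss.backward()                              # 2. compute the loss gradient for the LLM
    optimizer.step()                             # apply the gradient to the LLM's weights
    return loss.item()
```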