
In machine learning, Reinforcement Learning from Human Feedback (RLHF) is a technique that trains a "reward model" directly from human feedback and then uses that model as the reward function for optimizing an agent's policy with reinforcement learning (RL), typically via an algorithm such as Proximal Policy Optimization (PPO).

The reward model is trained before the policy is optimized, learning to predict whether a given output is good (high reward) or bad (low reward). RLHF can improve the robustness and exploration of RL agents, especially when the reward function is sparse or noisy.
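To make the reward-modelling idea concrete, the sketch below trains a small reward model on pairwise human preferences with a Bradley-Terry style loss, a common choice in RLHF pipelines. It is a minimal illustration under stated assumptions: the `RewardModel` here is a stand-in MLP over pooled response features rather than a full language-model backbone, and all names and shapes are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a pooled response representation to a scalar reward.

    In practice the backbone is a pretrained language model; a small
    MLP stands in here so the example stays self-contained.
    """
    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.scorer(features).squeeze(-1)  # shape: (batch,)


def preference_loss(model: RewardModel,
                    chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: the human-preferred ("chosen") response
    should receive a higher scalar reward than the rejected one."""
    reward_chosen = model(chosen)
    reward_rejected = model(rejected)
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()


# Toy training step on random features standing in for response embeddings.
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
chosen = torch.randn(8, 128)    # features of preferred responses
rejected = torch.randn(8, 128)  # features of rejected responses
loss = preference_loss(model, chosen, rejected)
loss.backward()
optimizer.step()
```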

Wikipedia link for RLHF


RLHF typically involves a multi-step process:

  1. Supervised Fine-Tuning: Start by fine-tuning the model on a labeled dataset to ensure it performs adequately on the task.
  2. Feedback Collection: Collect feedback from human evaluators regarding the model’s outputs.
  3. Reward Model Training: Train a separate reward model to predict the collected human feedback (for example, which of two candidate outputs an evaluator preferred).
  4. Policy Optimization: Use reinforcement learning to adjust the model's outputs so they maximize the reward predicted by the reward model (see the sketch after this list).
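As a rough sketch of step 4, the function below applies a clipped PPO-style surrogate loss to a batch of sampled responses, shaping the reward-model scores with a KL penalty toward the supervised reference model; the KL term is what keeps the fine-tuned policy from drifting too far from the supervised model and over-optimizing the learned reward. All tensors, names, and coefficients are illustrative assumptions, not a specific library's API.

```python
import torch

def ppo_step(log_probs_new: torch.Tensor,
             log_probs_old: torch.Tensor,
             log_probs_ref: torch.Tensor,
             rewards: torch.Tensor,
             clip_eps: float = 0.2,
             kl_coef: float = 0.1) -> torch.Tensor:
    """Clipped PPO surrogate loss for a batch of sampled responses.

    `rewards` come from the trained reward model; a KL penalty toward the
    supervised (reference) policy keeps the new policy close to it.
    """
    # Shape the reward: reward-model score minus a crude per-sample KL estimate.
    kl = log_probs_new.detach() - log_probs_ref
    shaped_rewards = rewards - kl_coef * kl

    # Use the normalized shaped reward as the advantage (no value baseline in this toy).
    advantages = (shaped_rewards - shaped_rewards.mean()) / (shaped_rewards.std() + 1e-8)

    ratio = torch.exp(log_probs_new - log_probs_old)  # importance-sampling ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()      # negate to maximize the surrogate


# Toy usage with random log-probabilities and reward-model scores.
batch = 16
loss = ppo_step(
    log_probs_new=torch.randn(batch, requires_grad=True),
    log_probs_old=torch.randn(batch),
    log_probs_ref=torch.randn(batch),
    rewards=torch.randn(batch),
)
loss.backward()
```

In a real pipeline, `log_probs_new` would come from the policy being fine-tuned, `log_probs_old` from the policy that generated the samples, and `log_probs_ref` from the frozen supervised model produced in step 1.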