Reinforcement Learning from Human Feedback (RLHF) is a stage in LLM training designed to improve the quality of responses and reduce the risk of offensive ones. LLMs are typically trained in three stages. The first is pre-training on a massive text corpus with a next-word prediction task: the model sees the preceding words and learns to predict the word that follows. The second is supervised fine-tuning on instruction-output pairs, where a much smaller, curated dataset of instructions and appropriate outputs is used to fine-tune the LLM. The third and final stage is RLHF, in which the model is fine-tuned with a reinforcement learning algorithm such as proximal policy optimization (PPO). For a given prompt, the model generates several outputs (often 4 to 9 responses), and a human ranks them by preference. These rankings are used to train a reward model, which in turn supplies the reward signal used to fine-tune the LLM. Llama 2 has two reward models: one for helpfulness and one for safety.
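To make the ranking step more concrete, here is a minimal sketch of one common way human rankings are turned into a training signal for a reward model: each ranking of K responses is expanded into all (preferred, rejected) pairs, and each pair is scored with a pairwise loss of the form -log(sigmoid(r_preferred - r_rejected)). The function names, the example responses, and the `toy_reward` scorer are hypothetical and used only for illustration; a real reward model would be a neural network trained by gradient descent on this kind of loss.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs.

    combinations() keeps the input order, so the first element of each
    pair is always the response the human ranked higher.
    """
    return list(combinations(ranked_responses, 2))


def pairwise_loss(score_preferred, score_rejected):
    """Pairwise preference loss: -log(sigmoid(score_preferred - score_rejected)).

    Written as log(1 + exp(-diff)), which is algebraically the same.
    The loss is small when the preferred response scores higher.
    """
    diff = score_preferred - score_rejected
    return math.log1p(math.exp(-diff))


def toy_reward(response):
    """Hypothetical stand-in for a learned reward model (assumption:
    here, more words simply means a higher score)."""
    return 0.1 * len(response.split())


# A human has ranked three responses to one prompt, best first.
ranked = ["a detailed, polite answer", "a short answer", "rude"]

pairs = ranking_to_pairs(ranked)
losses = [pairwise_loss(toy_reward(win), toy_reward(lose)) for win, lose in pairs]
mean_loss = sum(losses) / len(losses)

print(f"{len(pairs)} pairs, mean loss = {mean_loss:.4f}")
```

Training the reward model amounts to adjusting its parameters so that this mean loss goes down, that is, so that it scores human-preferred responses higher than rejected ones. The trained reward model then scores new outputs during the reinforcement learning step.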
LLMs are trained on huge volumes of text, much of which contains toxic language, racism, sexism, and other problematic prose. LLMs can also hallucinate: they can invent answers that are not grounded in real-world facts.
RLHF mitigates some of the problems that arise from training on toxic data, as well as some of the tendency of LLMs to hallucinate.
There are competing narratives about how significant a role RLHF plays for LLMs. Yann LeCun has claimed that, even with RLHF, LLMs cannot solve the problem of hallucinations, which he argues are an inevitable by-product of the auto-regressive nature of LLMs. On the other hand, researchers such as Kosinski have reported that LLMs such as GPT-4 can perform close to human level on Theory-of-Mind (ToM) tasks, which they take as evidence that LLMs can track the beliefs and mental states of others.