InstructGPT: Why Bigger Models Still Needed Human Feedback
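InstructGPT showed that fine-tuning on human preference data with RLHF could make a small model more helpful and better aligned than a much larger base model: human labelers preferred outputs from the 1.3B-parameter InstructGPT over those of the 175B-parameter GPT-3.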
Topics
Methods for steering models toward preferred, safer, or more useful behavior.
Alignment · Stanford University
Direct Preference Optimization turns preference tuning into a simple classification-style objective, avoiding an explicit reward model and reinforcement learning loop.
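As a rough illustration of the "classification-style objective", the per-pair DPO loss can be sketched as a logistic loss on the difference between two log-probability ratios. This is a minimal sketch in plain Python, not a training implementation: the function and argument names are hypothetical, the inputs are assumed to be summed sequence log-probabilities of the chosen and rejected responses under the policy and a frozen reference model, and `beta=0.1` is only an illustrative default.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not taken from the source
) -> float:
    """Per-pair DPO loss (hypothetical helper for illustration).

    Each log-ratio measures how much the policy has moved away from the
    reference model on one response. The loss is a binary logistic loss
    that rewards moving further toward the chosen response than toward
    the rejected one.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)
```

No reward model is trained and no sampling loop is run: the only inputs are log-probabilities from two forward passes per response. When the policy equals the reference, the margin is zero and the loss is log 2; it falls below that as the policy shifts probability toward the chosen response relative to the rejected one.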