Fine-Tuning & Adaptation · T-Tech
Trust-Region Behavior Blending: A Warmup Fix for On-Policy Distillation
On-policy distillation wastes teacher supervision on a student's weak early rollouts. TRB blends teacher-like behavior inside a KL trust region during warmup, then anneals it to zero — best average on two math settings.