Reinforcement Learning · Alibaba Qwen Team
Z-Reward: Internalizing Reasoning into Score Distributions for T2I
Z-Reward predicts a distribution over rubric scores instead of a single scalar. A 9B-parameter student hits 88.6% human-preference accuracy with a single output token, and downstream text-to-image (T2I) tuning gains 41.3% net GSB over SFT.
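To make the "distribution instead of one scalar" idea concrete, here is a minimal sketch of how a score distribution could be read off a single output position and collapsed into a reward. It is an illustration under stated assumptions, not the Z-Reward implementation: the 1 to 5 rubric scale, the logit values, and the helper names `score_distribution` and `expected_score` are all hypothetical, and the source does not say how the distribution is reduced to a reward.

```python
import math

def score_distribution(logits):
    """Softmax over the logits of the candidate score tokens.

    `logits` maps each rubric score (assumed here to be 1..5) to the
    model's logit for that score token at the single output position.
    """
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {s: math.exp(v - m) for s, v in logits.items()}
    total = sum(exps.values())
    return {s: e / total for s, e in exps.items()}

def expected_score(dist):
    """Probability-weighted mean score (one possible scalar reward)."""
    return sum(s * p for s, p in dist.items())

# Hypothetical logits for score tokens 1..5 (made-up numbers).
logits = {1: -2.0, 2: -0.5, 3: 1.0, 4: 2.0, 5: 0.5}
dist = score_distribution(logits)
reward = expected_score(dist)

# A flat distribution carries no preference and lands on the midpoint.
uniform = score_distribution({s: 0.0 for s in range(1, 6)})
uniform_reward = expected_score(uniform)
```

Compared with taking only the most likely score, keeping the full distribution preserves how confident the model is: two images that both peak at 4 can still receive different rewards if one puts more mass on 5 and the other on 3.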