Institution

University of Science and Technology of China

The University of Science and Technology of China (USTC) is a leading Chinese research university active in machine learning and AI agents.

Reinforcement Learning · Alibaba Qwen Team

APPO: Agentic Procedural Policy Optimization for RL Agents

APPO branches RL rollouts at high-uncertainty, high-influence tokens instead of tool-call boundaries, lifting Qwen2.5-7B by 3.9 points over ARPO across 13 math, multi-hop, and deep-search benchmarks.

AI Agents · University of Science and Technology of China

Role-Agent: One LLM Plays Both Agent and Its Own Environment

Role-Agent makes a single LLM act as agent and environment at once, generating its own process reward and curriculum. It beats GiGPO by 4.2% on ALFWorld and 6.9% on WebShop with Qwen2.5-1.5B.

Text-to-Image · University of Science and Technology of China

Flow-OPD: On-Policy Distillation Fixes Reward Conflict in Text-to-Image RL

Flow-OPD trains one specialist teacher per reward, then distills them on-policy into one SD 3.5 student, lifting GenEval from 0.63 to 0.92 and OCR from 0.59 to 0.94 without the aesthetic collapse of multi-reward GRPO.

AI Agents · University of Science and Technology of China

Skill1: One RL Policy That Selects, Uses, and Distills Agent Skills

Skill1 trains a single Qwen2.5-7B policy to retrieve, apply, and create reusable skills under one task-outcome reward — reaching 97.5% on ALFWorld, 6.5 points over the strongest RL-only baseline.

Diffusion Models · University of Science and Technology of China

Stream-R1: Reliability-Perplexity Aware Reward Distillation Explained

Stream-R1 reweights DMD losses by video reward scores and per-region perplexity instead of weighting every signal equally. Its 1.3B streaming model scores 84.40 on VBench at 23.1 FPS, beating its 14B teacher's 84.26 for free.

Diffusion Models · University of Science and Technology of China

Stream-T1: Test-Time Scaling for Streaming Video Generation

Stream-T1 adds test-time search to streaming video generation without retraining, lifting VideoAlign motion quality from 0.350 to 0.629 on 5-second clips and cutting the drift that wrecks 30-second clips.