AI Agents · Nanjing University
DRIFT: Pinpointing Where Deep-Research Agents Go Wrong
TELBench asks models to locate the span where a 12-step research trajectory went wrong. DRIFT audits claims against evidence, raising macro-F1 to 54.91% with Claude-Sonnet-4.6, up to 30 percentage points above raw inspection.