Topics
Interpretability
Reverse-engineering what neural networks compute inside — features, circuits, and the mechanisms behind model behavior.
AI Agents · Nanjing University
TELBench asks models to find the span that broke a 12-step research trajectory. DRIFT audits claims against evidence, lifting macro-F1 to 54.91% with Claude-Sonnet-4.6, up to 30 points over raw inspection.
Reinforcement Learning · Tianjin University
When you RL-tune an LLM across math, code, QA, and writing in sequence, math drops from 66.49 to 57.66 even though gradients look orthogonal. A short math refresh pulls it back to 66.04 without wrecking the other three.
Interpretability · Google DeepMind
Gemma Scope is a free, open suite of JumpReLU sparse autoencoders covering every layer of Gemma 2 2B and 9B (plus parts of 27B) — over 400 SAEs and 30M+ features, costing more than 20% of GPT-3's compute to train.
Interpretability · Northeastern University
This work fixes a blind spot in automatic circuit discovery: model components can matter at specific token positions, so position-invariant circuits miss real mechanisms.
Interpretability · EleutherAI
Training a sparse autoencoder on a language model's activations pulls apart 'superposition' into single-meaning features more interpretable than neurons — and lets you edit one concept and watch behavior change.