Interpretability

Reverse-engineering what neural networks compute internally: features, circuits, and the mechanisms behind model behavior.

DRIFT: Pinpointing Where Deep-Research Agents Go Wrong

TELBench asks models to find the span that broke a 12-step research trajectory. DRIFT audits claims against evidence, lifting macro-F1 to 54.91% with Claude-Sonnet-4.6, up to 30 points over raw inspection.

Reinforcement Learning · Tianjin University

Why Multi-Domain RL Forgets, and How a Math Refresh Heals It

When you RL-tune an LLM across math, code, QA, and writing in sequence, math drops from 66.49 to 57.66 even though gradients look orthogonal. A short math refresh pulls it back to 66.04 without wrecking the other three.

Interpretability · Google DeepMind

Gemma Scope: DeepMind's Open SAE Suite for Interpreting Gemma 2

Gemma Scope is a free, open suite of JumpReLU sparse autoencoders covering every layer of Gemma 2 2B and 9B (plus parts of 27B): over 400 SAEs and 30M+ features, trained using more than 20% of GPT-3's training compute.
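
For readers unfamiliar with the JumpReLU activation named in this blurb, here is a minimal sketch of the idea. It passes a pre-activation through unchanged only when it exceeds a per-feature threshold, and outputs zero otherwise. The function name `jump_relu` and the toy values are hypothetical, and the thresholds are fixed by hand here, whereas in a trained SAE they would be learned.

```python
from typing import List


def jump_relu(pre_activation: float, threshold: float) -> float:
    """JumpReLU: keep the value only if it is strictly above the threshold."""
    return pre_activation if pre_activation > threshold else 0.0


def relu(pre_activation: float) -> float:
    """Plain ReLU, shown for comparison."""
    return max(pre_activation, 0.0)


# Toy pre-activations for four hypothetical features.
pre_acts: List[float] = [-0.5, 0.2, 0.8, 1.5]
# Hand-picked thresholds (an assumption for illustration; normally learned).
thresholds: List[float] = [0.5, 0.5, 0.5, 0.5]

jump_out = [jump_relu(z, t) for z, t in zip(pre_acts, thresholds)]
relu_out = [relu(z) for z in pre_acts]

# The weak activation 0.2 survives ReLU but is zeroed by JumpReLU,
# while 0.8 and 1.5 pass through at full magnitude.
print("ReLU:    ", relu_out)
print("JumpReLU:", jump_out)
```

The comparison shows the design intent: small, noisy activations are dropped entirely, while features that do fire keep their full magnitude rather than being shrunk.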

Interpretability · Northeastern University

Position-Aware Circuit Discovery for Language Models

This work fixes a blind spot in automatic circuit discovery: model components can matter at specific token positions, so position-invariant circuits miss real mechanisms.

Interpretability · EleutherAI

Sparse Autoencoders Find Interpretable Features in LLMs

Training a sparse autoencoder on a language model's activations untangles 'superposition' into single-meaning features that are more interpretable than neurons, and lets you edit one concept and watch behavior change.
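
The encode, decode, and edit steps described in this blurb can be sketched in a few lines. This is only an illustration under stated assumptions: the weights below are hand-picked rather than trained, the activation is a 2-dimensional toy vector instead of a real model activation, and all names (`encode`, `decode`, `sae_loss`, `W_ENC`, `W_DEC`) are hypothetical. A real SAE would learn its weights by minimizing the loss shown here over many activations.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Map an activation to non-negative feature activations (ReLU)."""
    return [max(p + b, 0.0) for p, b in zip(matvec(w_enc, x), b_enc)]


def decode(f: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Reconstruct the activation as a weighted sum of feature directions."""
    return [r + b for r, b in zip(matvec(w_dec, f), b_dec)]


def sae_loss(x: Vector, x_hat: Vector, f: Vector, l1_coeff: float) -> float:
    """Squared reconstruction error plus an L1 penalty that rewards sparsity."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(v) for v in f)
    return recon + l1_coeff * sparsity


# Hand-picked toy weights (assumption): 3 features for a 2-d activation,
# i.e. more features than dimensions.
W_ENC: Matrix = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # (n_features, d_model)
B_ENC: Vector = [0.0, 0.0, 0.0]
W_DEC: Matrix = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]     # (d_model, n_features)
B_DEC: Vector = [0.0, 0.0]

x: Vector = [2.0, 0.0]
features = encode(x, W_ENC, B_ENC)
x_hat = decode(features, W_DEC, B_DEC)
loss = sae_loss(x, x_hat, features, l1_coeff=0.1)

# "Editing a concept": zero out feature 0 and decode again.
edited = list(features)
edited[0] = 0.0
x_edited = decode(edited, W_DEC, B_DEC)

print("features:", features)
print("reconstruction:", x_hat, "loss:", loss)
print("after ablating feature 0:", x_edited)
```

In an actual experiment the edited reconstruction would be patched back into the model's forward pass to observe how its outputs change. That step is omitted here because it depends on the specific model and tooling.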