SkillOpt: Training a Frozen Agent's Skill Text Like a Model
SkillOpt trains a single skill document for a frozen LLM agent with bounded add/delete/replace edits and a held-out gate, lifting GPT-5.5 by +23.5 points in direct chat across six benchmarks.
Quick answer
SkillOpt treats an LLM agent’s skill document as the only thing it trains, leaving model weights frozen. By applying bounded add/delete/replace edits to a short natural-language skill, validating each candidate on a held-out split, and keeping a budget on how much the text can change per step, it raises GPT-5.5 by an average of +23.5 points in direct chat, +24.8 with the Codex harness, and +19.1 with the Claude Code harness. The method is best or tied on all 52 evaluated (model, benchmark, harness) cells.
Why optimize the skill text, not the weights
Most “self-improving agent” work either fine-tunes weights or lets a model rewrite its own prompt with no guardrails, which drifts and forgets. SkillOpt’s framing is the useful part: the skill is the external state of a frozen agent, and that state should be trained with the same discipline as a model — batched evidence, a learning rate, and a validation gate. That reframing turns prompt-tinkering into something closer to optimization with an explicit step size, instead of an LLM endlessly rewriting a document until it overfits the last failure it saw.
How SkillOpt works
The loop is deliberately conservative. An optimizer model reads rollout trajectories — both successes and failures — and proposes structured edits to the skill document. Three controls keep it stable:
- Textual learning rate. A bounded edit budget per step (the paper finds Lt=4 works well) caps how far one skill version moves from the previous one. Moderate budgets beat unbounded rewriting (85.5 vs 84.6 on SearchQA).
- Held-out gate. Candidate skills are validated on a selection split before acceptance, playing the role of a validation set so the skill cannot silently overfit the rollout batch.
- Rejected-edit buffer. Failed proposals are kept as negative feedback so the optimizer does not re-propose the same bad edit.
An epoch-wise slow/meta update preserves longer-horizon patterns rather than chasing the most recent batch. The output is a single portable skill, not a sprawling library and not a weight diff.
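To make the three controls concrete, here is a minimal Python sketch of one update step. It is an illustration under stated assumptions, not the paper's implementation: the names `Edit`, `apply_edits`, and `train_step` are hypothetical, the skill is modeled as a list of lines, the budget is assumed to count edit operations, the gate is assumed to require a strict improvement on the held-out score, and the epoch-wise slow/meta update is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Set


@dataclass(frozen=True)
class Edit:
    """One structured edit to the skill (hypothetical representation)."""
    op: str         # "add", "delete", or "replace"
    index: int      # line position in the skill document
    text: str = ""  # new text for "add" and "replace"


def apply_edits(skill: List[str], edits: Sequence[Edit]) -> List[str]:
    """Return a new skill with the edits applied; the input is not mutated."""
    out = list(skill)
    # Apply from the highest index down so earlier positions stay valid.
    for e in sorted(edits, key=lambda e: e.index, reverse=True):
        if e.op == "add":
            out.insert(e.index, e.text)
        elif e.op == "delete":
            del out[e.index]
        elif e.op == "replace":
            out[e.index] = e.text
        else:
            raise ValueError(f"unknown edit op: {e.op}")
    return out


def train_step(
    skill: List[str],
    proposed: Sequence[Edit],
    score_heldout: Callable[[List[str]], float],
    rejected: Set[Edit],
    budget: int = 4,
) -> List[str]:
    """One conservative update: filter, cap, gate, and record failures."""
    # Rejected-edit buffer: never retry an edit that already failed the gate.
    fresh = [e for e in proposed if e not in rejected]
    # Textual learning rate: keep at most `budget` edits per step.
    candidate_edits = fresh[:budget]
    if not candidate_edits:
        return skill
    candidate = apply_edits(skill, candidate_edits)
    # Held-out gate (assumption: accept only on strict improvement).
    if score_heldout(candidate) > score_heldout(skill):
        return candidate
    rejected.update(candidate_edits)
    return skill
```

In a real run, `proposed` would come from the optimizer model reading rollout trajectories, and `score_heldout` would run the frozen agent with the candidate skill on the selection split. Both are plain callables here so the control flow stays visible.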
Key results
- Direct chat (GPT-5.5): +23.5 points average over the no-skill baseline across six benchmarks — SearchQA 77.7 to 87.3, SpreadsheetBench 41.8 to 80.7, OfficeQA 33.1 to 72.1, DocVQA 78.8 to 91.2, LiveMathematicianBench 37.6 to 66.9, ALFWorld 83.6 to 95.5.
- Codex harness: +24.8 points; Claude Code harness: +19.1 points average, both over no-skill.
- Best or tied on all 52 (model, benchmark, harness) cells, with a +5.4 point edge over an oracle that picks the best per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill.
- Edit economy: final skills are 379 to 1,995 tokens, reached with only 1 to 4 accepted edits per benchmark despite the large gains.
- Ablations: removing the rejected-edit buffer drops 1.6 to 4.6 points; removing the slow/meta update is the worst, a 22.5-point fall on SpreadsheetBench (77.5 to 55.0).
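As a quick sanity check on the direct-chat numbers above, the per-benchmark gains can be recomputed from the listed before/after scores. This is plain arithmetic on the figures quoted in this article; the variable names are illustrative.

```python
# (no-skill score, SkillOpt score) for GPT-5.5 in direct chat, as quoted above.
scores = {
    "SearchQA": (77.7, 87.3),
    "SpreadsheetBench": (41.8, 80.7),
    "OfficeQA": (33.1, 72.1),
    "DocVQA": (78.8, 91.2),
    "LiveMathematicianBench": (37.6, 66.9),
    "ALFWorld": (83.6, 95.5),
}

# Absolute gain per benchmark, rounded to one decimal like the reported scores.
gains = {name: round(after - before, 1) for name, (before, after) in scores.items()}
mean_gain = sum(after - before for before, after in scores.values()) / len(scores)

for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:24s} +{gain:.1f}")
print(f"{'mean':24s} +{mean_gain:.1f}")  # mean: +23.5
```

The mean comes out at +23.5, matching the headline figure. The spread is wide: OfficeQA and SpreadsheetBench each gain about 39 points, while SearchQA gains 9.6.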
Why this matters now
The gains are real and the artifact is tiny — a sub-2,000-token text file moving a benchmark by tens of points is a far cheaper deployment story than fine-tuning. Just as important, the skills are reported as procedural rules (e.g., on SearchQA: infer the expected answer type from the clue, then return the shortest canonical entity), not memorized answers, which is what makes them portable across harnesses. For teams running a frozen frontier model behind an API, this is the kind of improvement you can actually ship.
Limits and open questions
SkillOpt needs scored trajectories and a held-out split, so it fits tasks with automatic verifiers, exact-match metrics, or executable checks, and offers little where “correct” cannot be scored, which is the same wall that outcome-reward RL hits. Training spends extra rollout compute and optimizer-model calls (0.6 to 46.4M training tokens per absolute test point gained); that amortizes when a skill is reused but is unattractive for one-off tasks. And by design it optimizes one portable skill rather than a library, so the authors themselves note a single skill may be too thin for highly heterogeneous domains. The evaluation also leans on GPT-5.5, so how far the gains transfer to weaker base models is less clear from the headline numbers.
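The amortization point can be made concrete with a back-of-envelope calculation. The sketch below uses the upper end of the reported cost range (46.4M training tokens per absolute test point). The 10-point gain and the reuse counts are hypothetical inputs chosen for illustration, not figures from the paper, and `amortized_training_tokens` is an illustrative helper.

```python
def amortized_training_tokens(
    tokens_per_point: float, points_gained: float, reuse_count: int
) -> float:
    """Training tokens attributable to each later use of the trained skill."""
    if reuse_count <= 0:
        raise ValueError("reuse_count must be positive")
    return tokens_per_point * points_gained / reuse_count


UPPER_TOKENS_PER_POINT = 46.4e6  # upper end of the reported range
HYPOTHETICAL_GAIN = 10.0         # assumed gain in points, for illustration only

one_off = amortized_training_tokens(UPPER_TOKENS_PER_POINT, HYPOTHETICAL_GAIN, 1)
reused = amortized_training_tokens(UPPER_TOKENS_PER_POINT, HYPOTHETICAL_GAIN, 100_000)

print(f"one-off task:      {one_off:,.0f} training tokens per use")
print(f"100,000 reuses:    {reused:,.0f} training tokens per use")
```

Under these assumptions the same training run costs 464 million tokens for a single use but about 4,640 tokens per use when spread over 100,000 tasks, which is why reuse decides whether the method pays off.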
FAQ
What does SkillOpt actually train?
SkillOpt trains a single natural-language skill document while keeping the agent’s model weights frozen. An optimizer model proposes bounded add/delete/replace edits to that text, and only edits that survive a held-out validation gate are accepted.
How much does SkillOpt improve GPT-5.5?
SkillOpt lifts GPT-5.5 by an average of +23.5 points in direct chat across six benchmarks, +24.8 points with the Codex harness, and +19.1 points with the Claude Code harness, all measured against a no-skill baseline.
How is SkillOpt different from TextGrad or GEPA?
TextGrad and GEPA also evolve prompts, but SkillOpt adds a textual learning rate (a bounded per-step edit budget), a rejected-edit buffer, and an epoch-wise slow/meta update. With these controls it is best or tied on all 52 evaluated cells, beating an oracle that cherry-picks the best competitor per cell by +5.4 points.
When should I not use SkillOpt?
Skip SkillOpt when the task has no reliable automatic feedback — no verifier, exact-match metric, or executable check — since it relies on scored trajectories and a held-out gate. It is also a poor fit for one-off tasks where the training-token cost cannot be amortized across reuse.
One line: train the skill text like a model — bounded edits, a held-out gate, weights frozen. Read the original paper on arXiv.