Summary

SkillsBench is a benchmark for evaluating how much curated “Skills” (structured documentation provided to agents) improve performance across diverse real-world tasks. Across 86 tasks in 11 domains, curated Skills improve average performance by 16.2 percentage points (pp), with large gains in specialized domains (Healthcare: +51.9pp) and modest gains in already-familiar domains (Software Engineering: +4.5pp). Self-generated Skills provide near-zero improvement. Focused Skills with 2-3 modules outperform comprehensive documentation, and smaller models equipped with good Skills can match larger models without them.

Prerequisites

  • Understanding of LLM agent frameworks and tool use
  • Familiarity with benchmark design and evaluation methodology

Core Idea

The quality and structure of contextual documentation (“Skills”) provided to AI agents has a large, measurable impact on task performance. The key finding is that human-curated, focused Skills outperform both the no-Skills baseline and self-generated Skills. This suggests that, for now, agents cannot substitute self-generated documentation for well-designed human knowledge transfer, and that relevance and focus matter more than breadth of coverage.

Results

Condition                                               Average Performance Gain
Curated Skills (avg)                                    +16.2pp
Software Engineering domain                             +4.5pp
Healthcare domain                                       +51.9pp
Self-generated Skills                                   ~0pp
Smaller model + curated Skills vs larger model alone    Comparable
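
The gains above are reported in percentage points (pp), i.e. absolute differences between pass rates, not relative improvements. A minimal sketch of the distinction follows. The domain names and pass rates are hypothetical placeholders, not figures from the paper, and averaging over domains is an assumption (the paper may aggregate per task or per model instead).

```python
from statistics import mean


def pp_gain(baseline_rate: float, skills_rate: float) -> float:
    """Absolute gain in percentage points between two pass rates (fractions in [0, 1])."""
    return round((skills_rate - baseline_rate) * 100, 1)


def relative_gain(baseline_rate: float, skills_rate: float) -> float:
    """Relative gain in percent; shown only to contrast with percentage points."""
    return round((skills_rate - baseline_rate) / baseline_rate * 100, 1)


# Hypothetical pass rates, NOT taken from the paper.
results = {
    "domain_a": {"no_skills": 0.20, "curated": 0.50},
    "domain_b": {"no_skills": 0.60, "curated": 0.65},
}

# Per-domain gain in percentage points.
gains = {
    domain: pp_gain(r["no_skills"], r["curated"])
    for domain, r in results.items()
}

# Unweighted average over domains (an assumed aggregation scheme).
macro_avg = round(mean(gains.values()), 1)

for domain, gain in gains.items():
    print(f"{domain}: +{gain}pp")
print(f"average: +{macro_avg}pp")
```

With a low baseline, a modest-looking pp gain can correspond to a large relative gain, which is one reason to read a domain-level figure such as Healthcare's alongside its baseline pass rate.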

Limitations

  • 86 tasks across 11 domains is a moderate scale; domain coverage may not be representative
  • The ~0pp result for self-generated Skills may depend on methodology, in particular on how the generation prompts were designed
  • Performance gains may be model-specific; only 7 model configurations were tested
  • The arXiv date (2019-01-02) appears incorrect, since the content references recent LLM agent research; it is likely a metadata artifact

Reproducibility

  • 7308 trajectories generated across 7 model configurations
  • Benchmark tasks documented in the paper
  • Code and dataset availability not confirmed from abstract alone

Connections

  • Consistent with Armin Ronacher’s preference for SKILL.md over comprehensive MCP documentation
  • The healthcare +51.9pp gain matches intuition: specialized knowledge domains benefit most from structured context
  • The finding that a smaller model with Skills is comparable to a larger model without them has cost implications: a cheaper model paired with well-curated Skills may be a substitute for a more expensive one

Raw Excerpt

“Curated Skills improve average agent performance by 16.2 percentage points across diverse tasks. Notably, self-generated Skills provide near-zero improvement — suggesting that the quality and curation of Skills, not just their presence, is what drives the gain.”