SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

This article was generated by AI analysis.

Created: 2019-01-02

Summary

SkillsBench is a benchmark for evaluating how much curated “Skills” (structured documentation provided to agents) improve performance across diverse real-world tasks. Across 86 tasks in 11 domains, curated Skills improve average performance by 16.2 percentage points (pp), with large gains in specialized domains (Healthcare: +51.9pp) and modest gains in domains the models already handle well (Software Engineering: +4.5pp). Self-generated Skills provide near-zero improvement. Focused Skills of 2-3 modules outperform comprehensive documentation, and smaller models equipped with good Skills can match larger models without them.

Prerequisites

Understanding of LLM agent frameworks and tool use
Familiarity with benchmark design and evaluation methodology

Core Idea

The quality and structure of the contextual documentation (“Skills”) provided to AI agents has a large, measurable impact on task performance. The key finding is that human-curated, focused Skills outperform both the no-documentation baseline and self-generated Skills. This suggests that well-designed human knowledge transfer to agents is currently hard to replace, and that relevance and focus matter more than breadth of coverage.

Results

Condition	Average performance gain (vs. no Skills)
Curated Skills, all domains (average)	+16.2pp
Curated Skills, Software Engineering domain	+4.5pp
Curated Skills, Healthcare domain	+51.9pp
Self-generated Skills	~0pp
Smaller model + curated Skills vs. larger model alone	Comparable performance
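
The gains above are reported in percentage points, that is, absolute differences between performance with and without Skills rather than relative changes. A minimal sketch of that arithmetic follows. The function names and the pass rates are illustrative assumptions, not values or code from the paper.

```python
def pp_gain(baseline_rate: float, skills_rate: float) -> float:
    """Absolute gain in percentage points between two pass rates in [0, 1]."""
    return (skills_rate - baseline_rate) * 100.0


def relative_gain(baseline_rate: float, skills_rate: float) -> float:
    """Relative gain in percent; undefined when the baseline is zero."""
    if baseline_rate == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (skills_rate - baseline_rate) / baseline_rate * 100.0


# Hypothetical pass rates, chosen only to illustrate the difference.
baseline, with_skills = 0.25, 0.40
print(f"{pp_gain(baseline, with_skills):.1f} pp")       # absolute difference
print(f"{relative_gain(baseline, with_skills):.1f} %")  # relative change
```

With these made-up numbers, moving from a 25% to a 40% pass rate is a 15 pp gain but a 60% relative improvement, which is why the unit matters when comparing domains with different baselines.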

Limitations

86 tasks across 11 domains is a moderate scale; domain coverage may not be representative
The self-generated Skills result depends on evaluation methodology: the design of the generation prompts could affect the ~0pp figure
Performance gains may be model-specific; only 7 model configurations were tested
The arXiv date (2019-01-02) appears incorrect, since the content references recent LLM agent research; it is likely a metadata artifact
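
To see why the moderate scale matters, note that 86 tasks over 11 domains averages fewer than 8 tasks per domain. The sketch below estimates the sampling noise on a per-domain pass rate under the assumption that each task is an independent pass/fail trial. This is a simplification: the paper's actual scoring and per-domain task counts are not given in this note, and the function name is hypothetical.

```python
import math


def pass_rate_std_error(pass_rate: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate estimated from n_tasks trials."""
    if n_tasks <= 0:
        raise ValueError("n_tasks must be positive")
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


tasks_per_domain = 86 / 11  # average implied by the totals in this note
# Assumed pass rate of 0.5 (the worst case for variance) and a rounded n of 8.
se = pass_rate_std_error(0.5, 8)
print(f"avg tasks per domain: {tasks_per_domain:.1f}")
print(f"std error at n=8: {se * 100:.1f} pp")
```

Under these assumptions, a single domain's pass rate carries an uncertainty on the order of tens of percentage points, so per-domain gains are better read as indicative than as precise estimates.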

Reproducibility

7,308 trajectories were generated across 7 model configurations
Benchmark tasks are documented in the paper
Code and dataset availability could not be confirmed from the abstract alone

Connections

Consistent with Armin Ronacher’s preference for SKILL.md over comprehensive MCP documentation
The +51.9pp Healthcare gain matches intuition: specialized knowledge domains benefit most from structured context
A smaller model with curated Skills performing comparably to a larger model without them has cost implications: a cheaper model paired with better Skills may be sufficient

Raw Excerpt

“Curated Skills improve average agent performance by 16.2 percentage points across diverse tasks. Notably, self-generated Skills provide near-zero improvement — suggesting that the quality and curation of Skills, not just their presence, is what drives the gain.”
