Summary

SmolVLA is Hugging Face’s answer to the “too small to pretrain, too large to ignore” problem. At 450M parameters it sits between ACT (52M) and π₀ (3.5B). It is pretrained on 481 community-contributed SO-100 datasets (~23k episodes, 10.6M frames) and then fine-tuned on task-specific data. Pretraining gives a substantial boost: the real-world success rate on the SO-100 benchmark rises from 51.7% to 78.3%. Fine-tuning needs roughly 50 episodes and about 4 hours on a single A100. Inputs are multi-view camera images, proprioceptive state, and a language instruction; the output is an action chunk, i.e. a short sequence of future actions.

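In brief: at 450M parameters, SmolVLA sits between ACT (52M) and π₀ (3.5B). After pretraining on 481 community-contributed SO-100 datasets, its success rate on the SO-100 benchmark rises from 51.7% to 78.3%. Fine-tuning needs only ~50 episodes and ~4 hours on a single A100.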
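The figures quoted in the summary can be turned into a few rough averages. The sketch below is plain arithmetic on those numbers; the constant and function names are made up for illustration, and because the episode count is only given as "~23k" the per-episode and per-dataset averages are approximate.

```python
# Figures quoted in the summary; everything derived from them is arithmetic.
PRETRAIN_DATASETS = 481
PRETRAIN_EPISODES = 23_000     # "~23k" in the text, so results are approximate
PRETRAIN_FRAMES = 10_600_000   # "10.6M"
SUCCESS_WITHOUT_PRETRAINING = 51.7  # percent
SUCCESS_WITH_PRETRAINING = 78.3     # percent


def corpus_stats() -> dict[str, float]:
    """Rough per-dataset and per-episode averages from the quoted totals."""
    return {
        "episodes_per_dataset": PRETRAIN_EPISODES / PRETRAIN_DATASETS,
        "frames_per_episode": PRETRAIN_FRAMES / PRETRAIN_EPISODES,
    }


def pretraining_gain() -> dict[str, float]:
    """Absolute gain in percentage points and relative gain in percent."""
    absolute = SUCCESS_WITH_PRETRAINING - SUCCESS_WITHOUT_PRETRAINING
    return {
        "percentage_points": round(absolute, 1),
        "relative_percent": round(100 * absolute / SUCCESS_WITHOUT_PRETRAINING, 1),
    }


if __name__ == "__main__":
    print(corpus_stats())
    print(pretraining_gain())
```

On these numbers the average community dataset holds a little under 50 episodes, each a few hundred frames long, and the 26.6-point gain corresponds to a relative improvement of roughly one half over the non-pretrained baseline.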

Key Points

  • Architecture: multi-view camera images + proprioceptive state + language instruction → action expert → action chunk
  • Pretraining data: 481 datasets, ~23k episodes, 10.6M frames, primarily SO-100 demonstrations
  • Fine-tuning cost: at least ~50 episodes; 20k training steps; ~4 h on a single A100; can also be run on Colab
  • Minimum viable dataset: 50 episodes per task (25 was insufficient), with ~10 episodes per task variation
  • Inference command: lerobot-record --policy.path=user/smolvla_finetuned
  • Improvement from pretraining: +26.6 percentage points over task-specific training alone
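The input/output contract in the first bullet can be sketched as plain data types, together with one common way to consume an action chunk: query the policy once, execute the returned actions in order, then query again. Everything here is an illustrative assumption rather than LeRobot's API: `Observation`, `dummy_policy`, and `run_chunked` are hypothetical names, the 6-element state and the chunk size of 4 are arbitrary, and the placeholder policy simply repeats the current state instead of running a model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins; real camera frames and actions would be arrays/tensors.
Image = list
Action = list


@dataclass
class Observation:
    images: dict        # one entry per camera view, e.g. "top", "wrist"
    state: list         # proprioceptive state (e.g. joint positions)
    instruction: str    # natural-language task description


def dummy_policy(obs: Observation, chunk_size: int = 4) -> list:
    """Placeholder for the action expert: returns a chunk of `chunk_size` actions."""
    return [list(obs.state) for _ in range(chunk_size)]


def run_chunked(
    policy: Callable[[Observation], list],
    get_obs: Callable[[], Observation],
    send_action: Callable[[Action], None],
    total_steps: int,
) -> int:
    """Execute `total_steps` actions, querying the policy once per chunk.

    Returns the number of policy queries that were needed.
    """
    queries = 0
    executed = 0
    while executed < total_steps:
        chunk = policy(get_obs())
        if not chunk:
            raise ValueError("policy returned an empty action chunk")
        queries += 1
        for action in chunk:
            if executed >= total_steps:
                break
            send_action(action)
            executed += 1
    return queries


if __name__ == "__main__":
    obs = Observation(
        images={"top": [], "wrist": []},
        state=[0.0] * 6,
        instruction="pick up the cube",
    )
    sent = []
    n_queries = run_chunked(dummy_policy, lambda: obs, sent.append, total_steps=10)
    print(n_queries, len(sent))
```

The point of chunking is visible in the return value: the policy is called far fewer times than there are control steps, which matters when each forward pass is expensive.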

Insights

The 50-episode requirement means roughly 25–50 minutes of teleoperation per new task (assuming 30–60 seconds per episode), which is practical in a lab setting. The key design insight is that pretraining on community data from the same hardware platform (SO-100) transfers strongly to new tasks on that platform. This differs from cross-embodiment transfer, where gains are less reliable.
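The data-collection budget is simple enough to compute directly. The sketch below assumes 30 to 60 seconds per episode (the range implied by the 25–50 minute estimate) and combines the "~10 episodes per variation" guideline with the 50-episode floor by taking the larger of the two; that combination rule and both function names are assumptions for illustration, not something the source prescribes.

```python
def teleop_minutes(episodes: int, seconds_per_episode: float) -> float:
    """Total teleoperation time in minutes, ignoring resets between episodes."""
    return episodes * seconds_per_episode / 60


def episodes_to_record(variations: int, per_variation: int = 10, minimum: int = 50) -> int:
    """Episodes to collect: ~10 per task variation, never below the 50-episode floor."""
    return max(minimum, variations * per_variation)


if __name__ == "__main__":
    for variations in (3, 5, 8):
        n = episodes_to_record(variations)
        low = teleop_minutes(n, 30)
        high = teleop_minutes(n, 60)
        print(f"{variations} variations: {n} episodes, {low:.0f} to {high:.0f} min")
```

With five variations the two guidelines coincide at 50 episodes, or 25 to 50 minutes of recording; beyond that, the per-variation rule dominates and the time budget grows linearly.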

The Colab fine-tuning path matters: it means researchers without A100 access can still fine-tune SmolVLA, lowering the barrier further.

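In brief: the 50-episode requirement implies about 25–50 minutes of teleoperation per new task, which is feasible in a lab setting. Pretraining on community data from the same hardware platform (SO-100) transfers strongly, unlike cross-embodiment transfer, where results are less stable.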

Connections

Raw Excerpt

Pretraining SmolVLA on a corpus of community datasets led to a substantial improvement in real-world performance on the SO-100 robot benchmark, elevating success rates from 51.7% to 78.3%.