LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

This article was generated by AI analysis.

Created: 2026-03-27 Source: https://arxiv.org/abs/2306.03310

Summary

LIBERO is a benchmark for lifelong robot learning that separates two types of knowledge transfer: declarative (object/spatial concepts) and procedural (actions/behaviors). It provides 130 tasks across 4 task suites, plus high-quality human-teleoperated demos. It reports three counterintuitive findings: sequential finetuning beats dedicated lifelong learning algorithms in forward transfer, no single visual encoder dominates across knowledge types, and naive supervised pretraining can hurt downstream performance.

Prerequisites

Lifelong / continual learning — the benchmark is built around the challenge of learning new tasks without forgetting old ones (catastrophic forgetting); understanding this trade-off between plasticity and stability is foundational
Declarative vs. procedural knowledge — declarative: facts about the world (what objects exist, where); procedural: how to act (motor skills, behaviors). LIBERO’s 4 suites isolate these for controlled study
Behavior cloning / imitation learning — all baselines train visuomotor policies from human demonstration data; knowing BC is the standard approach helps contextualize why pretraining effects are surprising

Core Idea

LIBERO’s key design insight is that prior lifelong learning benchmarks in vision/NLP study only declarative knowledge transfer, but robot manipulation requires both declarative and procedural knowledge — and they may transfer differently. The 4-suite structure creates controlled experiments: LIBERO-Spatial (vary spatial layout, same objects/goals), LIBERO-Object (vary objects), LIBERO-Goal (vary goals), and LIBERO-100 (entangled, real-world complexity). This allows clean ablations of which knowledge type a given algorithm actually transfers. LIBERO-90/10 further supports studying pretraining + fine-tuning, which is the dominant paradigm in modern robot learning.

Results

Task suites:

Suite	Tasks	Knowledge type	Description
LIBERO-Spatial	10	Declarative (spatial)	Same objects, vary spatial layout
LIBERO-Object	10	Declarative (object)	Same spatial layout, vary objects
LIBERO-Goal	10	Procedural (goal)	Same scene, vary task goal
LIBERO-100	100	Entangled	Realistic mixed distribution
— LIBERO-90	90	Pretraining split	Used to pretrain a policy
— LIBERO-10	10	Downstream eval	Tests lifelong learning after pretraining
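
The suite breakdown above can be written down as a small registry, which also makes the task arithmetic explicit (10 + 10 + 10 + 100 = 130, with LIBERO-100 split into 90 + 10). This is only a minimal Python sketch: the `Suite` dataclass, `SUITES`, and `LIBERO_100_SPLIT` are hypothetical names for illustration and are not part of the LIBERO codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suite:
    """Hypothetical record for one LIBERO task suite (not the LIBERO API)."""
    name: str
    num_tasks: int
    knowledge: str  # kind of knowledge transfer the suite is meant to isolate


# Counts and knowledge types are copied from the table above.
SUITES = [
    Suite("LIBERO-Spatial", 10, "declarative (spatial)"),
    Suite("LIBERO-Object", 10, "declarative (object)"),
    Suite("LIBERO-Goal", 10, "procedural (goal)"),
    Suite("LIBERO-100", 100, "entangled"),
]

# LIBERO-100 is further split into a pretraining part and a downstream part.
LIBERO_100_SPLIT = {"LIBERO-90": 90, "LIBERO-10": 10}

total_tasks = sum(s.num_tasks for s in SUITES)
split_total = sum(LIBERO_100_SPLIT.values())
print(total_tasks, split_total)
```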

Key findings (qualitative — paper does not provide a single summary table):

Sequential finetuning outperforms dedicated lifelong learning algorithms in forward transfer: the added machinery of EWC, PackNet, etc. does not pay off on this metric
No single visual encoder wins across all suites — ResNet, ViT, and other architectures each have strengths/weaknesses depending on knowledge type
Naive pretraining on LIBERO-90 can hurt LIBERO-10 performance: one plausible reading is that pretraining on too-similar data biases the policy and interferes with new task acquisition
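
To make "forward transfer" and forgetting concrete, the sketch below shows one common way such quantities are computed from a matrix of success rates in the continual-learning literature. It is a simplified illustration and not the paper's exact metric definitions; the function names are hypothetical and the success rates are made-up numbers, not LIBERO results.

```python
def forward_transfer(success):
    """Mean success on each task measured right after that task was learned."""
    k = len(success)
    return sum(success[i][i] for i in range(k)) / k


def forgetting(success):
    """Mean drop on earlier tasks between just-learned and final success."""
    k = len(success)
    if k < 2:
        return 0.0  # nothing can be forgotten with a single task
    drops = [success[i][i] - success[k - 1][i] for i in range(k - 1)]
    return sum(drops) / len(drops)


# success[i][j]: success rate on task j after training through task i.
# Illustrative values only.
success = [
    [0.8, 0.0, 0.0],
    [0.5, 0.9, 0.0],
    [0.3, 0.6, 0.7],
]

fwt = forward_transfer(success)
fgt = forgetting(success)
print(f"forward transfer: {fwt:.2f}, forgetting: {fgt:.2f}")
```

Under this simplified view, a method can score well on forward transfer (high diagonal) while still forgetting earlier tasks (large drops in the last row), which is why the two are reported separately.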

Context in the field (as of 2026): LIBERO is considered “nearly saturated” — SOTA approaches score 95%+ on the standard suites, meaning new papers claiming 96-98% are not meaningfully differentiated. This is noted in the State of VLA Research at ICLR 2026.

Limitations

Author-stated: benchmark is limited to tabletop manipulation with a fixed arm (no mobile manipulation, no humanoids)
Author-stated: procedural generation pipeline is currently constrained to kitchen/tabletop objects
Unstated: by 2026 the standard suites are saturated — LIBERO-100 (especially LIBERO-10) remains more discriminative but is less commonly reported
Unstated: all tasks use a single camera viewpoint and fixed scene structure — doesn’t test visual generalization across viewpoints or lighting

Reproducibility

Code: open-source at https://github.com/Lifelong-Robot-Learning/LIBERO
Datasets: human teleoperation demos for all 130 tasks, downloadable via script or HuggingFace (yifengzhu-hf/LIBERO-datasets)
Compute: PyTorch with CUDA; compatible with standard GPU setups (paper uses 1-2 GPUs for BC baselines)

Insights

The most practically important finding is that sequential finetuning beats dedicated lifelong learning methods — this suggests that the lifelong learning research community has been optimizing for a problem (forgetting) that may not be the binding constraint in robot learning. Instead, forward transfer (learning new tasks efficiently given prior experience) matters more, and simple finetuning handles this adequately at this scale.
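
As a toy illustration of what "sequential finetuning" means, the sketch below keeps training a single model on each task in turn, with no replay buffer, regularizer, or parameter isolation. It uses a one-parameter linear model and plain gradient descent rather than a visuomotor policy, so it neither supports nor refutes the paper's numbers; every name and data point here is hypothetical.

```python
def mse(w, data):
    """Mean squared error of the model y = w * x on a list of (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def finetune(w, data, lr=0.05, steps=200):
    """Gradient descent on MSE for y ~= w * x, starting from the given w."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


xs = [-1.0, -0.5, 0.5, 1.0]
task_a = [(x, 2.0 * x) for x in xs]  # task A: y = 2x
task_b = [(x, 2.5 * x) for x in xs]  # task B: related task, y = 2.5x

# Sequential finetuning: learn A, then simply keep training on B.
w_a = finetune(0.0, task_a)
loss_a_after_a = mse(w_a, task_a)
w_ab = finetune(w_a, task_b)
loss_a_after_b = mse(w_ab, task_a)  # error on A rises again: forgetting
loss_b_after_b = mse(w_ab, task_b)

# Forward transfer: with a small budget, starting from w_a beats scratch on B.
loss_b_warm = mse(finetune(w_a, task_b, steps=10), task_b)
loss_b_scratch = mse(finetune(0.0, task_b, steps=10), task_b)
print(loss_b_warm < loss_b_scratch, loss_a_after_b > loss_a_after_a)
```

In this toy setting the warm-started model reaches task B faster than one trained from scratch, at the cost of losing accuracy on task A. That is the trade-off the paragraph above refers to when it contrasts forward transfer with forgetting.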

The pretraining-hurts result is subtle but important. A plausible explanation is that pretraining on LIBERO-90 creates a strong prior toward behaviors that are similar but not identical to LIBERO-10 tasks, causing negative transfer. This is a warning for the broader trend of “pretrain on everything, fine-tune on target”: the domain shift between pretraining and target tasks needs to be managed carefully.

In 2026 LIBERO is the standard benchmark for VLA evaluation — almost every paper reports it. The saturation problem means LIBERO-100 and particularly LIBERO-10 (the lifelong learning track) are the remaining discriminative splits.
