Summary

ScienceWorld is a benchmark for testing agents’ scientific reasoning abilities in an interactive text environment pitched at the level of an elementary school science curriculum. The authors find that large transformer-based models trained on static text corpora fail to generalize learned science concepts to novel experimental contexts: they can recall facts but cannot reason procedurally. A 1.5M-parameter agent trained interactively for 100k steps outperforms an 11B-parameter model trained statically on millions of expert demonstrations.

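In short: ScienceWorld is a benchmark built on an interactive text environment that tests agents’ scientific reasoning in elementary school science settings. The study finds that large, statically trained models can recite facts but cannot carry out experimental reasoning in new environments, and that a small, interactively trained model far outperforms a large, statically trained one.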

Prerequisites

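  • Text-based interactive environments (TextWorld, TWC) and static science QA datasets (SciQ): ScienceWorld follows the text-game format popularized by TextWorld, although it runs on its own simulation engine rather than on the TextWorld framework; SciQ, by contrast, is a question-answering dataset with no interaction. Understanding both kinds of prior benchmark contextualizes ScienceWorld’s design choices.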
  • Imitation learning vs. RL in NLP — the key experimental contrast is between models trained on demonstrations vs. models trained through environment interaction.
  • Grounded language understanding — the paper’s argument hinges on the distinction between statistical pattern matching and procedural reasoning grounded in physical simulation.
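
The contrast between learning from demonstrations and learning through interaction can be sketched in a few lines. This is a toy illustration only: the observations, actions, reward function, and both learner functions are hypothetical and are not the models or training code from the paper. A lookup-table behavior-cloning policy stands in for static training, and a simple epsilon-greedy value learner stands in for interactive training.

```python
import random
from collections import Counter, defaultdict

# Hypothetical action set for this toy example.
ACTIONS = ["open door", "pick up thermometer"]


def reward(observation, action):
    """Hypothetical environment feedback: 1.0 for the right action, else 0.0."""
    correct = {
        "door is closed": "open door",
        "thermometer on table": "pick up thermometer",
    }
    return 1.0 if correct[observation] == action else 0.0


def behavior_cloning(demos):
    """Static training: copy the most frequent expert action per observation."""
    counts = defaultdict(Counter)
    for obs, act in demos:
        counts[obs][act] += 1
    return {obs: c.most_common(1)[0][0] for obs, c in counts.items()}


def interactive_learning(observations, steps=500, epsilon=0.2, lr=0.5, seed=0):
    """Interactive training: try actions, observe reward, update value estimates."""
    rng = random.Random(seed)
    q = defaultdict(float)  # value estimate for each (observation, action) pair
    for _ in range(steps):
        obs = rng.choice(observations)
        if rng.random() < epsilon:
            act = rng.choice(ACTIONS)  # explore
        else:
            act = max(ACTIONS, key=lambda a: q[(obs, a)])  # exploit
        q[(obs, act)] += lr * (reward(obs, act) - q[(obs, act)])
    return {obs: max(ACTIONS, key=lambda a: q[(obs, a)]) for obs in observations}


if __name__ == "__main__":
    # The demonstrations cover only one of the two situations.
    demos = [("door is closed", "open door")] * 3
    print("cloned policy:", behavior_cloning(demos))
    print("interactive policy:",
          interactive_learning(["door is closed", "thermometer on table"]))
```

The cloned policy has no entry for any observation missing from its demonstrations, while the interactive learner can discover the right action for a new situation by trying it and receiving feedback. Real systems replace both tables with neural networks, but the difference in where the training signal comes from is the same.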

Core Idea

Standard NLP benchmarks test fact retrieval — a model can answer “copper is a conductor” by pattern-matching training text. ScienceWorld forces agents to conduct experiments: given an unknown material, devise and execute steps in a grounded simulation to determine its conductivity. This procedural grounding exposes whether models have internalized causal structure or merely memorized co-occurrences. The empirical result — a tiny interactive agent beating a massive static one — argues that environment grounding, not scale, is the missing ingredient for scientific reasoning.
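
To make the conductivity example concrete, here is a minimal sketch of what "conducting an experiment" means in a text environment. It assumes a made-up toy world: the class `ToyCircuitWorld`, its action strings, and its observation strings are hypothetical and do not reflect the actual ScienceWorld API. The point is that the answer is hidden in the simulation state, so the agent can only obtain it by executing the right sequence of steps and reading the outcome.

```python
from dataclasses import dataclass, field

# Hypothetical component names for this toy example.
COMPONENTS = ("battery", "unknown material", "light bulb")


@dataclass
class ToyCircuitWorld:
    """Toy text environment (not the ScienceWorld API)."""

    conductive: bool  # hidden property the agent must discover
    in_circuit: set = field(default_factory=set)

    def step(self, action: str) -> str:
        """Apply a text action and return a text observation."""
        if action.startswith("connect "):
            item = action[len("connect "):]
            if item not in COMPONENTS:
                return f"You cannot connect {item}."
            self.in_circuit.add(item)
            return f"You connect the {item} to the circuit."
        if action == "look at light bulb":
            closed = self.in_circuit == set(COMPONENTS)
            lit = closed and self.conductive
            return "The light bulb is on." if lit else "The light bulb is off."
        return "Nothing happens."


def run_experiment(env: ToyCircuitWorld) -> bool:
    """Scripted procedure: build the full circuit, then read the bulb."""
    for item in COMPONENTS:
        env.step(f"connect {item}")
    observation = env.step("look at light bulb")
    return observation == "The light bulb is on."


if __name__ == "__main__":
    for hidden in (True, False):
        verdict = run_experiment(ToyCircuitWorld(conductive=hidden))
        print(f"hidden={hidden} -> agent concludes conductive={verdict}")
```

An agent that skips the setup and only looks at the bulb always sees it off, whatever the material, which is why recalling a fact is not enough here: the procedure itself has to be right.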

Results

Metric                       | 1.5M interactive agent (100k steps) | 11B static model (millions of demos)
Task success rate (avg.)     | Higher                              | Lower
Novel context generalization | Better                              | Poor

(Exact numbers not present in the abstract; see full paper for per-task breakdowns.)

Limitations

  • Author-stated: the environment is text-only — visual and physical realism absent; elementary science scope only.
  • Unstated: the 11B baseline may not have been fine-tuned on the task distribution; because the two systems differ in both model size and training paradigm, the comparison cannot cleanly attribute the gap to either factor alone.

Reproducibility

  • Code: the ScienceWorld environment is open source and available on GitHub.
  • Datasets: no standard NLP dataset; the tasks come from the interactive environment itself, which is modeled on an elementary science curriculum.
  • Compute: the 1.5M-parameter interactive agent needs relatively little compute (100k environment steps); the 11B baseline, trained statically on millions of demonstrations, is much more expensive to reproduce.

Insights

This paper’s strongest contribution is reframing scale as insufficient: an 11B model trained on expert demonstrations loses to a 1.5M model trained interactively, despite having roughly 7,000 times as many parameters. This anticipates the later debate about whether LLMs trained on static internet text can reason causally, a debate that has intensified with robotics foundation models. ScienceWorld is an early signal that grounding matters more than parameter count for procedural tasks.

Connections

Raw Excerpt

We find that current models cannot reason about or explain learned science concepts in novel contexts. For instance, models can easily answer what the conductivity of a known material is but struggle when asked how they would conduct an experiment in a grounded environment to find the conductivity of an unknown material.