Summary

This 2021 survey covers scene graph generation (SGG): producing structured graph representations of images that encode objects, their attributes, and pairwise relationships. Scene graphs bridge low-level visual recognition and high-level reasoning tasks such as VQA, image captioning, image retrieval, and image editing. It reviews SGG methods both with and without prior knowledge, catalogues standard datasets, and outlines future research directions.

Prerequisites

  • Object detection — SGG builds on top of region proposal networks and bounding box regression; understanding Faster R-CNN or similar is foundational
  • Graph neural networks (GNNs) — most SGG methods use GNNs to propagate relational context across detected objects; message-passing is the core operation
  • Visual relationship detection — the direct precursor task: identifying subject-predicate-object triplets (e.g., “dog on mat”) in images
  • Knowledge graphs — prior-knowledge-guided SGG incorporates external ontologies; familiarity with RDF/OWL-style structured knowledge helps
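
The subject-predicate-object triplets mentioned above map naturally onto a small graph structure. The following is a minimal sketch, not taken from the survey: the `Triplet` and `SceneGraph` classes and their methods are hypothetical names chosen for illustration, and nodes are plain label strings rather than detected instances with bounding boxes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triplet:
    """One subject-predicate-object relationship, e.g. ("dog", "on", "mat")."""
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneGraph:
    """Hypothetical container: nodes are object labels, edges are triplets."""
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add(self, subject, predicate, obj):
        # Register both endpoints as nodes and store the labeled edge.
        self.nodes.update((subject, obj))
        self.edges.append(Triplet(subject, predicate, obj))

    def relations_of(self, subject):
        """Return (predicate, object) pairs whose subject matches."""
        return [(t.predicate, t.obj) for t in self.edges if t.subject == subject]


graph = SceneGraph()
graph.add("dog", "on", "mat")
graph.add("dog", "next to", "sofa")
print(sorted(graph.nodes))
print(graph.relations_of("dog"))
```

A real SGG pipeline would attach bounding boxes, attributes, and confidence scores to each node and edge; this sketch keeps only the labels to show the shape of the representation.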

Core Idea

Detecting objects alone is insufficient for complex visual reasoning — the relationships between objects carry most of the semantic content. Scene graphs represent images as typed graphs where nodes are object instances and edges are labeled predicates. SGG models jointly detect objects and predict their pairwise relationships, typically via region feature extraction followed by graph-based reasoning that propagates contextual information across the full scene before making final predictions.
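
To make the "propagates contextual information" step concrete, here is a rough sketch of one round of message passing over detected objects. It is an illustrative simplification, not the method of any specific paper in the survey: the function name `message_passing_step`, the mean aggregation, and the fixed 0.5 mixing weight are all assumptions, and real models use learned weights over region features.

```python
def message_passing_step(features, edges, mix=0.5):
    """Blend each node's feature vector with the mean of its neighbors' vectors.

    features: dict mapping node name -> list of floats (all the same length)
    edges:    list of (node_a, node_b) pairs, treated as undirected
    mix:      assumed fixed blending weight; a learned quantity in real models
    """
    # Build an undirected adjacency list.
    neighbors = {node: [] for node in features}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    updated = {}
    for node, vec in features.items():
        nbrs = neighbors[node]
        if not nbrs:
            # Isolated nodes receive no messages and keep their features.
            updated[node] = list(vec)
            continue
        mean = [
            sum(features[n][i] for n in nbrs) / len(nbrs)
            for i in range(len(vec))
        ]
        updated[node] = [(1 - mix) * v + mix * m for v, m in zip(vec, mean)]
    return updated


# Toy region features for three detected objects; "dog" and "mat" are linked.
features = {"dog": [1.0, 0.0], "mat": [0.0, 1.0], "tree": [2.0, 2.0]}
context = message_passing_step(features, [("dog", "mat")])
print(context)
```

After one step, "dog" and "mat" each carry part of the other's features, while the unconnected "tree" is unchanged. A relationship classifier would then score predicates from these context-aware vectors.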

Results

Survey paper — no primary benchmark numbers. Synthesized findings:

  • SGG with prior knowledge (language priors, external ontologies) consistently outperforms purely visual approaches
  • Visual Genome is the dominant benchmark; relationship label quality is uneven across datasets
  • Downstream tasks: image captioning and VQA systems that use a scene graph as an intermediate representation report measurable improvements on standard benchmarks

Limitations

  • Author-stated: no systematic quantitative comparison across consistent benchmarks; field moves fast relative to survey publication cycle
  • Unstated: published 2021, so it predates BLIP and GPT-4V and could not account for CLIP, which appeared the same year. Large vision-language models have since partially absorbed SGG as an intermediate step, reducing its standalone relevance; the survey’s framing assumes SGG is a necessary module rather than an emergent property

Reproducibility

  • Code: not applicable (survey paper)
  • Datasets: Visual Genome, GQA, COCO-Stuff — all publicly available
  • Compute: not applicable

Insights

The 2021 framing — “people want higher-level understanding, not just detection” — anticipated the multimodal AI wave accurately. The irony is that the solution the field converged on (VLMs trained end-to-end on image-text pairs) bypasses explicit scene graph construction entirely. SGG research’s lasting contribution may be the benchmark datasets and the explicit relationship vocabulary rather than the pipeline itself.

Connections

Raw Excerpt

Scene graph is a structured representation of a scene that can clearly express the objects, attributes, and relationships between objects in the scene… people look forward to a higher level of understanding and reasoning about visual scenes.