This article was generated by AI analysis
Created: 2026-03-26 Source: https://arxiv.org/abs/2104.01111
Summary
This 2021 survey comprehensively covers scene graph generation (SGG) — producing structured graph representations of images that encode objects, their attributes, and pairwise relationships. Scene graphs bridge low-level visual recognition and high-level reasoning tasks: VQA, image captioning, image retrieval, and editing. The survey covers SGG methods both with and without prior knowledge, catalogues standard datasets, and outlines future research directions.
Prerequisites
- Object detection — SGG builds on top of region proposal networks and bounding box regression; understanding Faster R-CNN or similar is foundational
- Graph neural networks (GNNs) — most SGG methods use GNNs to propagate relational context across detected objects; message-passing is the core operation
- Visual relationship detection — the direct precursor task: identifying subject-predicate-object triplets (e.g., “dog on mat”) in images
- Knowledge graphs — prior-knowledge-guided SGG incorporates external ontologies; familiarity with RDF/OWL-style structured knowledge helps
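To make the message-passing prerequisite concrete, here is a minimal sketch of one synchronous round over a small directed graph. It is an illustration only: the function name `message_passing_round`, the fixed 50/50 mixing of a node's own vector with the mean of its incoming messages, and the toy feature vectors are all assumptions. Real SGG models replace the fixed averaging with learned transformations.

```python
def message_passing_round(features, edges):
    """Run one synchronous round of mean-aggregation message passing.

    features: dict mapping node_id -> list of floats (all the same length)
    edges:    list of (src, dst) pairs; a message flows from src to dst
    """
    dim = len(next(iter(features.values())))

    # Collect the feature vectors each node receives from its in-neighbours.
    incoming = {node: [] for node in features}
    for src, dst in edges:
        incoming[dst].append(features[src])

    updated = {}
    for node, own in features.items():
        msgs = incoming[node]
        if not msgs:
            # No neighbours: the node keeps its current features.
            updated[node] = list(own)
            continue
        mean_msg = [sum(m[i] for m in msgs) / len(msgs) for i in range(dim)]
        # Hypothetical update rule: equal mix of own state and mean message.
        updated[node] = [0.5 * own[i] + 0.5 * mean_msg[i] for i in range(dim)]
    return updated


# Toy graph: three detected objects with 2-dimensional features.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 0.0]}
edges = [(0, 1), (1, 0), (1, 2)]
updated = message_passing_round(features, edges)
print(updated)
```

Every node reads its neighbours' old features before any node is updated, so the result does not depend on the order in which nodes are visited. Stacking several rounds lets context travel further than one hop.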
Core Idea
Detecting objects alone is insufficient for complex visual reasoning — the relationships between objects carry most of the semantic content. Scene graphs represent images as typed graphs where nodes are object instances and edges are labeled predicates. SGG models jointly detect objects and predict their pairwise relationships, typically via region feature extraction followed by graph-based reasoning that propagates contextual information across the full scene before making final predictions.
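The graph structure described above (object instances as nodes, labeled predicates as edges) can be sketched as a small data structure. This is one possible representation, not the survey's; the class names `ObjectNode` and `SceneGraph`, the field layout, and the bounding-box values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObjectNode:
    """One detected object instance (a node of the scene graph)."""
    node_id: int
    label: str              # object class, e.g. "dog"
    box: tuple              # (x1, y1, x2, y2) in pixels; values are illustrative
    attributes: tuple = ()  # optional attributes, e.g. ("brown",)


@dataclass
class SceneGraph:
    """Typed graph: nodes are object instances, edges are labeled predicates."""
    nodes: dict = field(default_factory=dict)  # node_id -> ObjectNode
    edges: list = field(default_factory=list)  # (subject_id, predicate, object_id)

    def add_object(self, node):
        self.nodes[node.node_id] = node

    def add_relation(self, subject_id, predicate, object_id):
        # Both endpoints must already exist as object instances.
        if subject_id not in self.nodes or object_id not in self.nodes:
            raise KeyError("both endpoints must be added before the relation")
        self.edges.append((subject_id, predicate, object_id))

    def triplets(self):
        """Return (subject label, predicate, object label) triplets."""
        return [
            (self.nodes[s].label, p, self.nodes[o].label)
            for s, p, o in self.edges
        ]


graph = SceneGraph()
graph.add_object(ObjectNode(0, "dog", (40, 60, 200, 220), ("brown",)))
graph.add_object(ObjectNode(1, "mat", (20, 180, 260, 240)))
graph.add_relation(0, "on", 1)
print(graph.triplets())  # [('dog', 'on', 'mat')]
```

Edges refer to node ids rather than class labels, so two instances of the same class (two dogs, say) remain distinct nodes with their own relationships.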
Results
Survey paper — no primary benchmark numbers. Synthesized findings:
- SGG with prior knowledge (language priors, external ontologies) consistently outperforms purely visual approaches
- Visual Genome is the dominant benchmark; relationship label quality is uneven across datasets
- Downstream tasks: image captioning and VQA systems that use a scene graph as an intermediate representation show measurable improvements on standard benchmarks
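As an illustration of the first finding, one simple kind of prior is a co-occurrence statistic: how often each predicate appears for a given (subject class, object class) pair in training annotations. The sketch below assumes that form of prior and blends it with visual scores by a fixed weight. The function names, the blending rule, the weight, and the toy triplets are all hypothetical and are not taken from the survey.

```python
from collections import Counter, defaultdict


def build_predicate_prior(triplets):
    """Estimate P(predicate | subject class, object class) from counts."""
    counts = defaultdict(Counter)
    for subj, pred, obj in triplets:
        counts[(subj, obj)][pred] += 1

    prior = {}
    for pair, pred_counts in counts.items():
        total = sum(pred_counts.values())
        prior[pair] = {p: c / total for p, c in pred_counts.items()}
    return prior


def rerank(visual_scores, prior_for_pair, weight=0.5):
    """Blend visual scores with the prior; unseen predicates get prior 0."""
    return {
        pred: (1 - weight) * score + weight * prior_for_pair.get(pred, 0.0)
        for pred, score in visual_scores.items()
    }


# Toy annotations, invented for this example.
toy_triplets = [
    ("dog", "on", "mat"),
    ("dog", "on", "mat"),
    ("dog", "on", "mat"),
    ("dog", "near", "mat"),
]
prior = build_predicate_prior(toy_triplets)

# Hypothetical visual scores that slightly favour the rarer predicate.
visual_scores = {"on": 0.4, "near": 0.6}
blended = rerank(visual_scores, prior[("dog", "mat")])
best = max(blended, key=blended.get)
print(best)  # on
```

In this toy case the prior overturns a weak visual preference for "near". The same mechanism can entrench frequent predicates at the expense of rare but correct ones, so the weight is a design choice rather than a free win.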
Limitations
- Author-stated: no systematic quantitative comparison across consistent benchmarks; field moves fast relative to survey publication cycle
- Unstated: the survey appeared in 2021, at about the same time as CLIP and before BLIP and GPT-4V, so it does not account for large vision-language models. These models have partially absorbed SGG as an intermediate step, reducing its standalone relevance; the survey’s framing assumes SGG is a necessary module rather than an emergent property of end-to-end vision-language training
Reproducibility
- Code: not applicable (survey paper)
- Datasets: Visual Genome, GQA, COCO-Stuff — all publicly available
- Compute: not applicable
Insights
The 2021 framing — “people want higher-level understanding, not just detection” — anticipated the multimodal AI wave accurately. The irony is that the solution the field converged on (VLMs trained end-to-end on image-text pairs) bypasses explicit scene graph construction entirely. SGG research’s lasting contribution may be the benchmark datasets and the explicit relationship vocabulary rather than the pipeline itself.
Connections
- Visual Question Answering (VQA)
- graph neural network
- knowledge graph
- image captioning
- visual relationship detection
Raw Excerpt
Scene graph is a structured representation of a scene that can clearly express the objects, attributes, and relationships between objects in the scene… people look forward to a higher level of understanding and reasoning about visual scenes.