Robust Task Planning via Failure Detection Using Scene Graph From Multi-View Images

Authors
Haechan Chong
Jongwon Lee
Hyemin Ahn
Venue
RA-L
Year
2025

Abstract

Recent robot task planners utilize large language models (LLMs) or vision-language models (VLMs) as failure detectors. These methods perform well by leveraging semantic reasoning capabilities, but they often assume full environment understanding, which can lead to unreliable planning in complex scenes that lack explicit structural modeling. To address these limitations, we propose a novel multi-view scene understanding framework that explicitly models object-level relationships, enabling failure detection and effective task replanning. Our approach first captures multi-view images for comprehensive scene coverage and generates local 2D scene graphs encoding object identities and relational information. Building on this, we introduce a graph neural network-based model that merges the local 2D scene graphs into a unified representation. The resulting unified scene graph is used to detect task success and identify failure causes. For each sub-task, our framework compares the unified scene graph with the expected scene graph predicted by the LLM during the task planning stage, identifying potential failure causes from their deviations. These causes are then fed back into the LLM to facilitate effective replanning, thereby reducing repetitive failures and enhancing adaptability. We evaluate our framework on five real-world benchmark tasks to demonstrate its applicability. Separately, we compare failure detection and reasoning performance with other methods, showing the benefits of combining multi-view perception with explicit graph-based reasoning.
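The graph-comparison step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes scene graphs are represented as sets of (subject, relation, object) triples, and the function and variable names are hypothetical.

```python
def find_deviations(expected, observed):
    """Compare an expected scene graph (predicted by the LLM at planning
    time) with the observed unified scene graph after a sub-task.

    Both graphs are sets of (subject, relation, object) triples.
    Returns the triples missing from the observation (goals not achieved)
    and the unexpected triples (side effects or failure states).
    """
    missing = expected - observed
    unexpected = observed - expected
    return missing, unexpected


# Example: after a hypothetical "place cup on table" sub-task
expected = {("cup", "on", "table")}
observed = {("cup", "on", "floor")}

missing, unexpected = find_deviations(expected, observed)
# The deviations would then be verbalized as failure causes
# and fed back to the LLM for replanning.
```

In practice the deviation sets would be translated into natural-language failure descriptions before being passed back to the planner.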