Do-Undo Bench: Reversibility for Action Understanding in Image Generation

Abstract: We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating plausible scene transformations driven by real-world actions. Unlike prior work that relies on prompt-based image generation and editing to perform action-conditioned image manipulation, our training hypothesis requires models to simulate the outcome of a real-world action and then reverse it to the original state. This forward-reverse requirement tests genuine cause-and-effect understanding rather than stylistic or semantic edits. We curate a high-quality benchmark of reversible actions from real-world scenarios to enable robust action grounding. Our experiments reveal that current models struggle with action reversibility, highlighting the need to evaluate action understanding. Do-Undo provides an intuitive testbed for evaluating and advancing action-aware generation in multimodal systems that must reason about real-world dynamics.

Comments:	Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2512.13609 [cs.CV]
	(or arXiv:2512.13609v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2512.13609 arXiv-issued DOI via DataCite

Submission history

From: Shweta Mahajan
[v1] Mon, 15 Dec 2025 18:03:42 UTC (28,685 KB)
[v2] Thu, 14 May 2026 17:13:30 UTC (42,412 KB)