Abstract: We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating plausible scene transformations driven by real-world actions. Unlike prior work that relies on prompt-based image generation and editing to perform action-conditioned image manipulation, our training hypothesis requires models to simulate the outcome of a real-world action and then reverse it to the original state. This forward-reverse requirement tests genuine cause-and-effect understanding rather than stylistic or semantic edits. We curate a high-quality benchmark of reversible actions from real-world scenarios to enable robust action grounding. Our experiments reveal that current models struggle with action reversibility, highlighting the need to evaluate action understanding. Do-Undo provides an intuitive testbed for evaluating and advancing action-aware generation in multimodal systems that must reason about real-world dynamics.
| Comments: | Project page: this https URL |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) |
| Cite as: | arXiv:2512.13609 [cs.CV] (or arXiv:2512.13609v2 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2512.13609 (arXiv-issued DOI via DataCite) |
Submission history
From: Shweta Mahajan
[v1] Mon, 15 Dec 2025 18:03:42 UTC (28,685 KB)
[v2] Thu, 14 May 2026 17:13:30 UTC (42,412 KB)
