Abstract: Building generalist embodied agents capable of solving complex real-world tasks remains a fundamental challenge in AI. Multimodal Large Language Models (MLLMs) have significantly advanced the reasoning capabilities of such agents through strong vision-language knowledge and chain-of-thought (CoT) reasoning, yet they remain brittle when faced with challenging out-of-distribution scenarios. To address this, we propose Verifier-Guided Action Selection (VeGAS), a test-time framework designed to improve the robustness of MLLM-based embodied agents through an explicit verification step. At inference time, rather than committing to a single decoded action, VeGAS samples an ensemble of candidate actions and uses a generative verifier to identify the most reliable choice, without modifying the underlying policy. Crucially, we find that using an off-the-shelf MLLM as a verifier yields no improvement, motivating our LLM-driven data synthesis strategy, which automatically constructs a diverse curriculum of failure cases to expose the verifier to a rich distribution of potential errors at training time. Across embodied reasoning benchmarks spanning the Habitat and ALFRED environments, VeGAS consistently improves generalization, achieving up to a 36% relative performance gain over strong CoT baselines on the most challenging multi-object, long-horizon tasks.
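
The test-time selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `select_action`, `sample_action`, `score_action`, and `make_cycling_policy` are hypothetical, the policy and verifier are plain callables standing in for MLLMs, and picking the highest verifier score is an assumption about how "the most reliable choice" is identified.

```python
import itertools
from typing import Callable, List


def select_action(
    sample_action: Callable[[str], str],
    score_action: Callable[[str, str], float],
    observation: str,
    num_candidates: int = 8,
) -> str:
    """Sample candidate actions from a frozen policy and return the one
    the verifier scores highest. The policy itself is never modified."""
    # Draw an ensemble of candidates instead of committing to one decode.
    candidates: List[str] = [sample_action(observation) for _ in range(num_candidates)]
    # Drop duplicates while preserving first-seen order.
    unique = list(dict.fromkeys(candidates))
    # Assumption: the verifier returns a scalar score, higher is better.
    return max(unique, key=lambda action: score_action(observation, action))


def make_cycling_policy(actions: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a stochastic policy: cycles through a fixed list."""
    cycle = itertools.cycle(actions)
    return lambda observation: next(cycle)


if __name__ == "__main__":
    policy = make_cycling_policy(["open drawer", "pick up mug", "open drawer"])
    # Hypothetical verifier: prefers actions that mention the target object.
    verifier = lambda observation, action: 1.0 if "mug" in action else 0.0
    print(select_action(policy, verifier, "kitchen scene", num_candidates=3))
```

In this sketch the verifier is the only component that would need training; the abstract reports that an untrained, off-the-shelf verifier gave no improvement, which is why the quality of `score_action` matters more than the sampling loop.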
| Comments: | CVPR 2026 (Findings) |
| Subjects: | Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2605.12620 [cs.AI] |
| (or arXiv:2605.12620v1 [cs.AI] for this version) | |
| https://doi.org/10.48550/arXiv.2605.12620 arXiv-issued DOI via DataCite |
Submission history
From: Nishad Singhi
[v1] Tue, 12 May 2026 18:08:24 UTC (3,261 KB)
