Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents

View PDF HTML (experimental)

Abstract:Building generalist embodied agents capable of solving complex real-world tasks remains a fundamental challenge in AI. Multimodal Large Language Models (MLLMs) have significantly advanced the reasoning capabilities of such agents through strong vision-language knowledge and chain-of-thought (CoT) reasoning, yet remain brittle when faced with challenging out-of-distribution scenarios. To address this, we propose Verifier-Guided Action Selection (VegAS), a test-time framework designed to improve the robustness of MLLM-based embodied agents through an explicit verification step. At inference time, rather than committing to a single decoded action, VeGAS samples an ensemble of candidate actions and uses a generative verifier to identify the most reliable choice, without modifying the underlying policy. Crucially, we find that using an MLLM off-the-shelf as a verifier yields no improvement, motivating our LLM-driven data synthesis strategy, which automatically constructs a diverse curriculum of failure cases to expose the verifier to a rich distribution of potential errors at training time. Across embodied reasoning benchmarks spanning the Habitat and ALFRED environments, VeGAS consistently improves generalization, achieving up to a 36% relative performance gain over strong CoT baselines on the most challenging multi-object, long-horizon tasks.

Comments:	CVPR 2026 (Findings)
Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.12620 [cs.AI]
	(or arXiv:2605.12620v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.12620 arXiv-issued DOI via DataCite

Submission history

From: Nishad Singhi [view email]
[v1] Tue, 12 May 2026 18:08:24 UTC (3,261 KB)