Abstract: Multimodal large language models (MLLMs) remain unreliable on spatial multiple-choice questions, and their failures are often attributed to poorly attended visual information. In this work, we identify a complementary failure mode, spatial lexical bias: adding a spatial relation word to the answer options can attract the model's decision and make the newly added option likely to be selected. Using nine open-weight MLLMs, we show that this phenomenon is widely observed. In particular, models can answer a binary spatial question correctly, yet consistently select an incorrect third spatial option once it is added to the answer set. We isolate such binary-stable but ternary-fragile cases as diagnostic examples and apply mechanistic interpretability tools to them, revealing that a substantial part of the failure originates on the language side rather than the visual side: visual attention analyses and residual-stream probes show that the correct spatial relation remains internally available on these failures, while irrelevant-option controls, activation patching, and sparse component interventions trace the bias to specific LLM-side channels and neurons. Based on this finding, we show that a lightweight LLM-only DPO update on tiny single-object-pair synthetic data mitigates the bias, lifting four-way robust accuracy by up to 100 points on synthetic data, and by 68.0, 32.6, and 20.1 points on the broader evaluation datasets WhatsUp, SpatialMQA-Direct, and VSR, respectively.
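The selection of binary-stable but ternary-fragile cases described in the abstract can be illustrated with a minimal sketch. The record layout (`Trial`), the field names, and the reading of "consistently" as "in every evaluated option ordering" are assumptions made for illustration only; the abstract does not specify how the authors operationalize these criteria.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    """One question evaluated under two answer sets (hypothetical layout)."""
    correct: str              # ground-truth relation, e.g. "left"
    added: str                # incorrect spatial option added in the ternary set
    binary_preds: List[str]   # model picks with two options, one per ordering
    ternary_preds: List[str]  # model picks with three options, one per ordering


def is_binary_stable_ternary_fragile(t: Trial) -> bool:
    """Flag a diagnostic case under the assumed 'every ordering' criterion."""
    # Binary-stable: the model is right in every two-option ordering.
    binary_stable = bool(t.binary_preds) and all(
        p == t.correct for p in t.binary_preds
    )
    # Ternary-fragile: the model picks the newly added option in every
    # three-option ordering.
    ternary_fragile = bool(t.ternary_preds) and all(
        p == t.added for p in t.ternary_preds
    )
    return binary_stable and ternary_fragile


def select_diagnostic(trials: List[Trial]) -> List[Trial]:
    """Keep only the binary-stable but ternary-fragile trials."""
    return [t for t in trials if is_binary_stable_ternary_fragile(t)]
```

A stricter or looser threshold (for example, a majority of orderings instead of all of them) would be a one-line change to the two `all(...)` checks.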
| Subjects: | Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV) |
| Cite as: | arXiv:2606.01914 [cs.CL] (or arXiv:2606.01914v1 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2606.01914 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Chuang Ma
[v1] Mon, 1 Jun 2026 08:49:47 UTC (7,589 KB)
