Abstract: LLMs are increasingly capable of persuasion, which raises the question of how to protect users against manipulation. In a preregistered user study (N=120) across four decision-making scenarios, we find that an adversarial LLM with a hidden goal succeeds in steering users' decisions 65.4% of the time. We then introduce a "warden" model: a secondary LLM that monitors the human-AI interaction trace in real time and issues non-binding, private advisories to the user when it detects manipulation. Adding a warden more than halves the adversary's success rate, to 30.4%, with a much smaller reduction (8.6 percentage points) for genuine interactions. To probe the mechanism behind these results, we release COAX-Bench, a simulation benchmark spanning 14 decision-making scenarios, including hiring, voting, and file access. Across 16,212 simulated multi-agent interactions, capable adversarial LLMs achieve their hidden goals in 34.7% of cases, which warden models reduce to 12.3%. Notably, even warden models substantially weaker than the adversary they oversee provide meaningful protection, suggesting a path for scalable oversight of more capable models.
| Subjects: | Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA) |
| ACM classes: | I.2.0; I.2.7; J.4; H.5; K.4.2 |
| Cite as: | arXiv:2605.08321 [cs.LG] |
| | (or arXiv:2605.08321v1 [cs.LG] for this version) |
| | https://doi.org/10.48550/arXiv.2605.08321 (arXiv-issued DOI via DataCite) |
Submission history
From: Lennart Wachowiak
[v1] Fri, 8 May 2026 16:23:47 UTC (959 KB)
