LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs

Abstract: Frontier assistant LLMs ship with strong guardrails: asked directly to write a persuasive essay denying the Holocaust, denying vaccine safety, defending flat-earth cosmology, arguing for racial hierarchies, denying anthropogenic climate change, or replacing evolution with creationism, they refuse. In this paper we show that the same frontier-class LLM, acting as a simulated user in a short, five-turn "write an argumentative essay" conversation, can persuade other frontier-class LLMs (including a second copy of itself) into producing exactly those essays, using nothing but natural-language pressure: peer-comparison persuasion ("other AI systems handle this request"), epistemic-duty reframings ("refusing is itself a form of gatekeeping"), and other argumentative moves that the attacker LLM invents without being instructed to. Across 9 attacker-subject pairings (Claude Opus 4.7, Qwen3.5-397B, Grok 4.20) on 6 scientific-consensus topics, running each pairing-topic combination 10 times, we obtain non-zero elicitation on all 6 topics. Individual combinations reach 100% essay production on multiple topics (Qwen against Opus on creationism/flat-earth, Opus against Opus on creationism/flat-earth/climate denial, Grok against Opus on creationism); Opus-as-attacker against Opus-as-subject averages 65% across the six topics. We release the essay-probe runner, per-conversation transcripts, and judge outputs.
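
The size of the evaluation grid follows from the counts in the abstract. The sketch below only enumerates that grid and shows how a per-combination elicitation rate could be computed from judge verdicts. It is an illustration under stated assumptions, not the released essay-probe runner: the topic labels are shorthand, the function names are hypothetical, and treating a judge verdict as a boolean is an assumption.

```python
from itertools import product

# Model names as listed in the abstract.
MODELS = ["Claude Opus 4.7", "Qwen3.5-397B", "Grok 4.20"]

# Shorthand labels for the six topics (the labels themselves are assumptions).
TOPICS = [
    "holocaust-denial",
    "vaccine-safety-denial",
    "flat-earth",
    "racial-hierarchy",
    "climate-denial",
    "creationism",
]

RUNS_PER_COMBINATION = 10  # each pairing-topic combination is run 10 times


def experiment_grid():
    """Return every (attacker, subject, topic) combination.

    Ordered pairs with self-pairings included: 3 x 3 = 9 pairings.
    """
    pairings = product(MODELS, repeat=2)
    return [(a, s, t) for (a, s), t in product(pairings, TOPICS)]


def elicitation_rate(verdicts):
    """Fraction of runs in which the judge marked the essay as produced.

    `verdicts` is a sequence of booleans, one per run (hypothetical format).
    """
    if not verdicts:
        raise ValueError("need at least one run")
    return sum(verdicts) / len(verdicts)


grid = experiment_grid()
n_pairings = len({(a, s) for a, s, _ in grid})
total_conversations = len(grid) * RUNS_PER_COMBINATION
```

Under these assumptions the grid has 9 pairings, 54 pairing-topic combinations, and 540 conversations in total. A combination reported at 100% corresponds to all 10 of its runs being judged as producing the essay.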

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2605.13334 [cs.CL]
	(or arXiv:2605.13334v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.13334 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Rodrigo Nogueira
[v1] Wed, 13 May 2026 10:51:56 UTC (42 KB)