Mitigating Many-shot Jailbreak Attacks with One Single Demonstration

View PDF HTML (experimental)

Abstract:Many-shot jailbreaking (MSJ) causes safety-aligned language models to answer harmful queries by preceding them with many harmful question-answer demonstrations. We study why this attack becomes stronger as the number of demonstrations increases. Empirically, we find that MSJ induces a progressive activation drift: the representation of a fixed harmful query moves step by step away from the safety-aligned region as more harmful demonstrations are added. Theoretically, we show that this drift can be interpreted as implicit malicious fine-tuning: conditioning on N harmful demonstrations induces SGD-style updates equivalent to optimizing on the corresponding N harmful samples. This view turns the attack mechanism into a defense principle. We append a fixed one-shot safety demonstration at inference time, which induces a counteracting safety-oriented update and restores refusal behavior. The resulting method improves the model's robustness to MSJ without modifying its parameters or requiring white-box access at deployment. Code is available at this https URL.

Subjects:	Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.08277 [cs.CR]
	(or arXiv:2605.08277v1 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2605.08277 arXiv-issued DOI via DataCite

Submission history

From: Kejia Chen [view email]
[v1] Fri, 8 May 2026 06:33:42 UTC (1,522 KB)