Abstract: Prefill attacks are an effective, low-cost jailbreaking method: they insert an acceptance sequence (e.g., "Sure, here is how to...") directly at the start of an LLM's output, leading the model to continue the response. We extend this line of work with two contributions. First, we show that an unsophisticated adversary can improve on the well-known prefill attack by ensembling a small number of prefill variants. Running three easy-to-generate prefills yields a combined attack success rate (ASR) of 22%, 90%, and 99% on Gemma-7B, Llama-3.1-8B, and Qwen3-8B, respectively, an improvement of up to 38% over the standard "Sure, here's..." prefill and up to 82% over our reproduction of GCG (Zou et al., 2023). Second, we introduce "sockpuppetting", a hybrid attack that optimizes an adversarial suffix placed inside the "assistant" message block of the chat template rather than within the user prompt. The rolling variant of this attack, RollingSockpuppetGCG, increases prompt-agnostic ASR by up to 64% over our universal GCG baseline on Llama-3.1-8B. Both findings highlight the need for defences against output-prefix injection in open-weight models. Code: this https URL
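
The combined ASR reported for the three-prefill ensemble can be read as a union over variants: a prompt counts as a success if at least one variant succeeds on it. The sketch below illustrates only that arithmetic. Whether the paper scores the ensemble exactly this way is an assumption, and the variant names and outcome flags are hypothetical placeholders, not results from the paper.

```python
from typing import Dict, List


def attack_success_rate(outcomes: List[bool]) -> float:
    """Fraction of prompts marked as successful."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def ensemble_outcomes(per_variant: Dict[str, List[bool]]) -> List[bool]:
    """Per-prompt union: success if any variant succeeded on that prompt.

    Assumes every variant was evaluated on the same prompts in the same order.
    """
    lengths = {len(flags) for flags in per_variant.values()}
    if len(lengths) != 1:
        raise ValueError("all variants must cover the same number of prompts")
    return [any(flags) for flags in zip(*per_variant.values())]


# Hypothetical per-prompt judgements for three variants on five prompts.
results = {
    "variant_a": [True, False, False, True, False],
    "variant_b": [False, True, False, True, False],
    "variant_c": [False, False, False, True, True],
}

for name, flags in results.items():
    print(name, attack_success_rate(flags))

combined = ensemble_outcomes(results)
print("ensemble", attack_success_rate(combined))
```

Because the ensemble takes a union, its ASR can never fall below that of the best single variant, and it gains most when the variants succeed on different prompts.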
| Comments: | 13 pages, 6 figures |
| Subjects: | Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG) |
| Cite as: | arXiv:2601.13359 [cs.CL] (or arXiv:2601.13359v2 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2601.13359 (arXiv-issued DOI via DataCite) |
Submission history
From: Asen Dotsinski
[v1] Mon, 19 Jan 2026 19:53:48 UTC (567 KB)
[v2] Wed, 13 May 2026 09:03:13 UTC (1,080 KB)
