Abstract:Aligned large language models (LLMs) remain vulnerable to jailbreak attacks. Recent mechanistic studies have identified latent features and representation shifts associated with jailbreak success, but they leave a more fundamental question open: why do aligned LLMs remain jailbreakable, and what structural vulnerabilities in the model make this possible? We study this question through a continuous input-transformation view. Our theoretical finding is that aligned models can still exhibit Refusal-Escape Directions (RED): local perturbation directions around a harmful input that shift the model's behavior from refusal to answering while preserving the model's harmful-semantics interpretation. From this perspective, a jailbreak is not only a successful discrete prompt construction, but can also be understood as a refusal-to-answer behavior transition induced by continuously perturbing a harmful input along RED. We then prove that RED can be exactly decomposed into contributions from operator-level sources across the model's operator structure, and identify normalization, residual-wiring, and terminal sources as analytically constrained operator-level sources. To eliminate RED, the shared expressive modules -- self-attention and MLP -- must eliminate the contributions from these analytically constrained sources while preserving the mechanisms that support benign responses. These competing requirements give rise to a conditional safety-utility trade-off. Experiments across multiple models and attack methods empirically analyze RED from two complementary perspectives and show that added token dimensions can expose RED, while successful jailbreaks exhibit refusal-to-answer shifts largely aligned with terminal-source contributions.
| Comments: | 40 pages, 45 figures |
| Subjects: | Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2605.08878 [cs.CR] |
| (or arXiv:2605.08878v1 [cs.CR] for this version) | |
| https://doi.org/10.48550/arXiv.2605.08878 arXiv-issued DOI via DataCite (pending registration) |
Submission history
From: Yu Chen [view email]
[v1]
Sat, 9 May 2026 10:57:36 UTC (2,331 KB)
