Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off

View PDF HTML (experimental)

Abstract:Aligned large language models (LLMs) remain vulnerable to jailbreak attacks. Recent mechanistic studies have identified latent features and representation shifts associated with jailbreak success, but they leave a more fundamental question open: why do aligned LLMs remain jailbreakable, and what structural vulnerabilities in the model make this possible? We study this question through a continuous input-transformation view. Our theoretical finding is that aligned models can still exhibit Refusal-Escape Directions (RED): local perturbation directions around a harmful input that shift the model's behavior from refusal to answering while preserving the model's harmful-semantics interpretation. From this perspective, a jailbreak is not only a successful discrete prompt construction, but can also be understood as a refusal-to-answer behavior transition induced by continuously perturbing a harmful input along RED. We then prove that RED can be exactly decomposed into contributions from operator-level sources across the model's operator structure, and identify normalization, residual-wiring, and terminal sources as analytically constrained operator-level sources. To eliminate RED, the shared expressive modules -- self-attention and MLP -- must eliminate the contributions from these analytically constrained sources while preserving the mechanisms that support benign responses. These competing requirements give rise to a conditional safety-utility trade-off. Experiments across multiple models and attack methods empirically analyze RED from two complementary perspectives and show that added token dimensions can expose RED, while successful jailbreaks exhibit refusal-to-answer shifts largely aligned with terminal-source contributions.

Comments:	40 pages, 45 figures
Subjects:	Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.08878 [cs.CR]
	(or arXiv:2605.08878v1 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2605.08878 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yu Chen [view email]
[v1] Sat, 9 May 2026 10:57:36 UTC (2,331 KB)