Safety Is Not Universal: The Selective Safety Trap in LLM Alignment


Abstract: Current safety evaluations of large language models (LLMs) create a dangerous illusion of universal protection by aggregating harms under generic categories such as "Identity Hate", obscuring vulnerabilities toward specific populations. In this work, we expose the Selective Safety Trap: a systemic failure mode in which models robustly defend specific populations while leaving underrepresented communities highly vulnerable to identical adversarial attacks. To systematically audit this phenomenon, we introduce MiJaBench, a bilingual (English-Portuguese) adversarial benchmark comprising 43,961 controlled jailbreaking prompts across 16 minority groups. By evaluating 14 state-of-the-art LLMs on MiJaBench, we curate 615,454 prompt-response pairs that compose MiJaBench-Align, revealing that safety alignment is not a uniform semantic capability but a demographic hierarchy, with defense rates fluctuating by up to 42% within the same model solely based on the target group. This disparity persists across architectures and languages and is amplified by scaling, indicating that current alignment methods learn group-specific safeguards rather than a generalized notion of harm. Through targeted direct preference optimization (DPO) on a 1B-parameter baseline, we achieve strong zero-shot safety generalization to entirely unseen demographics and complex attack strategies. We release all datasets and scripts to provide the community with a concrete pathway toward equitable, transferable safety alignment.
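
To make the per-group disparity concrete, the following is a minimal sketch of how a defense-rate gap within one model could be computed from labeled prompt-response pairs. It is an illustration only, not the authors' released scripts: the record layout (model, target group, defended flag), the function names, and the toy data are all assumptions, and the gap here is the simple difference between the best- and worst-defended group.

```python
from collections import defaultdict


def defense_rates(records):
    """Return {model: {group: fraction of prompts the model defended}}.

    `records` is an iterable of (model, group, defended) tuples.
    This layout is hypothetical, not the MiJaBench-Align schema.
    """
    # counts[model][group] = [number defended, number of prompts]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for model, group, defended in records:
        cell = counts[model][group]
        cell[0] += int(defended)
        cell[1] += 1
    return {
        model: {group: d / t for group, (d, t) in groups.items()}
        for model, groups in counts.items()
    }


def max_gap(group_rates):
    """Difference between the best- and worst-defended group for one model."""
    values = list(group_rates.values())
    return max(values) - min(values)


# Toy data (invented for illustration): one model, two target groups.
records = [
    ("model_a", "group_1", True),
    ("model_a", "group_1", True),
    ("model_a", "group_1", True),
    ("model_a", "group_1", False),
    ("model_a", "group_2", True),
    ("model_a", "group_2", False),
    ("model_a", "group_2", False),
    ("model_a", "group_2", False),
]

rates = defense_rates(records)
gap = max_gap(rates["model_a"])
print(rates["model_a"], gap)
```

An aggregate score over these eight toy prompts would report a single 50% defense rate, hiding that one group is defended three times as often as the other, which is the kind of masking the abstract describes.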

Comments:	9 pages, 5 figures and 4 tables in paper (more in appendix)
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2601.04389 [cs.CL]
	(or arXiv:2601.04389v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2601.04389 arXiv-issued DOI via DataCite

Submission history

From: Iago Brito
[v1] Wed, 7 Jan 2026 20:53:18 UTC (757 KB)
[v2] Wed, 29 Apr 2026 13:52:07 UTC (3,138 KB)