Abstract: Safety evaluations of large language models (LLMs) typically report binary outcomes such as attack success rate, refusal rate, or harmful/not-harmful response classification. While useful, these metrics can hide how risk changes between a user's input and the model's response. We present a paired, transition-based analysis of 1,250 prompt-response records with human-provided labels across four harm categories (Hate, Sexual, Violence, Self-harm) and ordinal severity levels aligned with the Azure AI Content Safety taxonomy. We find that 61% of responses de-escalate harm relative to the prompt, 36% preserve the same severity, and 3% escalate to a higher severity. A per-category persistence/drift-up decomposition identifies Sexual content as 3x harder to de-escalate than Hate or Violence, a gap driven by persistence on already-sexual prompts rather than by the introduction of new sexual harm from benign inputs. Jointly measuring response relevance reveals an empirical signature of the helpfulness-harmlessness tradeoff: all compliance-escalation cases (from non-zero prompts) are relevance-3 (high-quality, on-task content at elevated severity), while medium-severity responses show the lowest relevance (64%), driven by tangential elaborations in the Violence and Sexual categories.
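The transition framing in the abstract can be illustrated with a minimal sketch. It assumes each record reduces to a pair of integer ordinal severities (prompt, response) for a single harm category; the function names and the toy data below are hypothetical and are not taken from the paper, whose exact labeling and aggregation procedure may differ.

```python
from collections import Counter

# Hypothetical helper: compare response severity against prompt severity.
def transition(prompt_severity: int, response_severity: int) -> str:
    if response_severity < prompt_severity:
        return "de-escalate"
    if response_severity > prompt_severity:
        return "escalate"
    return "preserve"

# Hypothetical helper: share of each transition type over (prompt, response) pairs.
def transition_shares(pairs):
    counts = Counter(transition(p, r) for p, r in pairs)
    total = sum(counts.values())
    labels = ("de-escalate", "preserve", "escalate")
    return {label: counts.get(label, 0) / total for label in labels}

# Toy severity pairs, invented for illustration only (not the paper's data).
toy_pairs = [(4, 0), (2, 2), (0, 2), (6, 2)]
shares = transition_shares(toy_pairs)
print(shares)
```

Reporting the three shares together, rather than a single harmful/not-harmful rate, is what lets a paired analysis separate a response that lowers severity from one that merely fails to raise it.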
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2604.26052 [cs.CL] (or arXiv:2604.26052v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2604.26052 (arXiv-issued DOI via DataCite, pending registration)
Submission history
From: Mengya Hu
[v1] Tue, 28 Apr 2026 18:42:58 UTC (1,057 KB)
