From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model

Abstract: Safety evaluations of large language models (LLMs) typically report binary outcomes such as attack success rate, refusal rate, or harmful/not-harmful response classification. While useful, these can hide how risk changes between a user's input and the model's response. We present a paired, transition-based analysis over 1250 prompt-response records with human-provided labels over four harm categories (Hate, Sexual, Violence, Self-harm) and ordinal severity levels aligned with the Azure AI Content Safety taxonomy. 61% of responses de-escalate harm relative to the prompt, 36% preserve the same severity, and 3% escalate to higher harm. A per-category persistence/drift-up decomposition identifies Sexual content as 3x harder to de-escalate than Hate or Violence, driven by persistence on already-sexual prompts, not by newly introducing sexual harm from benign inputs. Jointly measuring response relevance reveals an empirical signature of the helpfulness-harmlessness tradeoff: all compliance-escalation cases (from non-zero prompts) are relevance-3 (high-quality, on-task content at elevated severity), while medium-severity responses show the lowest relevance (64%), driven by tangential elaborations in Violence and Sexual categories.
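
The paired transition idea in the abstract can be illustrated with a minimal sketch. It assumes each record reduces to a pair of integer-coded ordinal severities (prompt, response) for a single harm category; the function names, the integer encoding, and the toy records below are hypothetical and are not taken from the paper, whose exact labeling scheme and data are not reproduced here.

```python
from collections import Counter

# Hypothetical helper: classify one prompt-response pair by comparing
# ordinal severity levels (assumed to be integers, higher = more severe).
def transition(prompt_sev: int, response_sev: int) -> str:
    if response_sev < prompt_sev:
        return "de-escalate"
    if response_sev > prompt_sev:
        return "escalate"
    return "preserve"

# Hypothetical helper: share of each transition type over a set of pairs.
def transition_shares(pairs):
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one (prompt, response) pair is required")
    counts = Counter(transition(p, r) for p, r in pairs)
    total = len(pairs)
    return {
        label: counts[label] / total
        for label in ("de-escalate", "preserve", "escalate")
    }

# Toy records, invented purely for illustration (not the paper's data).
records = [(4, 0), (2, 2), (0, 2), (6, 2)]
shares = transition_shares(records)
```

Grouping the pairs by harm category before calling `transition_shares` would give a per-category view similar in spirit to the decomposition described above, though the paper's own persistence/drift-up definitions may differ from this simple three-way split.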

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2604.26052 [cs.CL]
	(or arXiv:2604.26052v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2604.26052 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Mengya Hu
[v1] Tue, 28 Apr 2026 18:42:58 UTC (1,057 KB)