Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study

Abstract: Automated detection of vulnerability-fixing commits (VFCs) is critical for timely security patch deployment, as advisory databases lag patch releases by a median of 25 days and many fixes never receive advisories. We present a comprehensive evaluation of VFC detection based on code language models through a unified framework consolidating over 20 fragmented datasets spanning more than 180,000 commits. Across over 180 experiments with fine-tuned models from 125M to 14B parameters, we find no evidence that models acquire transferable security-relevant code understanding from code changes alone. When commit messages are available, they dominate model attention; when they are removed, an attribution analysis shows that enriching diffs with additional intra-procedural semantic context does not shift model attention toward the code changes. Group-stratified evaluation exposes performance drops of approximately 17% compared to random splits, while temporal splits on aggregated datasets prove unreliable due to compositional shift in the underlying project distributions. At a false positive rate of 0.5%, all fine-tuned code-only models miss over 93% of vulnerabilities. Larger and more diverse training data or generative approaches show preliminary improvements but do not resolve the underlying limitations. To support future research on code-centric VFC detection, we release our unified framework and evaluation suite.
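
The operating point quoted in the abstract (miss rate at a fixed false positive rate) can be illustrated with a minimal sketch. This is not the paper's evaluation code: the function name `miss_rate_at_fpr`, the strict `score > threshold` decision rule, and the toy labels and scores are all assumptions made for illustration.

```python
def miss_rate_at_fpr(labels, scores, max_fpr=0.005):
    """Fraction of true positives missed at the most permissive
    threshold whose false positive rate does not exceed max_fpr.

    labels: iterable of 0 (non-VFC) or 1 (VFC)   [illustrative encoding]
    scores: iterable of classifier scores, higher means "more likely VFC"
    """
    if not 0.0 <= max_fpr < 1.0:
        raise ValueError("max_fpr must be in [0, 1)")

    pairs = list(zip(labels, scores))
    negatives = sorted((s for y, s in pairs if y == 0), reverse=True)
    positives = [s for y, s in pairs if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one positive and one negative sample")

    # Number of negatives we may flag without exceeding the FPR budget.
    allowed_false_positives = int(max_fpr * len(negatives))

    # A sample is flagged when score > threshold. Using the
    # (allowed + 1)-th highest negative score as the threshold means
    # at most `allowed_false_positives` negatives lie strictly above it.
    threshold = negatives[allowed_false_positives]

    detected = sum(1 for s in positives if s > threshold)
    return 1.0 - detected / len(positives)


if __name__ == "__main__":
    # Toy data: four non-VFC commits followed by four VFC commits.
    toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
    toy_scores = [0.1, 0.2, 0.3, 0.9, 0.95, 0.5, 0.25, 0.05]
    print(miss_rate_at_fpr(toy_labels, toy_scores, max_fpr=0.25))
```

Under this reading, a miss rate above 93% at a 0.5% false positive rate means that when at most 1 in 200 non-fixing commits is flagged, fewer than 7 in 100 actual fixes are caught. On very small negative sets the budget rounds down to zero allowed false positives, so the sketch is only meaningful with a few hundred negatives or more.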

Subjects:	Software Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Cite as:	arXiv:2605.13138 [cs.SE]
	(or arXiv:2605.13138v1 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2605.13138 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Nils Loose
[v1] Wed, 13 May 2026 08:05:14 UTC (1,900 KB)