ProactBench: Beyond What The User Asked For

Abstract: Most LLM benchmarks score how well a model responds to explicit requests. They leave unmeasured a different conversational ability: noticing and acting on needs the user has implied but not said. We call this \emph{conversational proactivity}. ProactBench decomposes it into three phase-tied types: \textsc{Emergent}, inference from a single disclosed anchor; \textsc{Critical}, synthesis across multiple anchors; and \textsc{Recovery}, grounded forward-looking value after task completion.
We operationalise the benchmark with three agents: a Planner, a User Agent, and an Assistant Model. Their information asymmetries defend against style-confounded scoring, rubric leakage, external-context contamination, and information dumps. The released corpus contains 198 curated dialogues with 624 trigger points across 24 communication styles drawn from a psychometric inventory and audited by an independent LLM judge. Across 16 frontier and open-weight models, \textsc{Recovery} is both difficult and weakly predicted by six standard benchmarks, making it a useful new evaluation signal.

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
MSC classes:	68T50, 68T07, 62-07
ACM classes:	I.2.7; I.2.6
Cite as:	arXiv:2605.09228 [cs.LG]
	(or arXiv:2605.09228v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2605.09228 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Alexander Smola
[v1] Sat, 9 May 2026 23:56:04 UTC (562 KB)