Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation


Abstract: As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scalable alternative to human evaluation, yet its reliability in long-form output evaluation remains underexamined: existing meta-evaluation benchmarks focus mainly on short-form outputs. Compared with short-form evaluation, long-form evaluation is not merely a matter of output length; it often requires judges to handle more complex document-level demands. In this work, we introduce LongJudgeBench, a comprehensive benchmark for evaluating LLM judges on long-form outputs across diverse real-world scenarios and judging protocols. We systematically evaluate a broad range of LLM judges, covering multiple base models and judging settings. Our results reveal a substantial reliability gap: current LLM judges remain unstable across scenarios, and rubrics or references are helpful but not always sufficient. We hope LongJudgeBench will support future research on more robust, context-aware, and human-aligned LLM-as-a-judge methods. Our code is available at this https URL.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2606.01629 [cs.CL]
	(or arXiv:2606.01629v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2606.01629 arXiv-issued DOI via DataCite

Submission history

From: Junjie Chen
[v1] Mon, 1 Jun 2026 03:25:34 UTC (325 KB)