Abstract: As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scalable alternative to human evaluation, yet its reliability in long-form output evaluation remains underexamined: existing meta-evaluation benchmarks focus mainly on short-form outputs. Compared with short-form evaluation, long-form evaluation is not merely a matter of output length; it often requires judges to handle more complex document-level demands. In this work, we introduce LongJudgeBench, a comprehensive benchmark for evaluating LLM judges on long-form outputs across diverse real-world scenarios and judging protocols. We systematically evaluate a broad range of LLM judges, covering multiple base models and judging settings. Our results reveal a substantial reliability gap: current LLM judges remain unstable across scenarios, and rubrics or references are helpful but not always sufficient. We hope LongJudgeBench will support future research on more robust, context-aware, and human-aligned LLM-as-a-judge methods. Our code is available at this https URL.
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2606.01629 [cs.CL] (or arXiv:2606.01629v1 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2606.01629 (arXiv-issued DOI via DataCite) |
Submission history
From: Junjie Chen
[v1] Mon, 1 Jun 2026 03:25:34 UTC (325 KB)
