Auto-ARGUE: LLM-Based Report Generation Evaluation

Authors: William Walden, Marc Mason, Orion Weller, Laura Dietz, John Conroy, Neil Molino, Hannah Recknor, Bryan Li, Gabrielle Kaili-May Liu, Yu Hou, Dawn Lawrie, James Mayfield, Eugene Yang

Abstract: Generation of citation-backed reports is a primary use case for retrieval-augmented generation (RAG) systems. While open-source evaluation tools exist for various RAG tasks, tools designed for report generation are lacking. Accordingly, we introduce Auto-ARGUE, a robust LLM-based implementation of the recently proposed ARGUE framework for report generation evaluation. We present analysis of Auto-ARGUE on the report generation pilot task from the TREC 2024 NeuCLIR track and on two tasks from the TREC 2024 RAG track, showing good system-level correlations with human judgments. Additionally, we release ARGUE-Viz, a web app for visualization and fine-grained analysis of Auto-ARGUE judgments and scores.

Comments:	SIGIR 2026: Demo Track
Subjects:	Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2509.26184 [cs.IR]
	(or arXiv:2509.26184v5 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.2509.26184 arXiv-issued DOI via DataCite

Submission history

From: William Walden
[v1] Tue, 30 Sep 2025 12:41:11 UTC (4,091 KB)
[v2] Wed, 1 Oct 2025 13:05:17 UTC (4,129 KB)
[v3] Sat, 4 Oct 2025 12:48:51 UTC (4,128 KB)
[v4] Fri, 17 Oct 2025 13:06:05 UTC (4,128 KB)
[v5] Wed, 29 Apr 2026 17:14:06 UTC (4,460 KB)