Abstract: Although recent tool-augmented benchmarks involve complex requests, evaluation remains limited to answer matching, neglecting critical trajectory aspects such as efficiency, hallucination, and adaptivity. The most straightforward evaluation method is to compare an agent's trajectory with a ground-truth trajectory, but annotating all valid ground-truth trajectories is prohibitively expensive. To address this, we introduce TRACE, a reference-free framework for the multi-dimensional evaluation of tool-augmented LLMs. By incorporating an evidence bank that accumulates knowledge from preceding steps, TRACE assesses an agent's reasoning trajectory effectively. To validate our framework, we develop a new meta-evaluation dataset of diverse, flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates complex trajectories even with small open-source LLMs. Furthermore, we apply our method to the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and the insights they yield.
| Comments: | International Conference on Machine Learning (ICML) 2026 |
| Subjects: | Artificial Intelligence (cs.AI); Computation and Language (cs.CL) |
| Cite as: | arXiv:2510.02837 [cs.AI] (or arXiv:2510.02837v2 [cs.AI] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2510.02837 (arXiv-issued DOI via DataCite) |
Submission history
From: Wonjoong Kim
[v1] Fri, 3 Oct 2025 09:19:15 UTC (2,058 KB)
[v2] Thu, 14 May 2026 09:14:36 UTC (2,048 KB)
