Authors: Jianghao Lin, Yuanyuan Shi, Xin Peng, Renjie Ding, Hairui Wang, Yuxuan Peng, Bizhe Bai, Weixi Song, Fengshuo Bai, Huacan Chai, Weinan Zhang, Fei Huang, Ying Wen
Abstract: Large language models (LLMs) excel at function calling, but inference scaling has been explored mainly for unstructured generation. We propose an inference-scaling framework for structured outputs that combines fine-grained beam search with ToolPRM, a process reward model that scores each intra-call decision (function name selection and argument filling). We build the first fine-grained intra-call supervision dataset via function masking, rollout collection, and step-level annotation. ToolPRM outperforms outcome and coarse-grained reward models in predictive accuracy and yields consistent test-time gains on multiple function-calling benchmarks. We further show that structured generation follows the principle "explore more but retain less", since early JSON errors are unrecoverable.
| Comments: | ACL 2026 (main) |
| Subjects: | Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2510.14703 [cs.AI] (or arXiv:2510.14703v2 [cs.AI] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2510.14703 (arXiv-issued DOI via DataCite) |
Submission history
From: Jianghao Lin
[v1] Thu, 16 Oct 2025 14:06:03 UTC (282 KB)
[v2] Tue, 28 Apr 2026 18:17:43 UTC (220 KB)
