SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

Authors: Yujiong Shen, Yajie Yang, Zhiheng Xi, Binze Hu, Huayu Sha, Jiazheng Zhang, Qiyuan Peng, Junlin Shang, Jixuan Huang, Yutao Fan, Jingqi Tong, Shihan Dou, Ming Zhang, Lei Bai, Zhenfei Yin, Tao Gui, Xingjun Ma, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang

Abstract: Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities from elementary actions to long-horizon workflows. Our evaluation identifies a critical bottleneck: state-of-the-art models still struggle with complex scientific tool-use, and their performance degrades substantially as interaction horizons extend. To address this, we propose SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories. By fine-tuning on these trajectories, our SciAgent-8B outperforms the significantly larger Qwen3-VL-235B-Instruct while exhibiting positive cross-domain transfer of scientific tool-use capabilities. These results underscore the promising potential of next-generation autonomous scientific agents.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2602.12984 [cs.CL]
	(or arXiv:2602.12984v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2602.12984 arXiv-issued DOI via DataCite

Submission history

From: Yujiong Shen
[v1] Fri, 13 Feb 2026 14:58:18 UTC (1,724 KB)
[v2] Sat, 30 May 2026 12:51:16 UTC (1,727 KB)