Abstract: Large Language Models (LLMs) have become increasingly prevalent in cloud-based platforms, propelled by the introduction of AI-based consumer and enterprise services. LLM inference requests in particular account for up to 90% of total LLM lifecycle energy use, dwarfing training energy costs. The rising volume of LLM inference requests is increasing the environmental footprint of LLM serving, particularly carbon emissions and water consumption. To improve the sustainability of LLM inference serving in cloud datacenter environments, we propose a novel multi-agent game-theoretic reinforcement learning framework called MARLIN to co-optimize time-to-first-token (TTFT), carbon emissions, water usage, and energy costs associated with LLM inference. MARLIN demonstrates a reduction of at least 18% in TTFT, 33% in carbon emissions, 43% in water usage, and 11% in energy costs compared to state-of-the-art LLM inference management frameworks.
| Subjects: | Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG) |
| Cite as: | arXiv:2605.13496 [cs.DC] |
| | (or arXiv:2605.13496v1 [cs.DC] for this version) |
| | https://doi.org/10.48550/arXiv.2605.13496 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Sudeep Pasricha
[v1] Wed, 13 May 2026 13:20:02 UTC (942 KB)
