Abstract: LLM-based agents deliver state-of-the-art performance across tasks but incur high end-to-end latency on edge devices. We introduce Agent-X, a software-only, accuracy-preserving framework that accelerates both the prefill and decode stages of on-device agent workloads. Agent-X has two key components: the first rewrites prompts to exploit prefix caching tailored to agent-specific input-token patterns, and the second enables LLM-free speculative decoding for fast token generation with minimal overhead. On representative agentic workloads, Agent-X achieves a 1.61x end-to-end speedup on real systems with no accuracy loss and can be seamlessly integrated into existing on-device AI agents. To the best of our knowledge, this is the first work to systematically characterize and eliminate latency bottlenecks in on-device agents.
| Comments: | Accepted for publication at MobiSys-2026 |
| Subjects: | Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2605.10380 [cs.AI] (or arXiv:2605.10380v1 [cs.AI] for this version) |
| | https://doi.org/10.48550/arXiv.2605.10380 (arXiv-issued DOI via DataCite, pending registration) |
| Related DOI: | https://doi.org/10.1145/3745756.3809195 |
Submission history
From: Minsoo Rhu
[v1] Mon, 11 May 2026 11:23:38 UTC (3,344 KB)
