Agent-X: Full Pipeline Acceleration of On-device AI Agents


Abstract: LLM-based agents deliver state-of-the-art performance across tasks but incur high end-to-end latency on edge devices. We introduce Agent-X, a software-only, accuracy-preserving framework that accelerates both the prefill and decode stages of on-device agent workloads. Agent-X has two key components: the first rewrites prompts to exploit prefix caching tailored to agent-specific input-token patterns, and the second enables LLM-free speculative decoding for fast token generation with minimal overhead. On representative agentic workloads, Agent-X achieves a 1.61x end-to-end speedup on real systems with no accuracy loss, and it can be seamlessly integrated into existing on-device AI agents. To the best of our knowledge, this is the first work to systematically characterize and eliminate latency bottlenecks in on-device agents.
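
The two ideas named in the abstract can be illustrated with a minimal sketch. It is not Agent-X's implementation, which the abstract does not describe. It assumes that prefix caching reuses work only for the leading tokens two prompts share, and it uses a simple n-gram lookup over the existing context as one possible LLM-free drafter. The names `shared_prefix_len` and `draft_from_context`, and the toy token lists, are hypothetical.

```python
from typing import List, Sequence


def shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the leading tokens two prompts have in common.

    Assumption: a prefix cache can only reuse this leading run.
    """
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def draft_from_context(
    context: Sequence[str], ngram: int = 2, max_draft: int = 4
) -> List[str]:
    """Propose draft tokens without calling any model (hypothetical drafter).

    Finds the most recent earlier occurrence of the last `ngram` tokens and
    returns up to `max_draft` tokens that followed it. The target LLM would
    still have to verify the draft, which is what keeps the output unchanged.
    """
    if len(context) <= ngram:
        return []
    tail = list(context[-ngram:])
    # Scan backwards, skipping the trailing occurrence itself.
    for start in range(len(context) - ngram - 1, -1, -1):
        if list(context[start:start + ngram]) == tail:
            return list(context[start + ngram:start + ngram + max_draft])
    return []


# Toy prompts: the same stable content, with a volatile field placed
# either first or after the stable part.
stable = ["SYS", "TOOLS"]
volatile_first = (["time=1"] + stable + ["user:hi"],
                  ["time=2"] + stable + ["user:bye"])
stable_first = (stable + ["time=1", "user:hi"],
                stable + ["time=2", "user:bye"])

reuse_before = shared_prefix_len(*volatile_first)
reuse_after = shared_prefix_len(*stable_first)

# Toy agent context with a repeated tool-call pattern.
history = ["call", "get_weather", "(", "city", ")", "call", "get_weather"]
draft = draft_from_context(history)
```

In this toy case, moving the volatile field behind the stable content raises the reusable prefix from 0 to 2 tokens, and the drafter proposes `["(", "city", ")", "call"]` by copying what followed the previous `call get_weather`. Agentic prompts tend to repeat tool schemas and call patterns, which is why both tricks are plausible for such workloads.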

Comments:	Accepted for publication at MobiSys-2026
Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.10380 [cs.AI]
	(or arXiv:2605.10380v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.10380 arXiv-issued DOI via DataCite (pending registration)
Related DOI:	https://doi.org/10.1145/3745756.3809195

Submission history

From: Minsoo Rhu
[v1] Mon, 11 May 2026 11:23:38 UTC (3,344 KB)