Abstract: Conventional transformer inference engines are request-driven, paying an O(n) prefill cost on every query. In streaming workloads, where data arrives continuously and queries probe an ever-growing context, this cost is prohibitive. We introduce a data-driven computational model centred on stateful sessions: a persistent KV cache advanced incrementally as new data arrives, so prefill is moved off the critical path and query latency becomes O(|q|), independent of accumulated context size. Building on this, Flash Queries reclaim idle GPU cycles between data arrivals to pre-evaluate registered questions and return cached answers before the user asks, a pattern that is structurally impossible in stateless engines because they discard intermediate state between requests. A multi-tenant continuous-batching scheduler with cell-budget admission and prefix-aware grouped prefill lets dozens of stateful sessions coexist on a single GPU while preserving full quadratic self-attention. On streaming market-data benchmarks the reference implementation achieves up to 5.9x speedup over conventional inference engines (vLLM, SGLang, TensorRT-LLM, this http URL), holding query latency constant as accumulated context grows.
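The contrast between request-driven and stateful serving can be illustrated with a toy cost model. The sketch below is not the paper's implementation: `StatefulSession`, `ingest`, `register`, `idle_tick`, `query`, and `stateless_query_cost` are hypothetical names, the "KV cache" is a plain Python list, and the only quantity tracked is how many tokens are run through an imaginary model on the query path.

```python
from dataclasses import dataclass, field


def stateless_query_cost(context_tokens, question):
    """Request-driven baseline: every query re-prefills the whole context."""
    return len(context_tokens) + len(question.split())


@dataclass
class StatefulSession:
    """Toy stand-in for a session with a persistent KV cache (hypothetical).

    A real engine would hold per-layer key/value tensors on the GPU;
    here the cache is a list of tokens and we only count work.
    """
    kv_cache: list = field(default_factory=list)
    flash_questions: list = field(default_factory=list)
    flash_answers: dict = field(default_factory=dict)
    tokens_processed: int = 0  # tokens run through the "model" so far

    def ingest(self, tokens):
        """Advance the cache incrementally as data arrives (off the query path)."""
        tokens = list(tokens)
        self.kv_cache.extend(tokens)
        self.tokens_processed += len(tokens)
        self.flash_answers.clear()  # pre-computed answers are stale after new data

    def register(self, question):
        """Register a question to be pre-evaluated during idle time."""
        self.flash_questions.append(question)

    def idle_tick(self):
        """Spend idle cycles answering registered questions ahead of time."""
        for q in self.flash_questions:
            if q not in self.flash_answers:
                self.flash_answers[q] = self._answer(q)

    def _answer(self, question):
        # Only the question tokens are processed; the cache is reused, not rebuilt.
        q_tokens = question.split()
        self.tokens_processed += len(q_tokens)
        return f"answer over {len(self.kv_cache)} cached positions"

    def query(self, question):
        """Return (answer, tokens processed on the critical path)."""
        if question in self.flash_answers:
            return self.flash_answers[question], 0  # served from the cached answer
        before = self.tokens_processed
        answer = self._answer(question)
        return answer, self.tokens_processed - before
```

In this model a four-word question costs four tokens on the query path whether the session has ingested a thousand tokens or ten thousand, while `stateless_query_cost` grows with the context. A registered question that has been pre-evaluated by `idle_tick` costs zero until the next `ingest` invalidates it. The sketch counts tokens only; it does not model attention compute, memory limits, or the multi-tenant scheduler described in the abstract.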
| Subjects: | Machine Learning (cs.LG) |
| Cite as: | arXiv:2605.13784 [cs.LG] |
| | (or arXiv:2605.13784v1 [cs.LG] for this version) |
| | https://doi.org/10.48550/arXiv.2605.13784 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Victor Norgren
[v1] Wed, 13 May 2026 17:06:15 UTC (54 KB)
