Abstract: The optimal kernel configuration for Mixture-of-Experts (MoE) inference depends on both batch size and the expert routing distribution, yet production systems dispatch from batch size alone, leaving 10-70% of kernel throughput unrealized. We present RaMP, a routing-aware dispatch framework. A performance-region analysis derives, from hardware constants alone, when each optimization helps, correctly predicting all 8 tested architectures, including 3 unseen. A four-parameter wave cost model selects the fastest configuration from the runtime expert histogram, achieving 0.93% mean regret versus exhaustive search, and is fitted from just 10-24 minutes of one-time profiling per model. Because the model depends only on CTA grid geometry, it is kernel-agnostic: applied to Alpha-MoE, it delivers a 1.14x speedup with no source modification. Paired with a co-designed CuTe DSL kernel exposing 134-268 polymorphic configurations, RaMP delivers a 1.22x kernel speedup over static dispatch and end-to-end speedups in vLLM serving of 1.30x over Triton, 1.41x over DeepGEMM, and 1.13x over FlashInfer CUTLASS.
| Comments: | 10 pages, 8 figures, 9 tables. Preprint |
| Subjects: | Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC) |
| Cite as: | arXiv:2604.26039 [cs.LG] |
| | (or arXiv:2604.26039v1 [cs.LG] for this version) |
| | https://doi.org/10.48550/arXiv.2604.26039 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Vyom Sharma
[v1] Tue, 28 Apr 2026 18:20:12 UTC (280 KB)
