I did something this week that I assumed would be a slow, frustrating downgrade: I unplugged Claude Code from Anthropic’s cloud and pointed it at a model running entirely on my own MacBook. No API key. No per-token bill. No data leaving the machine. I expected a toy. Instead I got a server that pushed 525 tokens per second on a small model, scaled to 4.3x aggregate throughput at 16 concurrent requests, and beat llama.cpp — the undisputed king of on-device inference — by up to 87% on Apple’s own Metal backend. The project is called vllm-mlx, and it shouldn’t be this good.
That last part is the surprise. llama.cpp is the gold standard for running models locally. It is battle-tested, it has the broadest model support of anything in the ecosystem, and it has years of hand-tuned Metal kernels behind it. A scrappy MLX-based server written to bring vLLM-style serving to Apple Silicon is not supposed to walk in and win on throughput across every model I threw at it. But that is exactly what the benchmarks — and my own test run — show.
Here is everything I found after running vllm-mlx across six models, a concurrency sweep, and a multimodal cache test on an M4 Max.
