Interesting article.
IMHO, the author could have done two things better:
- vllm instead of llama.cpp. With NVIDIA HW, there is a huge difference in multi-user loads and caching with vllm; when he was complaining about what happens when more than one user uses the model, and about losing caching, my reaction was "well, duh".
- The budget he used for a single card could instead have been put to far, far better use with SPARKs. I have access to a cluster of 2 x GX10 - total cost less than half what he paid, even today - and I am running vllm and Deepseek v4 Flash. The difference compared to any Qwen is tremendous - I've NEVER seen it loop, and in all my experiments so far, it's the most Sonnet-y model I've ever tried (antirez seems to agree, hence his ds4 fork).
If you're wondering how I set it up on the 2 GX10s: https://forums.developer.nvidia.com/t/deepseek-v4-flash-offi...
Performance: 2K t/s prefill (very useful for feeding tons of source code into its massive context window) and around 50-60 t/s generation in my coding sessions in the pi.dev harness. With the money the author paid, he could have bought 4 GX10s and doubled both numbers (vllm scales almost linearly with tensor parallelism).
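To put those rates in perspective, here is a rough back-of-envelope sketch in Python. The prompt and reply sizes are made-up illustrative values, and the 4 x GX10 rates simply assume the doubling claimed above rather than anything measured.

```python
# Back-of-envelope latency estimate from the figures quoted above.
# PROMPT_TOKENS and REPLY_TOKENS are made-up illustrative sizes, not measurements.

def request_seconds(prompt_tokens, reply_tokens, prefill_tps, gen_tps):
    """Time to ingest the prompt plus time to generate the reply, in seconds."""
    return prompt_tokens / prefill_tps + reply_tokens / gen_tps

PROMPT_TOKENS = 100_000   # hypothetical: a large chunk of source code
REPLY_TOKENS = 1_000      # hypothetical: one medium-sized answer

# Quoted for 2 x GX10: ~2K t/s prefill, 50-60 t/s generation (low end used here).
two_nodes = request_seconds(PROMPT_TOKENS, REPLY_TOKENS, prefill_tps=2_000, gen_tps=50)

# Claimed for 4 x GX10: both rates double (assumes near-linear scaling, unverified).
four_nodes = request_seconds(PROMPT_TOKENS, REPLY_TOKENS, prefill_tps=4_000, gen_tps=100)

print(f"2 x GX10: {two_nodes:.0f} s")
print(f"4 x GX10: {four_nodes:.0f} s")
```

With these assumed sizes, prefill dominates the wait, which is why the prefill rate matters so much when feeding in large amounts of source code.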
We did run vLLM on the 3090s: it measured ~3 tok/s slower on generation for our single-to-few-user pattern, with less flexibility on quant and slower startup (actual minutes vs single-digit seconds). We may do more with it in the future. There isn't unlimited time for us to tinker; I'm sharing our journey (so far) and reasoning.
vLLM is the right call for concurrent batched serving (barrkel's point downthread is spot on), but for our usage pattern llama.cpp is still the better fit.
The Spark/GX10 route is a genuinely different bet though, and I appreciate you sharing your numbers. At the time (several months ago) the consensus was that GX10s were for fine-tuning only, and the performance numbers were very low.
...and the card was never about replacing a Claude Max sub. For the workloads we actually bought it for, it's giving us 140-200 tok/s (which matters).
I hear you on the insane amount of time vllm takes to launch (atlas is a move in the right direction in that regard).
But mostly I wanted to make readers of your article aware that no, if you want to do inference, paying 15K for a single 96GB card almost certainly makes no sense. Buy 4 GX10s with the same money, and enjoy dramatically better models and user scalability.
Regardless - thanks for putting in the effort to share your findings! I keep postponing doing the same... there are tons of things everyone is re-discovering on their own.
Wanna chime in: I recently tried vLLM to consume an NVFP4 Gemma4 safetensors model and see how the batching can show up in nice t/s numbers. It's slow to start, it's Linux only, it doesn't like WSL much, and I ended up with either old or nightly container builds, so I have more or less given up. I appreciate how llama.cpp simply works and does things in a fast and obvious way.