I hear you on the insane amount of time vLLM takes to launch (atlas is a move in the right direction in that regard).
But mostly I wanted to make readers of your article aware that no, if you want to do inference, paying 15K for a single 96GB card almost certainly makes no sense. Buy 4 GX10s for the same money and enjoy dramatically better models and user scalability.
Regardless - thanks for putting in the effort to share your findings! I keep postponing doing the same... there are tons of things everyone is rediscovering on their own.
Wanna chime in: I recently tried vLLM to load an NVFP4 Gemma4 safetensors model and see how the batching shows up in nice t/s numbers. It's slow to start, it's Linux-only, and it doesn't like WSL much. I ended up with either old or nightly container builds, and I've more or less given up. I appreciate how llama.cpp simply works and does things in a fast and obvious way.