I found it interesting that vLLM was dismissed as slower than llama.cpp.
IME vLLM is quite a bit faster than llama.cpp but where it really wipes the floor with it is in batching concurrent load. The downside is that it is dramatically less flexible in terms of tweaking. It gives you very few options for running quantized weights. It takes a lot longer to start up because it optimizes the compute graph. So for single user experimentation on a model that's a bit too big for your box, vLLM is just going to be frustrating.
vLLM is great at continuous batching and model serving in production, but it's a very different beast and much less versatile for the prosumer category (where we sit for our usage)
Dismissed is a strong term, but let me give you some more details.
It took a good 4 minutes plus to load up on the 2x 3090 rig, and served a single request 3 tokens/second slower.
And the worst bit? With all that work - setting it up and tuning it - it still looped. I was hoping "use just vLLM" advice that we get touted everywhere was the silver bullet.
The only thing I'd caution here is that we don't start bashing on llama.cpp like people did with Ollama. It's a very capable tool and for the use-cases we actually want the card for makes more sense.
For a large team replacing their Claude Subs perhaps vLLM is the only option, but you really need to add about 5 more RTX 6000 cards into the mix, so you can load something like GLM 5.2.
Bashing on ollama is totally warranted, since ollama is a UI skin around llama.cpp and that's it. If all you cared about was "I want to run a model and use it via an API" then the only thing it did was give you a GUI to download models (vs browsing HuggingFace yourself and downloading .gguf files yourself) and a GUI with a button labeled "run" (instead of a run.sh or run.bat script launching llama-server).
That's not _nothing_, but it's pretty close to nothing, and for the prosumer crowd it edges towards "just gets in the way".
One could say: vLLM isn't a worse Llama.cpp, it's a different tool
AFAIR the general consensus is (was?): - llama.cpp for single user - vLLM for multi-user (e.g. enterprises)
They are similar, but for different use cases.
Yeah, I was a bit baffled by the author complaining about cache prefixes getting destroyed when more than one user hit the model, but then continuing to use llama.cpp instead of switching to vLLM.