I'm really curious about this, not because I disagree, but because I want to avoid agents going whack. Are you running vllm for yourself only, for a team, for an application, etc.? And do you feel there is a minimum hardware requirement for vllm to be useful in this way?
My weekend project is going to be building a home inference server (from ancient datacenter parts) and I'm still massaging in my head what the end result will be.
If I were starting a server build today, I'd jump right into verified setups and writeups, like this one:
https://github.com/noonghunna/club-3090
You can find info there about running a patched version of vllm on 1x 24 GB, 2x, and 4x setups. There are also quite a few "blackwell" subreddits where people seem to share a lot of substantial information, if you're going the 6000 route.
That writeup is completely unhinged and utterly incomprehensible to follow.
It just throws "you can do <large number>" at you, with no real explanation of how it manages that or which trade-offs are made. I still don't know for certain, but I think one of those trade-offs is 3-bit context? Which is a terrible idea.
Please don't share these walls of noise. They shouldn't exist.
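On the 3-bit context guess: here is a rough back-of-envelope sketch of why a writeup might reach for low-bit context to hit a big number on 24 GB cards. The model shape below (64 layers, 8 KV heads, head dim 128) and the 131072-token context are made-up illustration values, not taken from that repo, and the estimate ignores the per-block scale metadata that real quantization schemes add. It also says nothing about quality, which is the part I'd actually worry about.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens, bits_per_value):
    """Approximate KV cache size for one sequence.

    Counts one key and one value vector per layer, per KV head, per token,
    stored at the given bit width. Quantization scales/zero points are ignored.
    """
    values = 2 * num_layers * num_kv_heads * head_dim * context_tokens
    return values * bits_per_value / 8


# Hypothetical model shape and context length, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128
CONTEXT = 131_072

for bits in (16, 8, 4, 3):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, bits) / 2**30
    print(f"{bits:>2}-bit context at {CONTEXT} tokens: {gib:.1f} GiB")
```

With those assumed numbers, a 16-bit cache alone would not fit on a single 24 GB card, while a 3-bit one would take roughly a fifth of that space. That is the kind of trade-off a writeup should spell out instead of leaving you to guess.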