I also recently decided to buy a datacenter GPU and slap it into a system. Some notes from my experience that the author doesn't mention in their article:
Decommissioned NVIDIA V100s and AMD MI50s are fairly cheap for local experimentation: $200 for 16GB and $400-500 for 32GB. They are also very old. There's an enthusiast community keeping these two cards alive and working with current platforms and models.
Nitpick, but the V100 doesn't support bfloat16. The performance hit is not a big deal if you're fiddling with local models, but the card is on its way out in terms of hardware features.
The MI50 does support bf16, but the current edition of AMD ROCm doesn't support the card. Vulkan support is good and the MI50 works with most major platforms (llama.cpp, vllm, etc.), but it's not without some pain points like manual recompilation. Fortunately the open source community has already paid most of your way.
The cooling requirements for these cards cannot be overstated. A consumer grade GPU may throttle if in a small case without additional fans, but if given the same treatment a datacenter GPU will overheat itself idling. You will need to buy, at least, a bunch of decent 120mm fans to prevent this or invest in some water cooling.
I ultimately went with an AMD MI100 32GB ($950). I'm an AMD fan, current ROCm editions support it, and it was low-fuss to get things working. I'm debating getting a second so I can try out bigger models like qwen3-coder-next.
> You will need to buy, at least, a bunch of decent 120mm fans to prevent this or invest in some water cooling
There's a cottage industry of 3D-printed fan shrouds for data center GPUs; 120mm fans are often the sweet spot for quietness and practicality. The shroud fits snugly over the GPU's intake, so the card gets all the airflow from the attached fan(s), whose speed curves can be tied to GPU temperature.
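For anyone wiring this up by hand, a temperature-driven fan curve is just piecewise-linear interpolation between a few set points. A minimal sketch in Python: the curve points and the `fan_duty` name are made up for illustration, and actually reading the GPU temperature or writing the fan's PWM value is left out because that depends on your board and tooling.

```python
from bisect import bisect_right

# Hypothetical curve: (GPU temperature in deg C, fan duty in percent).
# Tune these points for your own card, shroud, and fan.
CURVE = [(40.0, 30.0), (60.0, 50.0), (80.0, 100.0)]


def fan_duty(temp_c, curve=CURVE):
    """Return the fan duty (percent) for a GPU temperature in deg C."""
    temps = [t for t, _ in curve]
    # Clamp below the first point and above the last point.
    if temp_c <= temps[0]:
        return curve[0][1]
    if temp_c >= temps[-1]:
        return curve[-1][1]
    # Find the segment containing temp_c and interpolate linearly.
    i = bisect_right(temps, temp_c)
    (t0, d0), (t1, d1) = curve[i - 1], curve[i]
    return d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0)
```

With these example points, 50 degrees maps to 40% duty and anything at or above 80 degrees runs the fan flat out. Datacenter cards have no fan of their own, so it is probably wise to keep the lowest set point well above zero.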
Did you consider the R9700 or B70 when you went for the MI100? If so, what made you choose the MI100?
I've been toying with picking up a card in this class but haven't been able to justify it when running the Qwen3.6 MoE model on a 6800xt is tolerable for the type of projects I've been willing to point local AI at.
I looked at those, the Arc 1100, the w6800, MI50, MI60, v100, v620, and basically anything with 32gb of RAM:
1. I wanted an AMD card.
2. I have an RTX 3090 that's been fun to play with, but I want to get back to using it for gaming.
3. I was looking for 30-60 tokens/second on the beefier models I want to run. For stock Qwen3 32B, the benchmarks reported about 41 tokens/second for the MI100. The w6800 was at 18, and the MI50 & MI60 could get into the 60s but needed a lot of compromises and special handling to achieve that.
4. I used FitMyLLM for some spec-based comparisons (https://www.fitmyllm.com/). The MI100 is roughly double the performance on Qwen 3.5 35B A3B Q5_K_M to the R9700 (462 token/s prefill vs 239 tokens/s, 217 tokens/s vs 118 token/s for inference)
5. I was willing to throw up to $1k at a GPU; I really wanted to throw closer to $650.
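For reference, the ratios behind "roughly double" in item 4 can be checked with a couple of divisions. This is only arithmetic on the FitMyLLM figures quoted there, not a measurement, and the `speedup` helper is an illustrative name:

```python
def speedup(fast, slow):
    """Ratio of two throughput figures given in tokens/s."""
    return fast / slow


# Figures quoted in item 4 (MI100 vs R9700, Qwen 3.5 35B A3B Q5_K_M).
prefill_ratio = speedup(462, 239)
generation_ratio = speedup(217, 118)

print(f"prefill: {prefill_ratio:.2f}x, generation: {generation_ratio:.2f}x")
```

That works out to about 1.93x for prefill and 1.84x for generation, so "roughly double" holds for those numbers.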
To be honest, if money were no object I would've sprung for an MI210. I also considered the MI250, as they showed up for $1250-1400 with a whopping 128GB, but the PCIe converters for that form factor don't have working AMD drivers yet.
> The MI100 is roughly double the performance on Qwen 3.5 35B A3B Q5_K_M to the R9700 (462 token/s prefill vs 239 tokens/s, 217 tokens/s vs 118 token/s for inference)
Those prefill numbers look really low to me. I can run nearly that same model (qwen 3.6) at q4km with q6 cache on a single 3090 and get 2.3k-4.4k prefill and 100-170 generation. Just based on raw numbers I would expect the R9700 to land around 70-90 generation (about 2/3 of memory bandwidth of a 3090) and at least the same or higher prefill (nearly 3x FP16 TOPS on the R9700). That means the numbers really don't add up. Is the benchmark done with some special settings, e.g. parallel requests or with very low prompt length?
Numbers are from https://www.fitmyllm.com/ so they're not a real hardware benchmark, just what you're expected to get. YMMV.
Ah, ok. I took a look at the 3090 numbers and they list 400 tok/s prefill, so if I normalize my expectations to that baseline the numbers you posted do make sense. I haven't dug deep into that site's methodology, but their estimates seem way off, especially since they don't take cache quant into account when deciding whether or not you can run a model. Overall I found that website a bit confusing, but maybe the UX just didn't click with me.
> if given the same treatment a datacenter GPU will overheat itself idling
I have a friend who has learned this through several server grade cards over the years.
Yes, your Intel 10G NIC was cheap. No, you cannot just stick it in your desktop. It is expecting server-level airflow, probably with a cold intake side.
He printed a fan mount, slapped it on, and they’ve been happy together since.
qwen3-coder-next runs fine on my consumer grade nvidia 4070. Performance is not spectacular, but it's only a little bit slower than a properly-fit model.
What are your settings and tokens/second? Even with 2 GPUs (MI100, RX 6600 XT 8GB) and 32GB of RAM it was running at a snail's pace for me.
I didn't try a sched_spread with a 3090 and the MI100, which would provide 56GB of RAM.
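The 56GB figure is just the two cards' memory added together. A rough sketch of that budget check in Python: the `fits` helper and the example sizes passed to it are hypothetical, and the result is an optimistic upper bound, since splitting a model across two cards rarely uses every last gigabyte on both.

```python
# Memory per card in GB: the MI100 is 32GB, and 24GB for the 3090
# follows from the 56GB total mentioned above.
cards_gb = {"RTX 3090": 24, "MI100": 32}
pool_gb = sum(cards_gb.values())


def fits(weights_gb, cache_gb, budget_gb):
    """Optimistic check: do the weights plus cache fit in the pooled memory?"""
    return weights_gb + cache_gb <= budget_gb


# Hypothetical sizes for illustration only, not measured for any real model.
print(pool_gb, fits(45, 6, pool_gb), fits(52, 6, pool_gb))
```

Plug in the actual quantized weight size and expected cache size for the model you care about to see whether the pair is worth trying.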