Hacker News

> The MI100 has roughly double the performance of the R9700 on Qwen 3.5 35B A3B Q5_K_M (462 tokens/s prefill vs 239 tokens/s, 217 tokens/s vs 118 tokens/s for inference)

Those prefill numbers look really low to me. I can run nearly the same model (Qwen 3.6) at Q4_K_M with q6 cache on a single 3090 and get 2.3k-4.4k tokens/s prefill and 100-170 tokens/s generation. Based on raw numbers alone, I would expect the R9700 to land around 70-90 tokens/s generation (it has about 2/3 of the memory bandwidth of a 3090) and at least the same or higher prefill (nearly 3x the FP16 TOPS on the R9700). So the numbers really don't add up. Was the benchmark run with some special settings, e.g. parallel requests or a very short prompt?
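A back-of-envelope sketch of that scaling, assuming generation is purely memory-bandwidth-bound and prefill is compute-bound (both simplifications). The function name is made up for illustration, and the ratios are the rough ones from this comment, not spec-sheet values:

```python
def scale_range(baseline, ratio):
    """Scale a (low, high) tokens/s range by a hardware ratio."""
    low, high = baseline
    return (low * ratio, high * ratio)

# Figures quoted above for a single 3090 (informal runs, not a formal benchmark).
gen_3090 = (100, 170)
prefill_3090 = (2300, 4400)

# Rough ratios from the comment; assumptions, not measured specs.
bandwidth_ratio = 2 / 3  # R9700 memory bandwidth relative to a 3090
compute_ratio = 1.0      # conservative floor: "at least the same" prefill

gen_est = scale_range(gen_3090, bandwidth_ratio)
prefill_est = scale_range(prefill_3090, compute_ratio)

print(f"generation: {gen_est[0]:.0f}-{gen_est[1]:.0f} tokens/s")
print(f"prefill: at least {prefill_est[0]:.0f}-{prefill_est[1]:.0f} tokens/s")
```

Plain linear scaling gives a somewhat wider generation range than the 70-90 guessed above, but the prefill floor stays in the thousands of tokens/s, far above the quoted 239.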

sonzohan 2 hours ago

Numbers are from https://www.fitmyllm.com/, so they're not a real hardware benchmark, just what you're expected to get. YMMV.

rft an hour ago

Ah, ok. I took a look at the 3090 numbers and they list 400 tok/s prefill, so if I normalize my expectations to that baseline, the numbers you posted do make sense. I haven't dug deep into that site's methodology, but their estimates seem way off, especially since they don't take cache quantization into account when deciding whether or not you can run a model. Overall I found that website a bit confusing, but maybe the UX just didn't click with me.