> Standard GPUs

> 8× NVIDIA H200

As in, not custom chips like Groq and Cerebras. Did you expect a single GPU chip to reach 3k tps?

I think many would assume "not enterprise" or "not datacenter grade" when someone says "Standard GPUs", but maybe that specific phrase has a specific meaning I'm not familiar with.

Edit: I just tried a 4B model on an RTX Pro 6000 and got ~500 tok/s with llama.cpp, without trying to optimize or change anything, just default settings. I'm sure vLLM would already be a lot faster, still before manually tuning configs. I wouldn't call that card a "Standard GPU" either FWIW, but it makes the claimed performance numbers feel less exciting, especially given the hardware they were using.

I expected a 4090, maybe 2. I did not expect 8xH200 for a 2B model.

Great points, let me clarify:

- model size: 2B is just for this preview (it was faster to implement). Our article explains how we expect to support large frontier MoE models at 1,000 to 5,000 tokens/s.

- reaching 500 tok/s, or even up to ~1,000 tok/s, on a consumer GPU card is possible with existing inference engines like vLLM. But there is a ceiling.

The hard part comes when you try to be faster than that: these frameworks won't scale higher just by adding GPUs or using faster GPUs. There is a "glass ceiling" due to microseconds lost everywhere in the stack (grid syncs, inter-GPU comms, kernel launches, CPU sampling, etc.).
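To make the "glass ceiling" argument concrete, here is a rough back-of-the-envelope sketch. Every number in it (layer count, kernels per layer, launch and sync costs, sampling time) is a made-up placeholder for illustration, not a measurement from Kog, vLLM, or any real GPU. The only point is that fixed per-token overhead caps single-stream throughput no matter how fast the compute gets.

```python
# Back-of-the-envelope model of single-stream decode throughput.
# All constants below are hypothetical placeholders, not measurements.

def per_token_overhead_us(layers, kernels_per_layer, launch_us,
                          syncs_per_layer, sync_us, sampling_us):
    """Fixed time per token that does not shrink with a faster GPU."""
    per_layer = kernels_per_layer * launch_us + syncs_per_layer * sync_us
    return layers * per_layer + sampling_us

def tokens_per_second(overhead_us, compute_us):
    """Decode rate when each token costs overhead plus compute time."""
    return 1_000_000 / (overhead_us + compute_us)

# Hypothetical stack: 28 layers, 10 kernel launches per layer at 5 us each,
# 2 syncs per layer at 10 us each, 50 us of CPU-side sampling per token.
overhead = per_token_overhead_us(
    layers=28, kernels_per_layer=10, launch_us=5.0,
    syncs_per_layer=2, sync_us=10.0, sampling_us=50.0,
)

baseline = tokens_per_second(overhead, compute_us=500.0)    # assumed compute time
faster_gpu = tokens_per_second(overhead, compute_us=250.0)  # compute twice as fast
ceiling = tokens_per_second(overhead, compute_us=0.0)       # infinitely fast compute

print(f"overhead per token: {overhead:.0f} us")
print(f"baseline: {baseline:.0f} tok/s, 2x faster compute: {faster_gpu:.0f} tok/s")
print(f"ceiling with zero compute time: {ceiling:.0f} tok/s")
```

Under these assumed numbers, halving the compute time buys only about 10% more tokens per second, and even infinitely fast compute stays below 500 tok/s. Going past that requires shrinking the overhead term itself.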

All our work at Kog is about removing these bottlenecks.

Thank you for explaining. Do you think there are still opportunities for stack optimizations to meaningfully speed up inference on single consumer-grade GPUs?

That doesn't clarify anything lol. It's a bit clickbaity.

> Did you expect a single GPU chip to reach 3k tps?

Did the article headline not say Standard GPU?

So what would be the above-standard GPUs that they are excluding, then? Cerebras is not a GPU.


Everyone beholden to a data center or subject to the installation on the corner of your property of course. Keep up with the times... /s