Hacker News

> Standard GPUs

> 8× NVIDIA H200

as in not custom chips like Groq and Cerebras. Did you expect a single GPU chip to reach 3k tps?

I think many would assume "not enterprise" or "not datacenter grade" when someone says "Standard GPUs", but maybe that specific phrase has a specific meaning I'm not familiar with.

Edit: I just tried a 4B model on an RTX Pro 6000, getting ~500 tok/s with llama.cpp without even trying to optimize or change anything, just default settings. I'm sure with vLLM it'd be a lot faster already, still before manually tuning configs. I wouldn't call that card a "Standard GPU" either FWIW, but it makes the claimed performance numbers feel not as exciting, especially given the hardware they were using.

ismailmaj 6 hours ago

I expected a 4090, maybe 2. I did not expect 8xH200 for a 2B model.

gaeld 6 hours ago

Great points, let me clarify:

- model size: 2B is just for this preview (it was faster to implement); our article explains how we expect to support large frontier MoE models at 1,000 to 5,000 tokens/s

- reaching 500 tok/s, or even up to ~1,000 tok/s, on a consumer GPU card is possible with existing inference engines like vLLM. But there is a ceiling.

The hard part comes when you try to be faster than that: these frameworks won't scale higher just by adding GPUs or using faster GPUs. There is a "glass ceiling" due to microseconds lost everywhere in the stack (grid syncs, inter-GPU comms, kernel launches, CPU sampling, etc.).

All our work at Kog is about removing these bottlenecks.
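
A back-of-the-envelope sketch of the "glass ceiling" argument, for illustration only. Single-stream decoding is sequential (each token depends on the previous one), so tokens/s is bounded by 1 / per-token latency, and at 3k tok/s the whole budget is about 333 µs per token. Every number below is a hypothetical placeholder, not a measurement from Kog, vLLM, or any real GPU, and `max_tokens_per_second` is a made-up helper name.

```python
def max_tokens_per_second(compute_us: float, overhead_us: float) -> float:
    """Upper bound on single-stream decode rate when each token costs
    compute_us of useful GPU work plus overhead_us of fixed overhead."""
    return 1_000_000 / (compute_us + overhead_us)


# Hypothetical fixed per-token overheads in microseconds (illustrative only).
overheads_us = {
    "kernel_launches": 150.0,
    "grid_syncs": 50.0,
    "inter_gpu_comms": 80.0,
    "cpu_sampling": 20.0,
}
total_overhead = sum(overheads_us.values())

# Shrinking the useful compute time (more or faster GPUs) helps less and less:
# the rate can never exceed 1_000_000 / total_overhead tokens/s.
for compute_us in (700.0, 100.0, 10.0):
    rate = max_tokens_per_second(compute_us, total_overhead)
    print(f"compute={compute_us:6.1f} us -> at most {rate:7.1f} tok/s")

print(f"ceiling with zero compute time: {1_000_000 / total_overhead:.1f} tok/s")
```

Under these assumed numbers, only reducing the overhead terms themselves raises the ceiling, which is the point being made about removing bottlenecks in the stack.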

dr_kiszonka 10 minutes ago

Thank you for explaining. Do you think there are still opportunities for stack optimizations to meaningfully speed up inference on single consumer-grade GPUs?

bcjdjsndon 3 hours ago

That doesn't clarify anything lol. It's a bit clickbaity.

bcjdjsndon 3 hours ago

> Did you expect a single GPU chip to reach 3k tps?

Did the article headline not say Standard GPU?

WithinReason 5 hours ago

So what would the above-standard GPUs be, then, that they are excluding? Cerebras is not a GPU.

imputation 7 hours ago

Everyone beholden to a data center or subject to the installation on the corner of your property of course. Keep up with the times... /s