Great points.
We strived to be fair as possible in the benchmark, but it's indeed not perfect. Taalas should have been added in the dedicated hardware section, even though they use 3-bit quantization when we are on FP16 (to be fair in both directions) and they burn the model directly on the card.
Our tech preview is about the speed (hence the small dense model, it was easier to implement).
The math checks out though to allow support for large frontier MoE models at similar speeds: - At batch size 1, GPT-OSS-120B has 5.1B active parameters - in FP8, it's in the same size ballpark than our 2B model in FP16 (5.1 GB vs 4GB). - DeepSeek V4 Flash has 13B in mixed FP4/FP8, so let's say ballpark around 3x bigger than 4GB - so in theory we could reach >1,000 tok/s on it with MI300X/H200 and up to 4k on next generation GPUs.
Check out the math at the end of our blog post:
https://blog.kog.ai/real-time-llm-inference-on-standard-gpus...
Your playground/write-up is very interesting and I would be really interested when you can have something like Deepseek V4 Flash model (49B) running as you are suggesting.
I haven't read the article at the moment and I will try to read them hopefully but I wish to ask a question regarding, can this approach be done for say trillion or large parameter models as well or is there some wall which gets hit that makes it valuable for only smaller parameter model.
That being said, its still really incredible because in future, because these small models are really getting good for many use cases and speed becomes their bottleneck, with greater speeds at consumer hardware, I think its gonna be amazing work!
Consumer inference scenarios tend to be highly bespoke so it's difficult to apply a monokernel approach based on deep manual optimization. I suppose this could become applicable to rare scenarios where both the model and the hardware are fixed and self-contained, e.g. I'm running Apple's AI model on the latest Apple Silicon hardware. Then this becomes a viable approach even for 'consumer' use.
The authors' approach also encompasses multi-node approaches that won't apply easily to consumer inference since consumer GPUs have very low-performance interconnects, hence why layer parallelism is usually favored. (But that doesn't work very well with the monokernel approach, since it involves running distinct logic on each separate GPU. It also doesn't speed up single inference, though you can get that throughput back by pipelining small minibatches.)
Thanks for the comment and the question!
The last section of the article lays out the scaling laws that apply when porting this approach to another model. In a nutshell, DeepSeek V4 Pro with 49B active params is close to the upper bound.
Also worth noting that our results are currently for standard datacenter GPUs. On consumer hardware, though the same low-level optimization approach applies, the bandwidth limitations will cap the achievable speed.