In March, vLLM picked up some of the improvements from the DeepSeek paper. With these, vLLM v0.7.3's DeepSeek performance jumped to more than 3x what it was before [1].

What's exciting is that there's still so much room for improvement. With vLLM under high concurrency, we benchmark around 5K total tokens/s on the ShareGPT dataset and 12K total tokens/s on random 2000/100.
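For context, here's a rough sketch of how one might measure total tokens/s against a vLLM OpenAI-compatible server under high concurrency. The endpoint, model name, prompt, concurrency, and request counts below are placeholders, not the exact setup behind the numbers above:

    import asyncio
    import time

    import aiohttp

    # Placeholders: point these at your own vLLM OpenAI-compatible server.
    BASE_URL = "http://localhost:8000/v1/completions"
    MODEL = "deepseek-ai/DeepSeek-R1"
    CONCURRENCY = 64      # "high concurrency" knob
    NUM_REQUESTS = 256
    MAX_TOKENS = 100      # output length, loosely mirroring the "/100" above
    PROMPT = "Explain the difference between prefill and decode in one paragraph."

    async def one_request(session, sem):
        # Limit in-flight requests with a semaphore; count prompt + completion tokens.
        async with sem:
            payload = {"model": MODEL, "prompt": PROMPT, "max_tokens": MAX_TOKENS}
            async with session.post(BASE_URL, json=payload) as resp:
                usage = (await resp.json())["usage"]
                return usage["prompt_tokens"] + usage["completion_tokens"]

    async def main():
        sem = asyncio.Semaphore(CONCURRENCY)
        async with aiohttp.ClientSession() as session:
            start = time.perf_counter()
            totals = await asyncio.gather(
                *(one_request(session, sem) for _ in range(NUM_REQUESTS))
            )
            elapsed = time.perf_counter() - start
        print(f"total tokens/s: {sum(totals) / elapsed:.0f}")

    if __name__ == "__main__":
        asyncio.run(main())

The real numbers above presumably come from vLLM's own serving benchmark script, which samples varied prompt lengths; this toy version just fires a fixed short prompt to show the shape of the measurement.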

The DeepSeek-V3/R1 Inference System Overview [2] reports: "Each H800 node delivers an average throughput of 73.7k tokens/s input (including cache hits) during prefilling or 14.8k tokens/s output during decoding."

Yes, DeepSeek deploys a different inference architecture. But this goes to show just how much room there is for improvement. Looking forward to more open source!

[1] https://developers.redhat.com/articles/2025/03/19/how-we-opt...

[2] https://github.com/deepseek-ai/open-infra-index/blob/main/20...