Hacker News

6 tokens per second is not fit for interactive use. I find Gemma 4 (QAT 4-bit, MTP) tolerable at about 30 tokens per second on my old GPUs. Anything slower than 15 is annoying. I tried DS4 on my Strix Halo (a 1-bit quantization of DeepSeek V4 Flash, the biggest model that can realistically run on 128GB right now), and it tops out at something like 10 or 11 tokens per second with a long time to first response, which is quite painful to use. I'd definitely rather spend money to use the big models on cloud infrastructure.

And the several thousand dollars it costs to run these things unusably slowly buys a lot of tokens on the cheap Chinese models.
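
To put rough numbers on that, here's a back-of-envelope sketch. The dollar figures are placeholders I made up for illustration (a $3,000 hardware budget and a hypothetical $1 per million output tokens), not quoted prices, and the helper names are my own. Only the roughly 10 tokens per second comes from the experience above. It ignores electricity, input-token pricing, and resale value.

```python
def tokens_for_budget(budget_usd: float, price_per_million_usd: float) -> float:
    """Number of tokens a budget buys at a flat per-million-token price."""
    return budget_usd / price_per_million_usd * 1_000_000


def days_to_generate(tokens: float, tokens_per_second: float) -> float:
    """Days of nonstop generation needed to produce that many tokens locally."""
    return tokens / tokens_per_second / 86_400  # 86,400 seconds per day


# Assumed placeholder figures, not real prices: swap in your own.
HARDWARE_BUDGET_USD = 3000.0      # stand-in for "several thousand dollars"
API_PRICE_PER_MILLION_USD = 1.0   # hypothetical cloud price per 1M output tokens
LOCAL_TOKENS_PER_SECOND = 10.0    # roughly the 10 or 11 reported above

tokens = tokens_for_budget(HARDWARE_BUDGET_USD, API_PRICE_PER_MILLION_USD)
days = days_to_generate(tokens, LOCAL_TOKENS_PER_SECOND)
print(f"{tokens:,.0f} tokens, about {days:,.0f} days of nonstop local generation")
```

With those placeholder inputs the budget buys 3 billion tokens, which would take the local box roughly 3,472 days of continuous generation to produce. Change the assumed price by 10x in either direction and the conclusion shifts accordingly, so treat it as a way to plug in your own numbers rather than a result.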