Hacker News

DwarfStar work-in-progress numbers: I see 14 tokens/sec generation, dropping to 10 t/s with longer contexts of 10k tokens or more. Consider that the indexed attention requires evaluating 2048 selected rows, 2x DeepSeek and with less compression, so performance with larger contexts goes south faster here. Prefill ranges from 180 t/s on small contexts down to 150 t/s and less with larger contexts. I used DeepSeek v4 PRO under these conditions: it is usable, but far from the 35 t/s generation and 400 t/s prefill you get with DeepSeek v4 Flash 2-bit on a MacBook M5 Max. My implementation is likely not optimized enough yet, so a bit more performance can be obtained. I'm using 4-bit quants. The model is also definitely less sparse than DeepSeek v4, so it activates a bigger percentage of parameters. If it works decently at 2-bit, that would be a win even for machines where 4-bit fits, since it would basically mean 2x the effective memory bandwidth for the routed experts.
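A rough sketch of the bandwidth argument above, assuming decode is purely memory-bandwidth bound and every active weight is read once per token. The function name and all numbers (bandwidth, active parameter counts, the routed/non-routed split) are hypothetical placeholders, not measurements of this model or of any machine.

```python
def decode_tokens_per_sec(bandwidth_gb_s, routed_params_b, other_params_b,
                          routed_bits, other_bits=4):
    """Upper bound on decode speed if reading active weights is the only cost.

    Parameter counts are billions of *active* parameters per token.
    All inputs are illustrative assumptions.
    """
    # Billions of params * bits / 8 = gigabytes read per generated token.
    gb_per_token = (routed_params_b * routed_bits + other_params_b * other_bits) / 8
    return bandwidth_gb_s / gb_per_token


# Hypothetical setup: 800 GB/s, 30B active routed-expert params, 5B other active params.
q4 = decode_tokens_per_sec(800, 30, 5, routed_bits=4)
q2 = decode_tokens_per_sec(800, 30, 5, routed_bits=2)

print(f"4-bit routed experts: {q4:.1f} t/s upper bound")
print(f"2-bit routed experts: {q2:.1f} t/s upper bound")
print(f"speedup: {q2 / q4:.2f}x")
```

With these made-up numbers the overall speedup stays below 2x, because only the routed experts shrink while the remaining weights stay at 4-bit. Real decode speed will also depend on attention cost and kernel efficiency, which this ignores.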

Local inference really needs a system with 1.2 to 1.5 TB/s of memory bandwidth, 512GB of memory, and 2 to 3 times the GPU compute of a Mac Studio M3 Ultra, at an affordable 10-15k price point. A variant with 1TB of memory at a 20k price point would also be welcome.

10k context is not a whole lot; this model theoretically supports up to 1M. But the KV cache takes up a whole lot more memory at full context than DeepSeek V4 Pro's, let alone Flash's. (About 96GB according to readily available KV cache calculators, and it might be more in practice. For comparison, DeepSeek Flash is ~10GB and Pro is at least in that ballpark.) So I'm not sure that this model is a good deal for memory-constrained machines unless you're specifically interested in very short contexts only. It could still be worth it if it came with a game-changing increase in smarts, but that seems a bit unlikely so far.
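To put those full-context figures in per-context terms, here is a minimal sketch that scales them down linearly. The 96GB and 10GB values are the estimates quoted above; the linear scaling and the reading of "1M" as 1,000,000 tokens are assumptions, and the helper name is made up for illustration.

```python
def kv_cache_gb(context_tokens, gb_at_full_context, full_context_tokens=1_000_000):
    """Scale a full-context KV cache estimate linearly to a shorter context.

    Assumes cache size grows in proportion to token count, which may not
    hold exactly for compressed or indexed attention schemes.
    """
    return gb_at_full_context * context_tokens / full_context_tokens


# Full-context estimates taken from the comment above (calculator figures).
estimates = [("this model", 96), ("DeepSeek V4 Flash", 10)]

for name, full_gb in estimates:
    for ctx in (10_000, 100_000, 1_000_000):
        print(f"{name}: {ctx:>9,} tokens -> ~{kv_cache_gb(ctx, full_gb):.2f} GB")
```

Under these assumptions the cache costs roughly 96KB per token for this model versus roughly 10KB for Flash, so the gap only starts to matter for memory once contexts get long.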

It will be interesting to see how this model does in an SSD streaming scenario; the lower sparsity should ideally be favorable.

> Local inference really needs a system with 1.2 to 1.5 TB/s of memory bandwidth, 512GB of memory, and 2 to 3 times the GPU compute of a Mac Studio M3 Ultra, at an affordable 10-15k price point. A variant with 1TB of memory at a 20k price point would also be welcome.

Are these specs realistic at present? It's not that clear to me; 1.5 TB/s seems really high.