Comparison with an RTX Pro 6000, with DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf:
prefill: 121.76 t/s, generation: 47.85 t/s
Main target seems to be Apple's Metal, so that makes sense. Might be fun to see how fast one could make it go though :) The model seems really good too, even though it's in IQ2.
I don't want to be a jerk, but 31 t/s prefill is basically unusable in an agentic situation. With a mere 10k tokens of context you're sitting there for 5+ minutes before the first token is generated.
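The back-of-the-envelope math here checks out: at a fixed prefill rate, time to first token scales linearly with context length. A quick sketch using the rates quoted in this thread:

```python
def time_to_first_token(context_tokens: int, prefill_tps: float) -> float:
    """Seconds spent in prefill before the first generated token appears."""
    return context_tokens / prefill_tps

# 10k tokens of context at the 31 t/s figure above:
print(f"{time_to_first_token(10_000, 31) / 60:.1f} min")      # ~5.4 min
# Same context at the RTX Pro 6000's 121.76 t/s:
print(f"{time_to_first_token(10_000, 121.76) / 60:.1f} min")  # ~1.4 min
```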
Healthy!
From https://gist.github.com/simonw/31127f9025845c4c9b10c3e0d8612...
That prefill number isn't right. M4 Max hits 200-300: https://github.com/antirez/ds4/blob/main/speed-bench/m4_max_...
The M5 Studio is gonna sell like hot cakes.
Hah, that's because the prompt itself was only about 30 tokens. We need a much bigger prompt to properly test PP (prompt processing).
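This is worth being precise about: with a 30-token prompt the prefill window is so short that the measured rate is mostly noise and warm-up. A hedged sketch of how you'd split a streamed generation into the two numbers (the fake stream in the test stands in for any real model client):

```python
import time

def measure(stream, prompt_tokens: int):
    """Split a streamed generation into prefill vs. generation throughput.

    Time to first token approximates prefill cost; the remaining tokens
    approximate generation speed. With only ~30 prompt tokens the first
    interval is too short to measure meaningfully -- use thousands.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        count += 1
        if first is None:
            first = time.perf_counter()
    end = time.perf_counter()
    if first is None:
        raise ValueError("stream produced no tokens")
    prefill_tps = prompt_tokens / (first - start)
    gen_tps = (count - 1) / (end - first) if count > 1 else 0.0
    return prefill_tps, gen_tps
```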
if it's just the coding agent system prompt and tools, you can cache that
Yeah the problem is that's just the start of the context. There's, you know, all the tool call results and file reads and stuff.
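The exchange above maps onto how prefix caching behaves: the fixed system prompt and tool definitions are a one-time prefill cost, but tool results and file reads accumulate as uncached suffix that must be prefilled every turn. A toy sketch of that bookkeeping (the class and its whitespace token counting are illustrative, not any server's actual API):

```python
import hashlib

class PrefixCache:
    """Toy model of KV-cache prefix reuse: a stable prefix (system
    prompt + tools) is prefilled once; later requests only pay prefill
    for the tokens that come after the cached prefix."""

    def __init__(self):
        self._cached = set()  # hashes of prefixes already prefilled

    def tokens_to_prefill(self, prefix: str, suffix: str) -> int:
        key = hashlib.sha256(prefix.encode()).hexdigest()
        prefix_toks = len(prefix.split())   # crude stand-in for a tokenizer
        suffix_toks = len(suffix.split())
        if key not in self._cached:
            self._cached.add(key)
            return prefix_toks + suffix_toks  # cold: pay the full cost
        return suffix_toks                    # warm: only the new suffix
```

The point of the reply above is that the warm-path cost still grows without bound as tool output piles onto the suffix.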
What are token speeds like for frontier models, if that gives a rough idea of how much "slower" slow is?