> About the generation speed: ~100-150 t/s on the RTX 5090 and ~40 t/s on the Mac
Curious if you can share the prefill speed too?
I run locally on a crappy desktop (some AMD iGPU with Vulkan llama.cpp, 32 GB DDR4 RAM) for experimentation. I get 15 tok/s on generation for the qwen & gemma4 MoE models. I get around 150 tok/s prefill speed.
Reason I'm asking about the prefill is looking at my stats at work, I use between 20M to peaks of 300M input tokens daily. Some of those token are cached but in general, I seem to have roughly 500x more input tokens than output. So interested in prefill tok/s stats.
Huge Thank you for llama.cpp btw!!
Here are the prefill speeds:
Btw, based on your numbers, I think our use cases are quite different. I use the agent for very targeted sessions - basically things that are clear to me how to do, just want to automate them. My workflow is usually: new session -> read this, this and this -> do that. I.e. I don't let it wander at all in the codebase, so I rarely exceed the context window.Also, I get a lot of mileage from the ngram-based speculative decoding functionality [0] as it allows me to iterate on the implementation much faster.
[0] https://github.com/ggml-org/llama.cpp/pull/19164
Thanks! Super helpful.
I do use it the same way as you're describing on personal projects at home, in a very crude manner (pasting code snippets in llama server web UI prompt. Next will attempt OpenCode)
At work I use it in similar manner with more mature tools, but the vast majority of token spend comes from a totally different workflow: "pretend the AI is a fleet of junior/intern engineer you're delegating work to", where the agent will on its own do the implementation, commit the changes etc.
It does indeed spend a lot of tokens wandering the codebase, talking to MCPs, loading skills etc.