Impressive work. But the problem is not the 30 tok/s, which is fine for agentic coding and chat.
It's prefill; slow prefill kills agentic workloads dead.
If you have 100,000 tokens at ~150 tok/s per the OP, you're looking at:
You have: 100000 / (150/s)
You want: hms
11 min + 6.6666667 sec
Which is quite a wait indeed.
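For anyone who wants to redo that arithmetic with their own numbers, here is a minimal Python sketch of the same calculation. The function names are made up for illustration; the only inputs are the prompt length and the prefill rate quoted above.

```python
def prefill_seconds(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Time to process the whole prompt once at a constant prefill rate."""
    return prompt_tokens / prefill_tok_per_s


def as_min_sec(seconds: float) -> tuple[int, float]:
    """Split a duration in seconds into whole minutes and leftover seconds."""
    minutes, rem = divmod(seconds, 60)
    return int(minutes), rem


# Numbers from the comment above: 100,000 tokens at ~150 tok/s.
total = prefill_seconds(100_000, 150)
minutes, rem = as_min_sec(total)
print(f"{minutes} min + {rem:.1f} sec")
```

This assumes the prefill rate stays constant over the whole prompt, which is a simplification.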
Most people won’t be dumping 100K tokens into it at once, but I agree that all of the prefill time that adds up during a session becomes a lot to account for.
This is also a problem for all of the Mac local LLMs. Macs are a great way to get a lot of high bandwidth memory, but their compute is very far behind current gen dedicated GPUs. Some of the expensive Mac Studio setups allow you to run very large models with usable tokens/s, but you can be waiting a long time for it to get to the point of generating those tokens.
The prefix cache is working properly, so the 100K tokens don’t get prefilled more than once.
When you're using OpenCode it's easy to reach 100,000 tokens after a while.
I wonder if this could be usefully mitigated with a combination of prompt (prefix) caching and an agent that let you control what the prompt prefix consisted of. The goal would be to incur that slow prefill once to build the prompt cache, then have subsequent prompts consist of mostly this fixed prefix plus specific instructions.
For a language like C++ where modules are split into definition (.h) and implementation (.cpp) parts, one choice of prefix would be all the header files for the project (which aren't likely to change much).
More generally, the idea would be to have an agent that had cached-prefix reuse as its primary context management goal.
Another possibility, to support caching of files that have since changed, would be for the agent to build the context as a fixed prefix reflecting some or all of the codebase in its start-of-session state, then append any changes to that, with appropriate prompting to only use the latest definition of a function.
e.g.
Say file A initially contains functions X, Y and Z, then the prompt prefix is built to include X Y Z. If the user then modifies Y -> Y', then just add that to the context, so that the cached prefix is unchanged, giving X Y Z Y'.
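A rough Python sketch of that X Y Z Y' idea, to make the cache argument concrete. Everything here is hypothetical (the `AppendOnlyContext` class, the `###` segment headers, the placeholder function bodies), and it compares characters rather than tokens as a stand-in for what a real prefix cache would match on.

```python
from dataclasses import dataclass, field


@dataclass
class AppendOnlyContext:
    """Prompt builder that only ever appends, so earlier text never changes."""
    segments: list[str] = field(default_factory=list)

    def add(self, name: str, body: str) -> None:
        self.segments.append(f"### {name}\n{body}\n")

    def render(self) -> str:
        return "".join(self.segments)


def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared leading run; a proxy for the reusable cached prefix."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


ctx = AppendOnlyContext()
for name in ("X", "Y", "Z"):
    ctx.add(name, f"def {name}(): ...")
before = ctx.render()

# User modifies Y -> Y': append the new definition instead of editing in place.
ctx.add("Y (updated, supersedes the earlier Y)", "def Y(): return 1")
appended = ctx.render()

# For contrast: rewriting Y where it stands changes the prompt mid-prefix.
in_place = before.replace("def Y(): ...", "def Y(): return 1")

print(common_prefix_len(before, appended) == len(before))  # whole old prompt reusable
print(common_prefix_len(before, in_place) < len(before))   # reuse stops at the edit
```

The append-only version keeps the entire original prompt as a shared prefix, while the in-place edit only shares text up to the point where Y changed. Whether the model reliably prefers the later definition is a prompting question this sketch does not address.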
A quick search says this is a standard feature: you cache the prefill and load it at PCIe bandwidth, so it should take about 0.2 s.
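The shape of that estimate is just cache size divided by transfer bandwidth. A small Python sketch follows; both inputs are assumptions chosen only to illustrate the calculation (a 6.4 GB saved cache and 32 GB/s of usable bandwidth), and neither figure comes from the thread or from any measurement.

```python
def cache_load_seconds(cache_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to copy a saved prefill (KV) cache at a given sustained bandwidth."""
    return cache_bytes / bandwidth_bytes_per_s


GB = 1e9
assumed_cache_bytes = 6.4 * GB     # assumption: size of the saved cache
assumed_bandwidth = 32 * GB        # assumption: sustained bytes per second

t = cache_load_seconds(assumed_cache_bytes, assumed_bandwidth)
print(f"{t:.2f} s")
```

The real figure depends on the model's cache size per token, the context length, and the bandwidth the hardware actually sustains, so plug in your own values.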