Would be interesting to use local models for:
- tool calling
- code base exploration
- anonymizing / abstracting your request
Such that your local AI communicates to frontier model like an expensive consultant giving high level advice.
I think due to the lower latency of a local model that this could be faster.
One of the things I mentioned in the post:
> Local models can quickly read and explain codebases, even if they can't write them - this is a superpower
Might have been buried lower down.
And yes latency of local on a fast card with MTP enabled can be blistering 130-200 tokens per second sustained at full context on Q5. About 100+ on Q8.
On tool calling
> Agent Skills can help immensely - we had a local agent set up Slicer completely from scratch on a new mini PC. It even gave feedback on the usability of slicer CLI which we integrated
There's a link to a post showing some examples.
Occasionally, we'll also have the local model _review_ the changes of GPT/Opus - and it can return duds, but also insights the larger model overlooked, or was too intelligent to pick out.
So yes - absolutely blazing fast at understanding a codebase, very good at running skills "cheaply" and could be used with larger models as a "helper" / sub-agent.
I used Qwen 27b 8 bit MLX version on a decompiled android APK recently. It succesfully identified how it worked even the obfuscated classes and methods. It wrote a 1000 line documentation with examples but the time was dreadful. At some point it slowed down to 5 t/s so the whole thing took over an hour , the writing of documentation alone was over 40 minutes, fans blasting the entire time.
I know it uses electricity, but part of the benefit of a local model has to be that you can let it do this while you sleep, and not pay Anthropic for an unknown number of tokens.
yeah i totally understand and I am thoroughly impressed it works. And the electricity cost isn't that bad since it was on a ARM laptop (MacBook M3 Max) and not a beefy workstation with a GPU. I just let the agent do its work while watching the World Cup.
I doubt your experience of local models would be of lower latency, except for quite small models in edge uses.
In every way, the cloud products from the big two seem optimised for speed and speed of initial response even.
I don’t think most people are running local models for speed. More for control, privacy, interest, bloody-mindedness and general principle.