I can also confirm that local inference is on par with proprietary cloud services (with a bit of local setup, a simple agents.md and some utility skills). These local models come with tools, which is mind-blowing considering that a few months ago we had to write tools as .md files ourselves. What makes a model worth even more is "memory". We implemented that long ago. The last time I used a proprietary service was 3 months ago; I don't really need it, and my subscription is just sitting unused.
Gerganov, I hope you will consider developing the CLI further, because we are suffering with the server.
What are you using for memory with your local models? Is there a specific harness you would recommend for local agents?
I’m using Hermes at the moment - it comes with lots of tools already baked in for the agent to use - for example web and browser access just worked, rather than having to mess around loads with config scripts and plugins.
I’ve also tried OpenCode (similar but a bit less so) and Pi (fast but you have to add lots of features yourself which is a bit of a pain). Claude Code can also be pointed at a local model and works, but the default system prompt is huge. (~140k of text when I extracted mine, IIRC.)
I use HugstonOne (its backend is a customized version of llama.cpp). It implements its own two-layer memory. The first layer recalls the full or partial previous session/file and has an ON/OFF switch (so it picks up where it left off in the CLI, the server, or both at the same time). The second kicks in when the memory switch is off: it reads back a percentage of the current tab, making a checkpoint every certain number of tokens, summarizing, and referring back to the summaries when needed (recall is triggered by certain rules). There is more to it once local RAG is involved (making it a triple memory layer), but that's a long story.
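For anyone wondering what the checkpoint layer could look like in practice, here is a minimal Python sketch. It is not HugstonOne's actual code: the class name `CheckpointMemory`, the word-count "tokenizer", the truncating summarizer and the keyword-overlap recall rule are all assumptions made up for illustration. A real setup would count tokens with the model's tokenizer and ask the model itself for the summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def truncate_summary(text: str, max_words: int = 12) -> str:
    """Placeholder summarizer (assumption): keep only the first few words."""
    return " ".join(text.split()[:max_words])


@dataclass
class CheckpointMemory:
    """Hypothetical checkpoint layer: summarize the conversation every N tokens."""

    checkpoint_every: int = 50
    summarize: Callable[[str], str] = truncate_summary
    checkpoints: List[str] = field(default_factory=list)
    _buffer: List[str] = field(default_factory=list)
    _buffered_tokens: int = 0

    def add(self, message: str) -> None:
        """Buffer a message; once the token budget is reached, store a summary."""
        self._buffer.append(message)
        self._buffered_tokens += count_tokens(message)
        if self._buffered_tokens >= self.checkpoint_every:
            self.checkpoints.append(self.summarize(" ".join(self._buffer)))
            self._buffer.clear()
            self._buffered_tokens = 0

    def recall(self, query: str) -> List[str]:
        """Return the checkpoints that share at least one word with the query."""
        wanted = set(query.lower().split())
        return [c for c in self.checkpoints if wanted & set(c.lower().split())]
```

Usage would be along the lines of calling `memory.add(...)` after every turn and prepending `memory.recall(user_prompt)` to the context before the next one. The recall rule here is deliberately simple, in the spirit of "if this, do that".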
As for the harness, it depends on what you need it for, but as a universal unit of measure: a harness is multilayered and depends on both logic and domain. I would definitely include the type of hardware, model parameters/knowledge, model intelligence, model size/context, type of conversion, and type and quantization (models come with some default tools), and then add your own domain-specific skills, tools, memory, logs, security, RAG, online search... (which, as scary as they sound, are mostly simple logic in a txt file, like "if this, do that").
The full pack is Harness 10; every missing piece lowers the harness score.
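As a rough illustration of that scoring idea, here is a small Python sketch. The comment above does not say how much each missing piece costs, so the equal weighting, the component identifiers and the function name `harness_score` are all assumptions, not an established formula.

```python
from typing import Iterable

# Components taken from the list above; the identifiers are made up for this sketch.
HARNESS_COMPONENTS = (
    "hardware", "model_knowledge", "model_intelligence", "model_context",
    "conversion", "quantization", "skills", "tools", "memory", "logs",
    "security", "rag", "online_search",
)


def harness_score(present: Iterable[str]) -> float:
    """Score a setup from 0 to 10, assuming every component weighs the same."""
    have = set(present)
    unknown = have - set(HARNESS_COMPONENTS)
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    return round(10 * len(have) / len(HARNESS_COMPONENTS), 1)
```

With everything in place this gives 10.0, and each missing component pulls the score down by the same amount. In practice some pieces (memory, tools) probably deserve more weight than others, which would be an easy change to make.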
To answer your question, I would definitely recommend something like HugstonOne (or at least the llama.cpp CLI) with a Qwen 3.6 35B finetune/distill (deepseek 4 or claude 4.7), and none of the current coding agents out there, which scream internet connection, proprietary API and data collection. Do this: if you can find a tool that you can download, point at a local model of your choice (in whatever local folder), and load ready for inference without any need for an internet connection, that is the tool you should aim for. Right now there is none out there.