My experience lines up with the article. The agentic stuff only works with the biggest models. (Well, "works"... OpenAI Codex took 200 requests with o4-mini to change like 3 lines of code...)
For simple changes I actually found smaller models better because they're so much faster. So I shifted my focus from "best model" to "stupidest I can get away with".
I've been pushing that idea even further. If you give up on agentic, you can go surgical. At that point even 100x smaller models can handle it. Just tell it what to do and let it give you the diff.
Also I found the "fumble around my filesystem" approach stupid for my scale, where I can mostly fit the whole codebase into the context. So I just dump src/ into the prompt. (Other people's projects are a lot more boilerplatey so I'm testing ultra cheap models like gpt-oss-20b for code search. For that, I think you can go even cheaper...)
Patent pending.
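For concreteness, the whole "surgical" loop is basically this - a minimal Python sketch, assuming an OpenAI-compatible API; the model name, file extensions, and task are just placeholders:

    import pathlib
    from openai import OpenAI  # assumes an OpenAI-compatible endpoint + API key in the env

    client = OpenAI()

    # dump the whole src/ tree into one prompt block
    code = ""
    for path in sorted(pathlib.Path("src").rglob("*")):
        if path.is_file() and path.suffix in {".py", ".ts", ".go", ".rs"}:  # adjust to taste
            code += f"\n--- {path} ---\n{path.read_text()}\n"

    task = "Rename the fetch_user helper to load_user everywhere."  # whatever you want changed

    resp = client.chat.completions.create(
        model="some-small-model",  # the stupidest model you can get away with
        messages=[{"role": "user",
                   "content": f"{code}\n\nTask: {task}\nReply with a unified diff only."}],
    )
    print(resp.choices[0].message.content)

Then you review the diff and apply it yourself; no agent loop, no tool calls.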
Aider as a non-agentic coding tool strikes a nice balance on the efficiency vs effectiveness front. Using tree-sitter to build a repo map means less filesystem digging. No MCP, but shell commands mean it can use the utilities I'm already familiar with. Combined with Cerebras as a provider, the turnaround on prompts is instant; I can stay involved rather than waiting on multiple rounds of tool calls. It's my go-to for smaller-scale projects.
Just put up a fork of aider that does do agentic commands: https://github.com/sutt/agent-aider
In testing I've found it underwhelming as an agent compared to Claude Code; I wrote up some case studies on it here: https://github.com/sutt/agro/blob/master/docs/case-studies/a...
It's a shame MCP didn't end up using a sandboxed shell (or something similar, maybe even simpler). All the pre-MCP agents I built just talked to the shell directly, since the models are already trained to do that.
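(The loop those agents ran is about as simple as it sounds - a bare-bones sketch, assuming an OpenAI-compatible API and, pointedly, no sandbox; model name and prompts are placeholders:)

    import subprocess
    from openai import OpenAI

    client = OpenAI()
    history = [
        {"role": "system",
         "content": "You are a coding agent. Reply with exactly one shell command, or DONE."},
        {"role": "user", "content": "List every file under src/ that contains a TODO."},
    ]

    while True:
        reply = client.chat.completions.create(model="some-model", messages=history)
        cmd = reply.choices[0].message.content.strip()
        if cmd == "DONE":
            break
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)  # no sandbox!
        history.append({"role": "assistant", "content": cmd})
        history.append({"role": "user", "content": result.stdout + result.stderr})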
I am developing the same opinion. I want something fast and dependable. Getting into a flow state is important to me, and I just can't do that when I'm waiting for an agentic coding assistant to terminate.
I'm also interested in smaller models for their speed. That, or a provider like Cerebras.
Then, if you narrow the problem domain you can increase the dependability. I am curious to hear more about your "surgical" tools.
I rambled about this on my blog about a week ago: https://hpincket.com/what-would-the-vim-of-llm-tooling-look-...
Well, most of the time I just dump the entire codebase in, if the context window is big and it's a good model. But there are plenty of times when I need to block one folder in a repo or disable a few files, because those files might "nudge" it in a wrong direction.
The surgical context tool is aicodeprep-gui. There are at least 30 similar tools, but most (if not all) are CLI only with no UI. I like UIs; I work faster with them for things like choosing individual files out of a big tree. At least it uses the PySide6 library, which is "lite" (could go lighter maybe); I HATE that too many things use webviews/browsers. All the options on it are there for good reasons. It's all focused on things that annoy me and slow me down, like doing something repeatedly (copy paste copy paste, or typing the same sentence over and over every time I have to do a certain thing with the AI and my code).
If you have not run 'aicp' (the command I gave it; there is also an OS installer menu that adds a right-click context menu to the Windows/Mac/Linux file managers) in a folder before, it will scan recursively to find code files. It skips things like node_modules or .venv, but otherwise assumes most types of code files will probably be added, so it checks them. You can fine-tune it: add some .md or .txt files or other stuff that isn't code but might be helpful. When you generate the context block, it puts the text from the prompt box at the top AND/OR the bottom - doing both can get better responses from the AI.
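(The first-run scan is conceptually something like this - a simplified Python sketch, not the actual aicodeprep-gui code; the skip list and extensions here are just examples:)

    import os

    SKIP_DIRS = {"node_modules", ".venv", ".git", "__pycache__"}   # roughly what gets skipped
    CODE_EXTS = {".py", ".js", ".ts", ".go", ".rs", ".c", ".md"}   # plus whatever you fine-tune

    def scan(root="."):
        """Walk the tree, prune vendored/venv dirs, pre-check likely code files."""
        checked = []
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]  # prune in place
            for name in filenames:
                if os.path.splitext(name)[1] in CODE_EXTS:
                    checked.append(os.path.join(dirpath, name))
        return checked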
It saves which files are checked, plus the window size and other window prefs, so you don't have to resize the window or redo the selection next time. I have been pasting the output from the LLMs into an agent like Cline, but I am wondering if I should add browser automation / a browser extension that does the copy-pasting, plus an option to edit/change files right after grabbing the output from a web chat. It's probably good enough as it is, though; not sure I want to make it into a big thing.
--- Yeah, I just keep coming back to this workflow; it's very reliable. I have not tried Claude Code yet, but I will soon to see if they solved any of these problems.
Strange that this thing has been at the top of Hacker News for hours and hours... weird! My server logs are just constantly scrolling.
Thanks for the article. I'm also doing a similar thing, here are my tips:
- https://chutes.ai - 200 requests per day if you deposit (one-time) $5 for top open weights models - GLM, Qwen, ...
- https://github.com/marketplace/models/ - around 10 requests per day to o3, ... if you have the $10 GitHub Copilot subscription
- https://ferdium.org - I open all the LLM webapps here as separate "apps", my one place to go to talk with LLMs, without mixing it with regular browsing
- https://www.cherry-ai.com - chat API frontend, you can use it instead of the default webpages for services which give you free API access - Google, OpenRouter, Chutes, Github Models, Pollinations, ...
I really recommend trying a chat API frontend; it simplifies talking with multiple models from various providers in a unified way and managing those conversations, exporting to markdown, ...
With chutes.ai, where do you see a one-time $5 for 200 requests/day?
Have you seen this? https://github.com/robertpiosik/CodeWebChat
aicodeprep-gui looks great. I will try it out
I agree. I find even Haiku good enough at managing the flow of the conversation and consulting larger models - Gemini 2.5 Pro or GPT-5 - for programming tasks.
For the last few days I have been experimenting with using Codex (via MCP, i.e. codex mcp) from Gemini CLI, and it works like a charm. Gemini CLI is mostly using Flash underneath, but that's good enough for formulating problems and re-evaluating answers.
Same with Claude Code - I ask it (via MCP) to consult Gemini 2.5 Pro.
Never had much success using Claude Code as an MCP server, though.
The original idea comes, of course, from Aider - using main, weak, and editor models all at once.
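A toy Python sketch of that split, for anyone who hasn't seen it - a strong model to plan, a cheap one to write the edit (placeholder model names; this isn't Aider's or Codex's actual plumbing):

    from openai import OpenAI

    client = OpenAI()

    def ask(model, prompt):
        r = client.chat.completions.create(model=model,
                                           messages=[{"role": "user", "content": prompt}])
        return r.choices[0].message.content

    problem = "Our retry loop hammers the API on 429s; fix it in client.py."
    # strong "main" model thinks the problem through...
    plan = ask("strong-model", "Describe precisely what change to make:\n" + problem)
    # ...cheap "editor" model turns the plan into a concrete edit
    print(ask("cheap-model", "Apply this plan to client.py as a unified diff:\n" + plan))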
They don't allow switching to models below GPT-5 in Codex CLI anymore (without an API key), because it's not recommended. Try it with thinking=high and it's quite an improvement over o4-mini. o4-mini is more like gpt-5-thinking-mini, but they don't allow that for Codex. gpt-5-thinking-high is more like o1, or maybe o3-pro.
For those who don't know, OpenAI Codex CLI will now work with your ChatGPT Plus or Pro account. They barely announced it, but it's on their GitHub page. You don't have to use an API key.
I use a 500-million-parameter model for editor completions because I want those to be nearly instantaneous, and the plugin makes 50+ completion requests every session.
What editor do you use, and how did you set it up? I've been thinking about trying this with some local models and also with super low-latency ones like Gemini 2.5 Flash Lite. Would love to read more about this.
Neovim with the llama.cpp plugin and heavily quantized qwen2.5-coder with 500 (600?) million parameters. It's almost plug and play although the default ring context limit is way too large if you don't have a GPU.
Can you share which model you are using?
Which model and which plugin, please?
You should try GLM 4.5; it's better in practice than Kimi K2 and Qwen3 Coder, but it's not getting much hype.
> (Well, "works"... OpenAI Codex took 200 requests with o4-mini to change like 3 lines of code...)
Let's keep things in perspective: multiple times in my life I have spent days on what ended up being maybe three lines of code.