This happens with every agent I've used when it comes to package.json files for npm. Instead of running `npm i foo`, the agent string-edits package.json and hallucinates some version to install. Usually it's a more or less OK version, but it's not how I would like this to work.

It's worse with renaming things in code. I've yet to see an agent use refactoring tools (if they even exist in VS Code) instead of brute-forcing renames with string replacement or sed. Agents go edit -> build -> read errors -> repeat instead of reaching for a reliable tool, and it burns a lot more GPU...

Worse still, I created an MCP server with refactoring tools and symbol-based editing, but because (a) it's out of distribution for the LLM and (b) agents get their own heavy-handed system prompts, all the goodies get ignored.
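For context, the tool surface I mean is roughly this shape; a minimal sketch using the MCP TypeScript SDK, where the tool name, parameters, and the elided rename implementation are all invented for illustration:

```typescript
// Minimal MCP server exposing a symbol-rename tool (hypothetical sketch).
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "refactor-tools", version: "0.1.0" });

// A rename tool backed by the compiler / language server, so the agent can
// ask for a project-wide rename instead of string-replacing and rebuilding.
server.tool(
  "rename_symbol",
  "Rename a symbol everywhere in the project using compiler-accurate references.",
  {
    file: z.string().describe("File containing the symbol's declaration"),
    symbol: z.string().describe("Current symbol name"),
    newName: z.string().describe("New symbol name"),
  },
  async ({ file, symbol, newName }) => {
    // ...delegate to the LSP / rename provider of choice here...
    return {
      content: [{ type: "text", text: `Renamed ${symbol} -> ${newName} starting from ${file}` }],
    };
  },
);

await server.connect(new StdioServerTransport());
```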

> This happens with all agents I've used and package.json files for npm. Instead of using `npm i foo` the agent string-edits package.json and hallucinates some version to install.

When using Codex, I usually have something like "Never add 3rd-party libraries unless explicitly requested. When adding new libraries, use `cargo add $crate` without specifying the version, so we get the latest version." in my instructions, and it seems to stop this issue from appearing at all.
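The npm equivalent, as a hypothetical AGENTS.md-style snippet (the file name and exact wording are just illustrative):

```markdown
## Dependencies
- Never add third-party libraries unless explicitly requested.
- To add a dependency, run `npm install <package>` without pinning a version
  by hand; let the package manager resolve it and write it into package.json.
- Never hand-edit version numbers in package.json or the lockfile.
```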

Eventually this specific issue will be RLHF’d out of existence. For now that should mostly solve the problem, but these models aren’t perfect at following instructions. Especially when you’re deep into the context window.

> Especially when you’re deep into the context window.

Though that is, at least to me, a bit of an anti-pattern for exactly that reason. I've found it far more successful to blow away the context and restart with a fresh prompt built from the old context, instead of having a very long-running back-and-forth.

It's better than it was with the latest models; I can have them stick around longer, but it's still a useful pattern even with 4.6/5.3.

Opus has also clearly been trained to clear the context fairly often through the plan/code/plan cycle.

> brute-forcing renames with string replacement

That's their strategy for everything the training data can't solve. This is the main reason the autonomous agent-swarm approach doesn't work for me: 20 bucks in tokens just obliterated by 5 agents exchanging hallucinations with each other. It's way too easy for them to amplify each other's mistakes without a human to intervene.

For the first, I think maintaining package-add instructions is table stakes; we need to be opinionated here. Agents are typically good at following them, and if not you can fall back on a Makefile that does everything.

For the second, I totally agree. I keep hoping that agents will get better at refactoring, and I think using LSPs effectively would make that happen. Claude took dozens of minutes to perform a rename that JetBrains would have executed perfectly in like five seconds. Its approach was to make a change, run the tests, and do it again. Nuts.
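For TypeScript at least, the deterministic version of that rename is a single compiler-backed operation; here's a rough sketch with ts-morph (the file and symbol names are made up):

```typescript
// One-shot, compiler-accurate rename across a whole project (sketch).
import { Project } from "ts-morph";

const project = new Project({ tsConfigFilePath: "tsconfig.json" });

// Hypothetical example: rename a function declared in src/user.ts.
// ts-morph drives the TypeScript language service, so every reference,
// import and export gets updated in one pass -- no edit/build/fix loop.
const fn = project
  .getSourceFileOrThrow("src/user.ts")
  .getFunctionOrThrow("getUserByID");
fn.rename("getUserById");

project.saveSync();
```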

Totally. Surely IDEs like Antigravity are meant to give the LLM more tools to use for e.g. refactoring or dependency management? I haven't used it, but it seems like a quick win to move from token generation to deterministic tool use.

As if. I've had Gemini get stuck in AG because it couldn't figure out how to use only one version of React. I managed to figure out that the build was failing because two versions of React were being used, but it kept saying "I'll remove React version N" and then proceeding to add a new dependency on the latest version. Loops and loops of this. On a similar note, AG really wants to parse code with weird grep commands that don't make any sense given the directory context.
