Context editing is interesting because most agents work on the assumption that the KV cache is the most important thing to optimise, and are therefore very hesitant to remove parts of the context mid-task. It also sometimes introduces hallucinations, because later parts of the context are written with the assumption that e.g. the tool results are still there, but they're not. Example: Manus [0]. E.g., read file A, make changes to A, then prompt for some more changes. If you now remove the "read file A" tool result, not only do you break the cache, but in my own agent implementations (on GPT-5 at least) the model can start hallucinating, since my prompts etc. all naturally assume the tool content is still there.
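Roughly the failure mode I mean, as a simplified sketch (the message shape just mimics the generic chat/tool-call format, not any specific provider's API):

```python
# Simplified sketch; the "<full contents of file A>" placeholder stands in
# for a genuinely large tool result.
messages = [
    {"role": "user", "content": "Read file A and refactor the parser."},
    {"role": "assistant", "tool_call": {"name": "read_file", "args": {"path": "A"}}},
    {"role": "tool", "name": "read_file", "content": "<full contents of file A>"},
    {"role": "assistant", "content": "Done, refactored the parser in file A."},
    {"role": "user", "content": "Now also rename the helper you saw in there."},
]

def compact_tool_results(messages, max_len=200):
    """Drop large tool outputs to save tokens. Two problems with doing this mid-task:
    1. The prefix changes, so the provider's KV/prompt cache from that point
       onward no longer applies.
    2. Later turns ("the helper you saw in there") still refer to content that
       is no longer in the context, which is where the hallucinations creep in.
    """
    compacted = []
    for m in messages:
        if m["role"] == "tool" and len(m.get("content", "")) > max_len:
            m = {**m, "content": "[tool result elided to save context]"}
        compacted.append(m)
    return compacted
```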
Plus, the model was trained and RLed on continuous contexts, unless they now also tune it on contexts that get edited mid-conversation.
[0] https://manus.im/blog/Context-Engineering-for-AI-Agents-Less...
Yes, we had the same issue with our coding agent. We found that instead of replacing large tool results in the context, it was sometimes better to have two agents: one long-lived agent that only sees small tool results, produced by another short-lived agent that actually reads and edits the large chunks. The downside is that you always have to manage which agent gets what context, and you also add a bit of latency and cost (slightly less prompt-cache reuse).
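Roughly the shape of it, as a hypothetical sketch (the names and prompts are made up, and call_llm stands in for whichever provider call you actually use):

```python
def call_llm(messages):
    raise NotImplementedError  # provider-specific

def run_worker(task, files):
    """Short-lived agent: gets the big file contents, does the reading/editing,
    and returns only a compact summary. Its context is then thrown away."""
    return call_llm([
        {"role": "system", "content": "Edit the given files and report a short summary of what changed."},
        {"role": "user", "content": f"Task: {task}\n\nFiles:\n{files}"},
    ])

def orchestrator_turn(history, task, files):
    """Long-lived agent: its history only accumulates small worker summaries,
    so its context stays small and its cached prefix stays valid."""
    summary = run_worker(task, files)  # e.g. "Renamed parse_row -> parse_record in parser.py"
    history.append({"role": "user", "content": f"Worker result: {summary}"})
    return call_llm(history)
```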
I found that having sub-agents just for running and writing unit tests fixed over 90% of my context woes.
Seems like that could be a job local LLMs do fairly well soon: not a ton of reasoning, just a basic ability to understand functions and write fairly boilerplate code. But it involves a ton of tokens, especially if you have lots of verbose output from a test run, so doing it locally could end up being a huge cost saving as well.
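Something like this, assuming a local OpenAI-compatible server (Ollama, llama.cpp and vLLM all expose one); the URLs, model names and the split are just illustrative, and auth headers are omitted:

```python
import requests

LOCAL = {"base_url": "http://localhost:11434/v1", "model": "qwen2.5-coder:7b"}
HOSTED = {"base_url": "https://api.example.com/v1", "model": "big-hosted-model"}

def chat(cfg, messages):
    # Minimal call against an OpenAI-compatible /chat/completions endpoint.
    resp = requests.post(
        f"{cfg['base_url']}/chat/completions",
        json={"model": cfg["model"], "messages": messages},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def test_subagent(code, test_output):
    # Token-heavy, low-reasoning work goes to the local model.
    return chat(LOCAL, [
        {"role": "system", "content": "Write or fix unit tests and summarize any failures in a few lines."},
        {"role": "user", "content": f"Code:\n{code}\n\nTest run output:\n{test_output}"},
    ])

def plan_next_step(history, test_summary):
    # The long-lived planning agent on the hosted model only ever sees the summary.
    return chat(HOSTED, history + [{"role": "user", "content": f"Test results: {test_summary}"}])
```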
Maybe, but you still need to pass some context to the sub-agent to describe the system under test.
This sounds like a good approach, I need to try it. I had good results using context7 in a specialized docs agent. I wasn't able to figure out how to limit an MCP server to a subagent, though; it's likely not supported.
We often talk about "hallucinations" as if they were their own thing, but is there really anything different about them from the LLM's normal output?
AFAICT, no. I think it just means "bad, unhelpful output" but isn't fundamentally different in any meaningful way from their super-helpful top-1% outputs.
It's kind of qualitatively different from the human perspective, so not a useless concept, but I think that is mainly because we can't help anthropomorphizing these things.