I think the author is looking for something that doesn't exist (yet?). I don't think there's an agent in existence that can handle a list of 128 exactly specified tasks in one session. You need multiple sessions with clear context to get exact results. Ralph loops, Gastown, taskmaster, etc. are built for this, and they exist almost entirely to correct drift like this over the longer term. The agent-makers and models are slowly catching up to these tricks (or to the shortcomings they exist to solve); some of what used to be standard practice in Ralph loops seems irrelevant now... and certainly the marketing for Opus 4.7 is "don't tell it what to do in detail, rather give it something broad".

In fairness to coding agents, most of coding is not exactly specified like this, and the right answer is very frequently to find the easiest path that the person asking might not have thought about, sometimes even in direct contradiction of specific points listed. Human requirements are usually much fuzzier. It's unusual for the person asking to have such a definite requirement that they've thought through this carefully.

Not with tools + supporting (traditional) code.

Just as a human would use a task list app or a notepad to keep track of which tasks need to be done, so can a model.

You can even have a mechanism for it to look at each task with a "clear head" (empty context) with the ability to "remember" previous task execution (via embedding the reasoning/output) in case parts were useful.
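A rough sketch of what I mean, in Python. Everything here is made up for illustration: `run_agent` is a hypothetical callable that makes one fresh model call per task, and `embed` is a toy bag-of-words stand-in for a real embedding model. The point is only the shape: each task starts from an empty context and pulls in at most a couple of similar earlier notes.

```python
import math
import re
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def run_tasks(tasks: list[str],
              run_agent: Callable[[str], str],
              recall: int = 2) -> list[str]:
    """Run each task with a fresh prompt plus a few recalled notes.

    run_agent is a hypothetical hook: one call = one empty-context session.
    """
    memory: list[tuple[Counter, str]] = []  # (embedding, note) per finished task
    results: list[str] = []
    for task in tasks:
        query = embed(task)
        # Rank earlier notes by similarity; keep only the top few that overlap.
        ranked = sorted(memory, key=lambda m: cosine(query, m[0]), reverse=True)
        notes = [note for vec, note in ranked[:recall] if cosine(query, vec) > 0]
        prompt = task
        if notes:
            prompt += "\n\nPossibly relevant earlier notes:\n" + "\n".join(notes)
        output = run_agent(prompt)  # nothing else is carried over
        results.append(output)
        note = f"{task} -> {output}"
        memory.append((embed(note), note))
    return results
```

The task list lives in ordinary code, so the model never has to remember which of the 128 items are done; it only ever sees one task plus whatever the recall step hands it.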

The article makes it seem like the author expected this without emptying context in between, which does not yet exist (actually I'm behind on playing with Opus 4.7; the Anthropic claim seems to be that longer sessions are OK now, and I'd be interested to hear results from anyone who has tried).

That is probably the next step, and in practice it is much of what sub-agents already provide: a kind of tabula rasa. Context is not always an advantage. Sometimes it becomes the problem.

In long editing sessions with multiple iterations, the context can accumulate stale information, and that actively hurts model performance. Compaction is one way to deal with that. It strips out material that should be re-read from disk instead of being carried forward.

A concrete example is iterative file editing with Codex. I rewrite parts of a file so they actually work and match the project’s style. Then Codex changes the code back to the version still sitting in its context. It does not stop to consider that, if an external edit was made, that edit is probably important.
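To make the behaviour I'd want concrete, here is a small Python sketch of a guard a harness could put around file writes. This is not how Codex or any other agent actually works; `TrackedFile` and `StaleContextError` are hypothetical names. The idea is to remember a hash of the file as last read, and to refuse to write if the file on disk no longer matches, which forces a re-read instead of silently restoring the stale version.

```python
import hashlib
from pathlib import Path


class StaleContextError(Exception):
    """Raised when the file changed on disk since it was last read."""


class TrackedFile:
    """Hypothetical wrapper: only allows writes based on a current read."""

    def __init__(self, path):
        self.path = Path(path)
        self.seen_digest = None  # hash of the content last placed in context

    def read(self) -> str:
        data = self.path.read_bytes()
        self.seen_digest = hashlib.sha256(data).hexdigest()
        return data.decode("utf-8")

    def write(self, new_text: str) -> None:
        on_disk = hashlib.sha256(self.path.read_bytes()).hexdigest()
        if self.seen_digest is None or on_disk != self.seen_digest:
            # Someone else edited the file: treat that edit as intentional.
            raise StaleContextError(f"{self.path} changed on disk; re-read it first")
        data = new_text.encode("utf-8")
        self.path.write_bytes(data)
        self.seen_digest = hashlib.sha256(data).hexdigest()
```

With something like this, an external edit turns into an explicit error the agent has to deal with by calling `read()` again, rather than something it overwrites from memory.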

I have the same experience of reversing intentional steps I've made, but with Claude Code. I find that committing a change that I want to version control seems to stop that behaviour.

Long context as a disadvantage is pretty well discussed, and agent-native compaction has been inferior to having it intentionally build the documentation that I want it to use. So far this has been my LLM-coding superpower. There are also a few products whose entire purpose is to provide structure that overcomes compaction shortcomings.

When Geoff Huntley said that Claude Code's "Ralph loop" didn't meet his standards ("this aint it"), the major bone of contention, as far as I can see, was that it ran subagents in a loop inside Claude Code with native compaction, as opposed to starting each iteration with a completely empty context.

I do see hints that improving compaction is a major area of work for agent-makers. I'm not certain where my advantage goes at that point.

Agreed. I am asking for something beyond the current state of the art. My guess is that stronger RL on the model side, together with better harness support, will eventually make it possible. However, it's the part about framing the failure to complete a task as a communication mishap that really makes me go awry.