Stealthily degrade the model or stealthily constrain the model with a tighter harness? These coding tools like Claude Code were created to overcome the shortcomings of last year's models. Models have gotten better but the harnesses have not been rebuilt from scratch to reflect improved planning and tool use inherent to newer models.

I do wonder how much all the engineering put into these coding tools may actually in some cases degrade coding performance relative to simpler instructions and terminal access. Not to mention that the monthly subscription pricing structure incentivizes building the harness to reduce token use. How much of that token efficiency is to the benefit of the user? Someone needs to be doing research comparing e.g. Claude Code vs generic code assist via API access with some minimal tooling and instructions.

I've been using pi.dev since December. The only significant change to the harness in that time which affects my usage is the availability of parallel tool calls. Yet Claude models have become unusable in the past month for many of the reasons observed here. Conclusion: it's not the harness.

I tend to agree about the legacy workarounds being actively harmful though. I tried out Zed agent for a while and I was SHOCKED at how bad its edit tool is compared to the search-and-replace tool in pi. I didn't find a single frontier model capable of using it reliably. By forking, it completely decouples models' thinking from their edits and then erases the evidence from their context. Agents ended up believing that a less capable subagent was making editing mistakes.

[deleted]

Are you using Pi with a cloud subscription, or are you using the API?

Out of curiosity, what can parallel tool calls do that one can't do with parallel subagents and background processes?

I feel like "feature/model freeze" may be justified

just call it something like "[month][year]edition" and work on next release

users spend effort arriving to narrow peak of performace, but every change keeps moving the peak sideways

The changes to reduce inference costs are intentional. Last thing you're going to do is have users linger on an older version that spends much more. This is essentially what's going on with layers upon layers of social engineering on top of it.

Love your point. Instructions found to be good by trial and error for one LLM may not be good for another LLM.

> Love your point. Instructions found to be good by trial and error for one LLM may not be good for another LLM.

Well, according to this story, instructions refined by trial and error over months might be good for one LLM on Tuesday, and then be bad for the same LLM on Wednesday.

Agree: it is Anthropic's aggressive changes to the harnesses and to the hidden base prompt we users do not see. Clearly intended to give long right tail users a haircut.