I was never able to get these models to collaborate with me the way Opus does. I'm probably an outliner, I don't one-shot projects, I don't vibe code. I basically use LLMs as if I were working with a coworker, a fairly smart one, but with a short memory and often missing the big picture. Sometimes I can delegate more, sometimes less, but I know I always have to stay on top of what's happening, because it WILL create a mess when it hits something hard. With the Anthropic models, this kind of cooperation is easy (with the exception of Opus 4.6, which was bad for some reason).
> Opus 4.6 which was bad for some reason
If I recall, that model had a couple issues. One was the issue of being monkeyed with, for which they gave everyone credits.
The other feature/bug, depending on your POV, was being Anthropic's least personable release, not papering over everything with self-help-guru therapy language.
Opus 4.6 didn't LARP. It was more direct, less fussy, less discussy, and very much less "wait, one more thing" within a couple edits after embarking on what should have been the spec, than 4.7 or 4.8 are.
When in engineer brain mode, working as you describe (good old-fashioned XP-style staff engineer pair programming with a language-savvy mentee who is not yet full-stack or system-wise), I found that the clearer I was about my goal and the better I could express it, the more often I'd get an expanded, clarified response. I could then iterate to steer for ever tighter, cleaner, more specified responses, then let it go build the whole thing without it agonizing and waffling.
The next two releases regressed on that dimension, wanting to figuratively "sit with" every decision and re-validate spiritual alignment along the way, no matter how clearly expressed.
Curiously to me, Fable seemed to hit the best of both worlds. I had the highest commit rate per turn with Fable, approaching 73%, where usually under 17% of the LOC written is good enough to commit and it takes 9-11 turns to get the code where I'm comfortable with it.
Thanks to this, Fable cost more, but actually cost less, if that makes sense.
Arguably, Fable and 4.6 played more outcome-correctness oriented than journey-experience oriented. It's easy to see how the opposite could happen with reinforcement learning from human feedback if not all judges are at staff or principal engineer level, or if constitution values are more Portlandia than Finlandia.
ANTHROP\C needs to balance these at the constitution level:
“We will work in a humane and thoughtful way, but production is the final judge. We will listen to people, but we will not let discussion replace decision. We will value craft, but not at the expense of usefulness. We will move fast, but not by hiding risk. We will measure outcomes, but not pretend that everything important is easy to measure.”
I considered Opus 4.5 to be the peak for a while. Opus 4.6 tended to overthink, and generally got lost in thinking. I'd ask something and Claude Code would just spin for 15 minutes. And it was not the harness: if I changed the model to 4.5, it was fine again. So I skipped the following releases. I've been working with Opus 4.8 for the last few weeks, and while I don't like how talkative it is, it is fine to work with interactively. I also used Fable for the few days it was available, and indeed, that was a model worth using for my use case. To the point, but still very interactive.
A lot of open-weight models don't understand intent well. They'll overfixate on a word in the prompt or just go off the rails trying to do too much work.
GLM-5.2 actually has really good intent understanding though, on par with GPT-5.5 and Opus from my experience.
I'll have to try it. I was using earlier GLM models, including 5.1, and was always disappointed.
What do they do instead of collaborating?