Looks like it's about a year behind. Not that I am complaining. A year behind is good progress.

I also feel much of the trick is in the reasoning and harness.

So some progress around that would accelerate this process.

Harness certainly matters a lot, though GLM is pretty forgiving. I just had Opus tell me that based on numbers over the last week, from quite a few billion tokens total across half a dozen providers, GLM 5.1 has been more reliable for one of my projects than Sonnet... Just switching on 5.2 now.

How are you collecting your metrics on token usage and reliability?

They are from my own runs, with reliability measured in terms of passing extensive test suites. So the caveat is that this applies to my specific use and might well vary greatly for others.
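A minimal sketch of that kind of bookkeeping, assuming each run is logged as a (model, passed) pair where "passed" means the full test suite went green. The record format, the function name and the model names are hypothetical, chosen only for illustration; they are not the actual data or tooling behind the numbers above.

```python
from collections import defaultdict


def pass_rates(runs):
    """Aggregate (model, passed) records into {model: (passed, total, rate)}."""
    counts = defaultdict(lambda: [0, 0])  # model -> [passed, total]
    for model, passed in runs:
        counts[model][1] += 1
        if passed:
            counts[model][0] += 1
    return {m: (p, t, p / t) for m, (p, t) in counts.items()}


# Made-up records, purely for illustration.
example_runs = [
    ("model-a", True), ("model-a", True), ("model-a", False), ("model-a", True),
    ("model-b", True), ("model-b", False), ("model-b", False), ("model-b", True),
]
```

A pass rate like this only says something about the workload it was measured on, which is the caveat above.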

And what do you base this on?

How does one objectively quantify how it stacks up to another model?

Or even, what is your subjective evaluation based on?

I really wonder - because I have just finished a fully vibe-coded gtk/rust/lua application with me basically writing 7% of the code (all in one module) and GLM 5.1 writing the rest. We haven’t had regressions, confusion or anything else. And I am pretty damned sure I couldn’t have managed this a year ago with Claude Code and Sonnet.

What harness, if you don't mind sharing?

Course not :)

I use pi (pi.dev).

I suspect some of the issue is that some harnesses are over-optimized for particular models and their preferences (tool calling, instructions to soften their deficiencies, etc.).

Pi is much more minimalist - probably a fairer point of comparison.

A different suspicion of mine is that some people over-specialize in a given model - or maybe become lazy with their prompts or suffer from skill issues.

Fwiw - I generally maintain a specs/ folder as I code.

I never use “plan” mode - I just tell the LLM to make no code changes, but discuss design with me.

At some point I am happy, and I typically ask it to summarize and write the actual spec. I review it, correct misunderstandings and ask for follow-up questions; we incorporate the additional details into the spec and move on.

I often have TODOs/tasks in those specs too and I regularly update progress on them. I also sometimes ask the LLM to review the actual code against the spec and search for differences - we then resolve them. Sometimes by modifying the code; sometimes by modifying the spec.
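As a rough sketch of how progress on those tasks could be tallied, assuming the specs are Markdown files in specs/ and tasks are written as checkbox items ("- [ ]" / "- [x]"). That file format and both function names are assumptions made for illustration; nothing above says the specs actually look like this.

```python
import re
from pathlib import Path

# Matches Markdown checkbox items such as "- [ ] task" or "* [x] task".
TASK = re.compile(r"^\s*[-*]\s+\[([ xX])\]\s+(.*)$")


def task_progress(text):
    """Return (done, open) lists of task descriptions found in one spec."""
    done, still_open = [], []
    for line in text.splitlines():
        match = TASK.match(line)
        if match:
            target = done if match.group(1).lower() == "x" else still_open
            target.append(match.group(2).strip())
    return done, still_open


def folder_progress(folder="specs"):
    """Map each *.md file in the (assumed) specs folder to its task lists."""
    return {
        path.name: task_progress(path.read_text(encoding="utf-8"))
        for path in sorted(Path(folder).glob("*.md"))
    }
```

The output of something like folder_progress() could be pasted into the conversation so the model sees which tasks are still open before it starts on the next one.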

For starters, I write an overview spec - nail down the big concepts and architectural choices at a high level. Moderately complicated facets of the application get their own spec - we write these as and when they become relevant.

I think it helps the model a lot because I can refer to the specs I feel are relevant when drafting new specs or solving tasks. And LLMs are generally better at proactively consulting these specs when getting an overview of the application and its design ahead of implementation.