Chinese models are almost certainly cheating on benchmarks. I would bet that if you saw the training data, the benchmark canaries would be in there.

GLM may be a good model in general, but it's benchmaxxed and definitely not as good as Opus 4.8.

Why would you say that?

I use DeepSeek V4 Flash (high) and MiMo 2.5 (non-Pro, because vision) to work on medium-sized projects (~1 million lines of code; C#, Go, TypeScript) with great success.

And that is coming from someone who used Opus 4.7 and GPT 5.5 as workhorses before.

And I'm pretty sure GLM 5.2 is better than the lighter models I use.

My workflow is simple: plan -> clarify -> implement.

1) plan prompt template: I describe what I need and ask the LLM to generate a markdown file containing an implementation plan plus at least 10 clarification questions for me to answer.

2) I answer the questions in the plan.md file.

3) implementation prompt template: I ask the LLM to implement plan.md and tell me at the end if there were any deviations or new findings during the implementation (there often are).
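
For anyone who wants to try this, here is a rough sketch of the two templates in Python. It is only an illustration of the three steps above, not the exact prompts I use: the function names, the wording of the prompts, and the convention that each question in plan.md is followed by an "Answer:" line are all assumptions made for the example.

```python
MIN_QUESTIONS = 10  # "at least 10 clarification questions" from step 1


def plan_prompt(task: str, plan_file: str = "plan.md") -> str:
    """Step 1: ask for an implementation plan plus clarification questions."""
    return (
        f"Task: {task}\n\n"
        f"Write a markdown file named {plan_file} containing:\n"
        "1. An implementation plan.\n"
        f"2. At least {MIN_QUESTIONS} clarification questions for me to answer, "
        "each followed by an empty 'Answer:' line.\n"
        "Do not write any code yet."
    )


def unanswered(plan_text: str) -> list[int]:
    """Step 2 check: line numbers of 'Answer:' lines that are still empty.

    The 'Answer:' line format is a hypothetical convention, not part of
    the original workflow description.
    """
    return [
        n
        for n, line in enumerate(plan_text.splitlines(), start=1)
        if line.strip().lower().startswith("answer:")
        and not line.split(":", 1)[1].strip()
    ]


def implementation_prompt(plan_file: str = "plan.md") -> str:
    """Step 3: implement the answered plan and report what changed."""
    return (
        f"Implement {plan_file}, using my answers to the clarification questions.\n"
        "At the end, list any deviations from the plan and any new findings "
        "made during implementation."
    )
```

The `unanswered` check is optional; it just stops me from sending the implementation prompt while a question in plan.md is still blank.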