The problem with these benchmarks is that the Chinese models tend to be incredible on paper, and absolutely terrible in practice :/
This was mostly a problem with older Qwen/MiMo/Kimi models. GLM has always been on the more robust side, and newer iterations from all those labs have improved as well. The only lab I've seen regressing this way is DeepSeek: 3.2 was fairly robust, but 4.0 feels more benchmaxxed.
I have used GLM since version 4.8 I think and do enjoy using them, more than other models like Kimi or DeepSeek. Though I have only tested them on smaller private projects.
> I have used GLM since version 4.8 I think
You probably refer to GLM-4.7
I beg to differ. I replaced a $40/mo GitHub Copilot subscription, where I used Opus 4.6 and GPT 5.5, with a $10/mo opencode Go plan, where I mostly use DeepSeek V4 Flash and am testing MiMo 2.5.
I work on mid-sized projects currently (200k to 1kk lines of code).
> 1kk lines of code
Isn't that a million?
Yep. I consider up to a million lines of code as mid-sized.
When I worked in banking, the codebases were often larger than a million.
You are obviously lying, because it shows you have no experience with them. GLM models since 4.5 have been crushing it. All their models since then haven't skipped a beat: 4.5/4.5-air, 4.6, 4.7, 4.8, 5, 5.1. That aside, MiMo V2.5, MiniMax from 2.0, DeepSeek from V3, Kimi since V2, Qwen since 3, and Hy3 have all been amazing models. All from China; we need to get over it. China is not losing yet as far as the AI race is concerned.
Is there a GLM-4.8 model?
[flagged]