I just tested GLM 5.2 out via Z.ai in pi for a little one-off project that was already scoped. It actually did a relatively decent job starting out, and figured important things out from context.

But the reasoning traces became increasingly hilarious, with it getting confused and going in loops, doubting itself. I began to feel almost sad, it was like listening to the internal monologue of someone with anxiety disorder.

It made pretty good progress but wound up going in a lot of goofy loops and doing things a bit "off" from standards I'd hoped it would infer, and finally started going a bit nuts, "This is very confusing.", "OH WAIT", seemingly hallucinating a whole side-quest that didn't make sense and looking at making internal system changes to try to achieve its (now very confused) goal when I pulled the plug.

Without seeing the reasoning traces from Claude/GPT it's hard to really know, but it definitely didn't feel like the same quality of reasoning, even if dogged persistence does wind up actually working eventually.

The reasoning traces always look terrible and they’re frustrating to watch. It’s the same with Kimi. What’s interesting is that the end result is then good. I think it’s just some sort of devils advocate trick to get better output.

The reasoning tokens are really just there to extend the amount the LLM can "compute" the problem; put another way, the only way a given model can "think" more about a problem is to fill more of its context with predicted tokens, which has the effect of increasing the accuracy of each token. The reinforcement learning these models go through generally doesn't care what the chain of thought tokens look like (outside of preventing loops/gibberish/reward hacking), only how good the final answer is - so while it does look something like "reasoning" to us and has a rough correlation with the final answer, treating it as actually representative of what the final answer will be or an actual thought process is giving those tokens too much credit :)

For me what really drove this point home (that reasoning traces aren't "real" by any reasonable definition of the term) was noticing instances of things being out of order and exhibiting various inconsistencies with the final answer. My favorite was an example posted to HN that went something along the lines of the model first output the conclusion, then performed the supposed derivation after the fact, then stated it needed to verify the earlier conclusion to verify the derivation was correct so it hallucinated a tool call, then it remarked positively about the verification matching, and finally it output a slightly different answer. At no point was the answer actually correct although it was vaguely in the ballpark.

as compared to what though? you can't see the actual think traces for opus or gpt.

Compared to what comes out at the end. Like if you sit there watching Kimi k2.6 "think", you're like "what? no you fucking idiot!" and you get this urge to "steer" it and so on, but very rarely is that steering actually necessary, it just winds up popping out the correct answer and all of those 'Wait! That's it! I found it! Actually ... Let me just' is just whatever internal processing it needed to use to get to the correct response. Mostly likely it's just being self-adversarial and exploring a bunch of dumb avenues to isolate the best outcome with the highest probability

thinkslop recursion.

I have a hilarious theory why GLM (and Kimi) have this thinkslop,

apparently Chinese language as token is more information dense than English, so having these wasteful thinkslop in Mandarin isnt that damaging. So the developer focus mostly in Mandarin and didnt think of handling these thinkslop while American AI labs do.

I think the self-doubt might actually be a very crucial part of it's capability. I often feel compelled to interrupt when I'm watching it think (which thank the stars it let's us do, unlike the big American models!!), but usually it makes the right pick!

Being willing and able to reconsider seems very good. Going around and around, pulling in more thinking, integrating it: maybe that's why it is as good as it's good.

I want to emphasize again how excellent it is that we can see the thinking. I think this makes GLM so much better an experience for me. It gives me such insight into what is being considered, helps me see where things go wrong. It grounds me, gives me the notion of where the results come from. It was so jarring to switch to GPT and Opus and find that they won't discuss with me, won't reveal their thinking: that feels fundamentally unsafe, for me, for society, to have such a severe black box. I don't think it should be allowed, honestly.

Many thanks to this recent submission, which is the first time I've seen anyone blog about this core difference: The text in Claude Code’s “Extended Thinking” output is not authentic. https://patrickmccanna.net/the-text-in-claude-codes-extended... https://news.ycombinator.com/item?id=48630535

Your post made me laugh because I experienced the same as you but the other way around. I switched from Claude to a multi model harness a couple of days ago and the first model I tried was GLM5.2.

I gave it some simple code porting exercises and watched dumbfounded at the reasoning, which was more like the ravings of a lunatic - but lo and behold, after much confusion and a dizzying number of eureka moments the task was completed very successfully.

I tried Kimi on a similar task, much faster, a little more reassuring somehow in its ramblings, also surprisingly good results.

To be clear, I’m not surprised the results were good because they’re not GPT or Claude, but because the line of reasoning was so bonkers. Coming from Claude, I was just not used to seeing this, but I’ll bet it’s just as nuts with the frontier models and we’re just not allowed to see it (I’m about to read the links you shared).

Agree wholeheartedly that transparency is of grave importance.

> Coming from Claude, I was just not used to seeing this

Claude doesn't show its internal Cot?

If you look at the "thinking" traces as ways of expressions of uncertainty rather than literal thinking they make more sense.

Consider debugging - you start off in one place, think you have worked out what is happening, and then there is a "oh but what about xxx" thing that happens and you explore another branch. Then you "have it for sure" until you find another edge case.

The LLM is doing something analogous. It's writing circuits to try to emulate your program. Each time it gets one that seems right it is very sure that circuit is correct, but then it finds another thing.

At any point you can stop and go "write code now" and it will, and the code will seems fine provided it hasn't hit one of these edge cases.

Turning up thinking time is literally forcing more exploration.

The words that come out are amusingly dramatic, but... TBH when I debug I often are like "WTF" and throwing my hands up in the air at some gotcha I didn't expect.

Yeah isn't that thinking weird?

Now I see the issue clearly! But wait... now I have the full picture! But wait... Found it!

I gave up a few times because of it at first until I realized I just had to let GLM get on with it and what came out was great!

But once it was outright endearing- challenging bug, it said: I have been very thorough. Then it escalated where to look and aced it. Built in confucian values

I'm like 90% sure the harnesses inject those tokens into the ai to make them check their work. Things like "but wait" and "but what if..." etc. Like literally inserting them artificially and then say "carry on from here" to the ai so it's working as though it itself output those words so the ai has an opportunity to turn around if its making a mistake. repeat a bunch of times and we get something useful.

I started noticing those in gh copilot right around when they turned off thinking traces end of last year

If there’s one thing I’ve learned these past couple of days, it’s to resist the temptation to jab the escape button and start waving my arms! I wonder how much of this cyclical self doubt / self congratulating I go through in my own thoughts without even realising it. If you could verbalise or articulate all the half thoughts, snatches of ideas, feelings and ruminations the human mind goes through on some tasks it might be even more bizarre (or could just be me)

[deleted]

[dead]