>The model still does its full reasoning before generating the response, caveman just affects how the final response is formatted

Right, and that final response forms the context for your next follow-up prompt. Not having that reasoning laid out in the conversation history leaves a huge gap for successive turns. I remember playing around with this idea back in the Sonnet 3.x days, and it was immediately obvious how badly the ability to handle long-running tasks degraded. If you're only doing single-shot work for some reason, sure, but that's not what most real-world usage looks like these days.

I don't know how Claude and the others do it, but the latest Qwen model supports preserving reasoning between calls, which from what I've heard helps a fair bit.
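For anyone curious what "preserving reasoning between calls" looks like in practice, here's a minimal sketch. It assumes an OpenAI-style chat API that returns the model's reasoning separately from its answer (some providers expose a `reasoning_content`-style field); the function and field names are illustrative, not a real client library.

```python
def append_turn(history, answer, reasoning=None, keep_reasoning=True):
    """Add an assistant turn to the history; optionally keep its reasoning.

    `reasoning` is whatever reasoning text the API returned for this turn
    (a hypothetical field -- check your provider's docs for the real one).
    """
    turn = {"role": "assistant", "content": answer}
    if reasoning and keep_reasoning:
        # Some APIs accept the reasoning back as its own field; folding it
        # into the content with explicit markers is the portable fallback.
        turn["content"] = f"<think>{reasoning}</think>\n{answer}"
    history.append(turn)
    return history

history = [{"role": "user", "content": "Plan the refactor."}]
append_turn(
    history,
    "Step 1: extract the parser.",
    reasoning="The parser is the riskiest module, so isolate it first.",
)
# The next request now carries the prior reasoning along with the answer,
# so the model can build on it instead of re-deriving it from scratch.
```

If you set `keep_reasoning=False` you get the stripped-context behavior being complained about above: only the bare final answers survive into the next turn.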

Qwen continues to surprise and outshine. It's been an enjoyable and unexpected new player, especially this past month!