IMO "thinking" here means "computation", like running matrix multiplications. Another view could be: "thinking" means "producing tokens". This doesn't require any proof because it's literally what the models do.
As I understand it, the claim is: more tokens = more computation = more "thinking" => answer probably better.
I don't agree with GP's take on anthropomorphising[0], but in this particular discussion I meant something even simpler by "thinking" - imagine it more like manually stepping a CPU, or powering a machine by turning a crank. Each output token is kinda like a clock signal, or a full crank turn. There's lots of highly complex stuff happening inside the CPU/machine - circuits switching, gears turning - but there's a limit to how much of it can happen in a single cycle.
Say that limit is X. This means if your problem fundamentally requires at least Y compute to be solved, your machine will never give you a reliable answer in fewer than ceil(Y/X) steps.
LLMs are like this - a loop is programmed to step the CPU/turn the crank until the machine emits a magic "stop" token. So in this sense, asking an LLM to be concise means reducing the amount of compute it can perform, and if you insist on it too much, it may stop so early as to have been fundamentally unable to solve the problem in the computational space allotted.
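The crank view above can be sketched in a few lines. This is a toy model, not a real LLM stack: the names (step_compute, problem_cost, step_budget) are made up for the illustration, and a real model's per-token compute isn't a neat scalar - but the ceil(Y/X) bound falls out the same way.

```python
import math

# Toy "crank" model: each output token buys at most a fixed amount of
# compute (step_compute). All names/numbers here are illustrative.

def min_steps(problem_cost: float, step_compute: float) -> int:
    """Lower bound on output tokens needed if the problem requires
    `problem_cost` compute and each token provides `step_compute`."""
    return math.ceil(problem_cost / step_compute)

def can_solve(step_budget: int, problem_cost: float, step_compute: float) -> bool:
    """Crank the machine `step_budget` times; did enough total compute
    accumulate to have plausibly solved the problem?"""
    return step_budget * step_compute >= problem_cost

# Problem needs 1000 units; each token provides 64.
print(min_steps(1000, 64))        # -> 16 steps minimum
print(can_solve(8, 1000, 64))     # forced to be "concise": False
print(can_solve(16, 1000, 64))    # given enough cranks: True
```

The point is just that a token cap is a compute cap: below ceil(Y/X) tokens, no amount of cleverness inside a single step can make up the deficit.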
This perspective requires no assumptions about "thinking" or anything human-like happening inside - it follows just from time and energy being finite :).
--
[0] - I strongly think the industry is doing itself a huge disservice by avoiding anthropomorphizing LLMs: treating them as "little people on a chip" is the best high-level model we have for understanding their failure modes and their role in larger computing systems. Instead, we have tons of people wasting their collective effort trying to fix the "lethal trifecta" as if it were a software bug and not a fundamental property of what makes LLMs interesting. I already wrote more on this in this thread, so I'll stop here.