The thing that TFA doesn't seem to go into is that these mathematical results apply to human agents in exactly the same way as they do to AI agents, and nevertheless we have massive codebases like Linux. If people can figure out how to do it, then there's no math that can help you prove that AIs can't.

Ive yet to see a human process which used an excessive number of cheap junior developers precisely architected to create high quality software.

If that could have been achieved it would have been very profitable, too. There's no shortage of cheap, motivated interns/3rd world devs and the executive class prefer to rely on disposable resources even when it costs more overall.

The net result was always the opposite though - one or two juniors on a leash could be productive but more than that and it always caused more problems than it solved.

Seeing the same problem with agents. Multi agent orchestration seems like a scam to manufacture demand for tokens.

I'm in absolute agreement that the AI coordination problem exists today when the AI is at junior level. I'm just saying that the mathematical argument is silly to apply to arbitrary future AIs, if and when they reach human capability. Because while coordination problems have not been mathematically solved, the world economy is a case in point that it is possible to coordinate human-level agents to achieve large scale projects at generally sufficient quality levels.

So to be clear, I'm not advising anyone to change their current token consumption habit. I'm just saying that it's silly to apply math to prove the impossibility of something we can literally see around us. It's like a mathematical proof that water isn't really wet.

Humans can also count the number of Rs in strawberry, but good luck proving that mathematically

LLMs don't see words. They see tokens, which is why previously they had a hard time counting the r's.

You can certainly prove that mathematically, and giving that proof to an LLM it will give you the correct answer. Which is a prompting technique btw to improve accuracy of an LLMs results.

Although the more recent LLMs have built in approaches to answer this question.

Gemma4 for example:

-----

The user is asking for the number of 'r's in the word "strawberry".

* S

* T

* *R* (1)

* A

* W

* B

* E

* *R* (2)

* *R* (3)

* Y

There are 3 'r's.

-----

GPT-OSS-120b example.

-----

The user asks: "How many r's in strawberry." This is a straightforward counting problem: count the letter 'r' in the word "strawberry". The word "strawberry" contains letters: s t r a w b e r r y. Count r's: there are three 'r's? Let's check: s(1) t(2) r(3) a(4) w(5) b(6) e(7) r(8) r(9) y(10). Actually the word is "strawberry". Let's write out: s, t, r, a, w, b, e, r, r, y. So there are three r's: at positions 3, 8, 9. Yes, three r's. So answer: 3.

But we may also note that "r" appears three times.

Thus answer: 3.

We can provide a short answer.

Thus final: There are three r's in "strawberry".

----

Doubt if you can make a dumb model smart by feeding it proofs

https://www.promptingguide.ai/techniques/knowledge

Sohnds like a great way to fill up the context before you even start.

Yes, what's your point? That is literally what it does - it adds relevant knowledge to the prompt before generating a response, in order to ground it me effectively.

My point is that this doesn't scale. You want the LLM to have knowledge embedded in its weights, not prompted in.

It scales fine if done correctly.

Even with the weights the extra context allows it to move to the correct space.

Much the same as humans there are terms that are meaningless without knowing the context.

I doubt it is possible to mathematically prove much inside of a black box of billions of interconnected weights. But at least in the narrow case of the strawberry problem, it seems likely that LLM inference could reliably recognizing that sort of problem as the type that would benefit from a letter counting tool call as part of the response.