So this year SotA models have gotten gold at IMO, IOI, ICPC and beat 9/10 humans in that AtCoder contest that tested optimisation problems. Yet the most reposted headlines and rhetoric are "wall this", "stagnation that", "model regression", "winter", "bubble", doom, etc.

In 2015 SotA models blew past all expectations for engine performance in Go, but that didn't translate into LLM-based code agents for another ~7 years (and even now the performance of these is up for debate). I think what this shows is that humans are extremely bad at understanding what problems are "hard" for computers; or rather, we don't understand how to group tasks by difficulty in a generalizable way (success in a previously "hard" domain doesn't necessarily translate to performance in other domains of seemingly comparable difficulty). It's incredibly impressive how these models perform in these contests, and it certainly demonstrates that these tools have high potential in *specific areas*, but I think we might also need to accept that these are not necessarily good benchmarks for these tools' efficacy in less structured problem spaces.

Copying from a comment I made a few weeks ago:

> I dunno, I can see an argument that something like IMO word problems are categorically a different language space than a corpus of historiography. For one, even when expressed in English, math is still highly, highly structured. Definitions of terms are totally unambiguous, logical tautologies can be expressed using only a few tokens, etc. It's incredibly impressive that these rich structures can be learned by such a flexible model class, but it definitely seems closer (to me) to excelling at chess or other structured games, versus something as ambiguous as synthesis of historical narratives.

edit: oh small world! the cited comment was actually a response to you in that other thread :D

> edit: oh small world the cited comment was actually a response to you in that other thread :D

That's hilarious, we must have the same interests since we keep cross posting :D

The thing with the Go comparison is that AlphaGo was meant to solve Go and nothing else. It couldn't do chess with the same weights.

The current SotA LLMs are "unreasonably good" at a LOT of tasks, while being trained with a very "simple" objective: NTP (next-token prediction). That's the key difference here. We have these "stochastic parrots" + RL + compute that basically solve top tier competitions in math, coding, and who knows what else... I think it's insanely good for what it is.
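For anyone unfamiliar, NTP literally just means "predict the next token": shift the sequence by one and minimize cross-entropy. A minimal sketch, where `model` is a hypothetical stand-in for any network mapping token ids to per-position vocabulary logits:

    import torch
    import torch.nn.functional as F

    def ntp_loss(model, tokens):          # tokens: (batch, T) integer ids
        logits = model(tokens[:, :-1])    # logits for positions 1..T-1 -> (batch, T-1, vocab)
        targets = tokens[:, 1:]           # the "next" tokens              (batch, T-1)
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
        )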

> I think it's insanely good for what it is.

Oh totally! I think that the progress made in NLP, as well as the surprising collision of NLP with seemingly unrelated spaces (like ICPC word problems), is nothing short of revolutionary. Nevertheless I also see stuff like this: https://dynomight.substack.com/p/chess

To me this suggests that this out-of-domain performance is more like an unexpected boon, rather than a guarantee of future performance. The "and who knows what else..." is kind of what I'm getting at: so far we are turning out to be bad at predicting where these tools are going to excel or fall short. To me this is sort of where the "wall" stuff comes from; despite all the incredible successes in these structured problem domains, nobody (in my personal opinion) has really unlocked the "killer app" yet. My belief is that by accepting their limitations we might better position ourselves to laser-target LLMs at the kind of things they rule at, rather than trying to make them "everything tools".

A lot of the current code and science capabilities do not come from NTP training.

Indeed, it seems most language-model RL doesn't even use process supervision, so it's a long way from NTP.

Even Sam Altman himself thinks we’re in a bubble, and he ought to have a good sense of the wind direction here.

I think the contradiction here can be reconciled by the fact that these tests don't tend to run under the hardware constraints the models would face at scale. And herein lies a large part of the problem as far as I can tell: in late 2024, OpenAI realized they had to rethink GPT-5 since their first attempt became too costly to run. This delayed the model, and when it finally released, it was not a revolutionary update but evolutionary at best compared to o3. Benchmarks published by OpenAI themselves indicated a 10% gain over o3 for God knows how much cash and well over a year of work. We certainly didn't have those problems in 2023 or even 2024.

DeepSeek has had to delay R2, and Mistral has had to delay Mistral 3 Large, which was teased back in May as only weeks away. No word from either about what's going on. DeepSeek is said to be moving more to Huawei hardware and that this is behind the delay, but I don't think it's entirely clear that performance issues aren't also a factor.

It would be more strange to _not_ have people speculate about stagnation or bubbles given these events and public statements.

Personally, I'm not sure if stagnation is the right word. We're seeing a lot of innovation in toolsets and platforms surrounding LLMs, like Codex, Claude Code, etc. I think we'll see more in this regard and that this will provide more value than the core improvements to the LLMs themselves in 2026.

And as for the bubble, I think we are in one, but mostly because the market has been so incredibly hot. I see a bubble not because AI will fall apart but because there are too many products and services right now in a gold rush era. Companies will fail, but not because AI suddenly starts failing us; it will be due to saturation.

Sam Altman proclaiming we are in a bubble benefits him. It lowers the price of potential targets for acquisitions. I bet you didn't think of that, did you?

> it was not a revolutionary update but evolutionary at best compared to o3

It is a revolutionary update if compared to the previous major release (GPT-4 from March 2023).

There is a clear difference between what OpenAI manages to do with GPT-5 and what I manage to do with GPT-5. The other day I asked for code to generate a linear regression and it gave back a figure of some points and a line through it.
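To be concrete about what I was expecting: something on the order of this minimal numpy sketch (made-up data, obviously) would have been a fine answer, instead of just a figure:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=50)
    y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=50)   # noisy line

    # Ordinary least squares fit of y = a*x + b
    a, b = np.polyfit(x, y, deg=1)
    print(f"slope={a:.3f}, intercept={b:.3f}")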

If GPT-5, as claimed, is able to solve all problems in ICPC, please give the instructions on how I can reproduce it.

I believe this is going to be an increasingly important factor.

Call it the “shoelace fallacy”: Alice is supposedly much smarter but Bob can tie his shoelaces just as well.

The choice of eval, prompt scaffolding, etc. all dramatically impact the intelligence that these models exhibit. If you need a PhD to coax PhD performance from these systems, you can see why the non-expert reaction is “LLMs are dumb” / progress has stalled.

Yeah, until OpenAI says "we pasted the questions from ICPC into chatgpt.com and it scored 12/12" the average user isn't really going to be able to reproduce their results.

The average user will never need to answer ICPC questions though.

No, but average users have things they want to do that require ICPC-level problem solutions. Like making optimized games, etc.; average users want that for sure.

The average person doesn't need to do that. The benchmark for "is this response accurate and personable enough" on any basic chat app has been saturated for at least a year at this point.

Are you using the thinking model or the non-thinking model? Maybe you can share your chat.

I prefer not to due to privacy concerns. Perhaps you can try yourself?

I will say that after checking, I see that the model is set to "Auto" and, as mentioned, it used almost 8 minutes. The prompt I used was:

    Solve the following problem from a competitive programming contest. Output only the exact code needed to get it to pass on the submission server.
It did a lot of thinking, including:

    I need to tackle a problem where no web-based help is available. The task involves checking if a given tree can be the result of inserting numbers 1 to n into an empty skew heap, following the described insertion algorithm. I have to figure out the minimal and maximal permutations that produce such a tree.
And I can see that it visited 13 webpages, including icpc, codeforces, geeksforgeeks, github, tehrantimes, arxiv, facebook, stackoverflow, etc.
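For context, the skew heap insertion the problem refers to is itself a short, standard routine; the hard part is the reasoning over permutations, not the data structure. A minimal sketch of the min-heap variant (the contest's exact statement may differ):

    class Node:
        def __init__(self, key, left=None, right=None):
            self.key, self.left, self.right = key, left, right

    def merge(a, b):
        # Standard skew-heap merge: keep the smaller root, merge the other
        # heap into its right subtree, then swap the children.
        if a is None:
            return b
        if b is None:
            return a
        if b.key < a.key:
            a, b = b, a
        a.right = merge(a.right, b)
        a.left, a.right = a.right, a.left
        return a

    def insert(root, key):
        return merge(root, Node(key))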

A terse prompt and expecting a one-shot answer is really not how you'd get an LLM to solve complex problems.

I don't know what DeepMind and OpenAI did in this case, but to get an idea of the kind of scaffolding and prompting strategy that one might want, have a look at this paper where some folks used the normal, generally available Gemini 2.5 Pro to solve 5/6 of the 2025 IMO problems: https://arxiv.org/pdf/2507.15855
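I don't know the exact scaffolding used in that paper, but the general shape of this kind of pipeline is a generate / self-verify / refine loop, roughly like the sketch below; `ask_model` is a hypothetical wrapper around whatever LLM API you're using, and the prompts are placeholders, not the paper's:

    def solve(problem, ask_model, max_rounds=5):
        solution = ask_model(f"Solve rigorously, showing every step:\n{problem}")
        for _ in range(max_rounds):
            critique = ask_model(
                f"Act as a strict grader. List any gaps or errors, or reply "
                f"'no issues found'.\n\nProblem:\n{problem}\n\nSolution:\n{solution}"
            )
            if "no issues found" in critique.lower():
                break
            solution = ask_model(
                f"Revise the solution to address these issues:\n{critique}\n\n"
                f"Problem:\n{problem}\n\nCurrent solution:\n{solution}"
            )
        return solution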

The point of the GPT-5 model is that it is supposed to route between thinking/non-thinking smartly. Leveraging prompt hacks such as instructing it to "think carefully" to force routing to the thinking model goes against OpenAI's claims.

Just select GPT-5 Thinking if you need anything done with competence. The regular GPT-5 is nothing impressive and is geared more towards regular daily-life chatting.

Are you sure? I thought you can only specify reasoning_effort and that's it.
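If you're going through the API rather than chatgpt.com, my understanding (treat the model id and accepted values as possibly out of date) is that you pick the effort explicitly, something like:

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-5",                 # assumed model id
        reasoning_effort="high",       # e.g. "minimal" / "low" / "medium" / "high"
        messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    )
    print(resp.choices[0].message.content)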

If you can't get a modern LLM to generate a simple linear regression I think what you have is a problem between the keyboard and the chair...

My response simply is that performance in coding competitions such as ICPC is a very different skillset than what is required in a regular software engineering job. GPT-5 still cannot make sense of my company's legacy codebase even if asked to do the most basic tasks that a new grad out of college can figure out in a day or two. I recently asked it to fix a broken test (I had messed with it by changing one single assertion) and it declared "success" by deleting the entire test suite.

This. Dealing with the problems of a real-world legacy code base is the exact opposite of a perfectly constrained problem that has been verified for internal consistency (probably by both computers and humans) and presented neatly in a single PDF. There are dozens, if not hundreds, of assumptions that humans make while solving a problem (e.g., make sure you don't crash the website on your first day at work!) that an LLM is not going to. It's similar to why, despite all the hype, Waymo cars are still supervised by human drivers nearly 100% of the time and can't even park themselves reliably without stalling for no apparent reason.

> Waymo cars are still being supervised by human drivers nearly 100% of the time

That seems...highly implausible?

I mean that a human is ready to jump in at any point an "exception" happens.

Example: During parking, which I witness daily in my building, it happens all the time.

1. Car gets stuck trying to park, blocking either the garage or a whole SF street.
2. A human intervenes, either in person (most often) or seemingly remotely, to get the car unstuck.

I'm not in the US and have never seen a self-driving car.

Can you explain how a human intervenes in person?

Do you mean these cars have a human driver on board? Or the passenger drives? Or another car drops off a driver? Or your car park is such an annoying edge case that a driver hangs around there all the time just to help park the cars?

Similar experience with Windsurf.

I had a class of 5 or so test methods - ABCDE. I asked it to fix C, so it started typing out B token-by-token underneath C, such that my source file was now ABCBDE.

I don't think I'm smart enough to get it to do coding activities.

> it declared "success" by deleting the entire test suite.

The paperclip trivial solution!

> So this year SotA models have gotten gold at IMO, IOI, ICPC

> Yet the most reposted headlines and rhetoric are "wall this", "stagnation that", "model regression", "winter", "bubble", doom, etc.

This is a narrow niche with a high amount of training data (they all buy training data from LeetCode), and these results are not necessarily generalizable to everyday industrial tasks.

People pattern match with a very low-resolution view of the world (web3/crypto/NFTs were a bubble because there was hype, so there must be a bubble since AI is hyped! I am very smart) and fail to reckon with the very real ways in which AI is fundamentally different.

Also I think people do understand just how big of a deal AI is but don't want to accept it, or at least publicly admit it, because they are scared for a number of reasons, not least of which is human irrelevance.

Historically there has been a gap between the performance of AI in test environments and its impact in the real world, and that makes people who have been through the cycle a few times cautious about extrapolating.

In 2016, Geoffrey Hinton said vision models would put radiologists out of business within 5-10 years. Ten years on, there is a shortage of radiologists in the US and AI hasn't disrupted the industry.

The DARPA Grand Challenge for autonomous vehicles was won in 2005; twenty years on, self-driving cars still have limited deployment.

The real world is more complex than computer scientists appreciate.

Two days ago I talked to someone in water management about data centers. One of the big players wanted to build a center that would consume as much water as a medium-sized town, in semi-arid bushland. A week before that, it was a substation that would take a decade to source the transformers for. Before that, it was buying closed-down coal power plants.

I don't know if we're in a bubble for model capabilities, but we are definitely hitting the wall in terms of what the rest of the physical economy can provide.

You can't undo 50 years of deferred maintenance in three months.

Getting well funded commercial demand is exactly how you undo it.

Not in three months. It will take years if not decades.

What happens when OpenAI and friends go bust because China is drowning in spare grid capacity and releasing SotA open-weight models like R1 every other week?

Then every company building infrastructure for AI also goes out of business, and we end up in a worse position than we are in now: instead of having even a tiny industry building infrastructure at the level required to replace what has reached end of life, we have nothing.

Well, the supposed PhD-level models are still pretty dumb when they get to consumers, so what gives?

The last time I asked for a code review from AI was last week. It added (hallucinated) some extra lines to the code and then marked them as buggy. Yes, it beats humans at coding — great!

What's "it"? What was your prompt?

It's important to look closely at the details of how these models actually do these things.

If you look at the details of how Google got gold at IMO, you'll see that AlphaGeometry only relies on LLMs for a very specific part of the whole system, and the LLM wasn't the core problem solving system in play.

Most of AlphaGeometry is standard algorithms at play solving geometry problems using known constraints. When the algorithmic system gets stuck, it reaches out to LLMs that were fine tuned specifically for creating new geometric constraints. So the LLM would create new geometric constraints and pass that back to the algorithmic parts to get it unstuck, and repeat.
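Schematically (the function names below are hypothetical stand-ins, not DeepMind's actual API), the loop looks something like:

    def prove(problem, symbolic_engine, construction_lm, max_aux=10):
        state = symbolic_engine.initial_state(problem)
        for _ in range(max_aux):
            state = symbolic_engine.saturate(state)      # deduce everything it can
            if symbolic_engine.is_solved(state):
                return symbolic_engine.extract_proof(state)
            # Stuck: ask the fine-tuned LM for a new auxiliary construction
            # (a point, line, circle, ...) and hand control back.
            aux = construction_lm.propose(state)
            state = symbolic_engine.add_construction(state, aux)
        return None  # gave up within the construction budget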

Without more details, it's not clear whether this win came from the same GPT-5 and Gemini models we use, or from specially fine-tuned models integrated with other non-LLM, non-ML systems to solve these problems.

Not being solved purely by an LLM isn't a knock on it, but in the current conversation these results are heavily marketed as "LLMs did this all by themselves", which doesn't match a lot of the evidence I've personally seen.

>This achievement is a significant advance over last year’s breakthrough result. At IMO 2024, AlphaGeometry and AlphaProof required experts to first translate problems from natural language into domain-specific languages, such as Lean, and vice-versa for the proofs. It also took two to three days of computation. This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions – all within the 4.5-hour competition time limit.

[1] https://deepmind.google/discover/blog/advanced-version-of-ge...

3 days of computation is crazy and definitely not on par with human contestants.

AlphaGeometry/AlphaProof (the one you're thinking of, where they used LLMs + Lean) was last year, and they "only" got silver. This year's IMO gold results were end-to-end in natural language.

"We used a custom AI that requires a small nuclear plant to be trained and function to beat three humans consuming 400 watts per day" isn't as impressive as it sounds

Where these competitions differ from real life is that evaluating a solution is much easier than generating a solution. We're at the point where AI can do a pretty good job of evaluating solutions, which is definitely an impressive step. We're also at the point where AI can generate candidate solutions to problems like these, which is also impressive. But the degree to which that translates to practical utility is questionable.

The sibling commenter compared this to Go, but we could go back to comparing it with chess. Deep Blue didn't play chess the way a human did. It deployed massive amounts of compute to look at as many future board states as possible and see which move would work out. People who said that a computer that could play chess as well as a human would be as smart as a human ended up eating crow. These modern AIs are also not playing these competitions the way a human does. Comparing their intelligence to that of humans is similarly fallacious.
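To make the contrast concrete, Deep Blue-style play is essentially brute-force look-ahead plus a hand-tuned evaluation function; a bare-bones depth-limited minimax sketch (the `state` interface here is a hypothetical stand-in):

    def minimax(state, depth, maximizing):
        # Enumerate future positions to a fixed depth and back up the best
        # evaluation; no "understanding", just search plus a scoring function.
        if depth == 0 or state.is_terminal():
            return state.evaluate()
        children = (state.apply(move) for move in state.legal_moves())
        values = (minimax(c, depth - 1, not maximizing) for c in children)
        return max(values) if maximizing else min(values)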

This comment makes me think: what did previous winners of these competitions go on to do in their lives? Anything spectacular?

Indeed.

I personally view all this stuff as noise. I'm more interested in seeing contributions to the real economy, not competition results that are irrelevant to people's welfare.

The wall is that we need to throw trillions of dollars of hardware at these "breakthroughs"; LLMs still use the same algorithm from the last few years. We need a new algorithmic breakthrough, because otherwise buying hardware to increase intelligence isn't scalable.

Don't worry, they're just stochastic parrots copying answers from Stack Overflow. ;)

People are having a tough time coping with what the near future holds for them. It is quite hard for a typical person to imagine how disruptive and exponential coming world events can be, as Covid showed.