Has anyone noticed that models are dropping ever faster, with pressure on companies to make incremental releases to claim the pole position, yet making strides on benchmarks? This is what recursive self-improvement with human support looks like.
Remember when ARC 1 was basically solved, and then ARC 2 (which is even easier for humans) came out, and all of a sudden the same models that were doing well on ARC 1 couldn’t even get 5% on ARC 2? I'm not convinced these benchmark improvements aren't data leakage.
Look at the ARC site. The scores of these models are plotted against their "cost per task". All of these huge jumps come with massive increases in cost per task, including Gemini 3.1 Pro, which increased by 4.2x.
ARC 2 was made specifically to artificially lower contemporary LLM scores, so any kind of model improvement will have outsized effects.
Also, people use "saturated" too liberally. The top-left corner at 1 cent per task is saturated, IMO, since there are billions of people who would happily solve ARC 1 tasks at 52 cents per task. On ARC 2, a human could make thousands of dollars a day at 99.99% accuracy.
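A back-of-the-envelope sketch of that claim (all rates and solving speeds here are my own illustrative assumptions, not figures from the leaderboard):

```python
# Rough earnings for a human solving ARC tasks for pay.
# All numbers are illustrative assumptions, not leaderboard figures.

def daily_earnings(pay_per_task, minutes_per_task, hours_per_day=8):
    """Pay earned per day at a given rate and solving speed."""
    tasks_per_day = (hours_per_day * 60) / minutes_per_task
    return tasks_per_day * pay_per_task

# ARC 1: assume $0.52/task and ~1 minute per task for a practiced human
arc1 = daily_earnings(0.52, 1)   # 480 tasks -> ~$250/day

# ARC 2: assume ~$5/task (models spend dollars per task here) and 2 min/task
arc2 = daily_earnings(5.00, 2)   # 240 tasks -> $1200/day

print(f"ARC 1: ${arc1:.0f}/day, ARC 2: ${arc2:.0f}/day")
```

So "thousands of dollars a day" only holds if ARC 2 pays out at something like what the models currently cost per task.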
How much do I get if I solve this? :D
https://arcprize.org/play
You are saying something interesting but too esoteric. Can you explain for beginners?
I don't think there's much recursive improvement yet.
I'd say it's a combination of
A) Before, new model releases were mostly a new base model trained from scratch, with more parameters and more tokens. This takes many months. Now that RL is used so heavily, you can make infinitely many tweaks to the RL setup and, in just a month, get a better model using the same base model.
B) There's more compute online
C) Competition is more fierce.
this is mostly because RLVR is driving all of the recent gains, and you can continue improving the model by running it longer (+ adding new tasks / verifiers)
so we'll keep seeing more frequent flag planting checkpoint releases to not allow anyone to be able to claim SOTA for too long
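For anyone unfamiliar with RLVR (RL with verifiable rewards): the key property is that a programmatic verifier scores each sampled answer, so "adding new tasks/verifiers" really is just adding more checkable problems. A toy sketch of the idea (the task, reward scheme, and filtering step are hypothetical illustrations, not any lab's actual pipeline):

```python
import random

# Toy RLVR loop: sample answers, score them with a programmatic verifier,
# and keep only the verified samples for the next training step.
# Everything here is a hypothetical illustration, not a real pipeline.

def make_task():
    a, b = random.randint(1, 99), random.randint(1, 99)
    return f"{a}+{b}", a + b          # prompt, ground truth the verifier knows

def verifier(answer, truth):
    return 1.0 if answer == truth else 0.0   # verifiable reward: exact match

def model_sample(prompt):
    # stand-in for an LLM: usually right, sometimes off by one
    a, b = map(int, prompt.split("+"))
    return a + b + random.choice([0, 0, 0, 1])

def rlvr_step(n_samples=64):
    batch = []
    for _ in range(n_samples):
        prompt, truth = make_task()
        answer = model_sample(prompt)
        reward = verifier(answer, truth)
        if reward > 0:                # e.g. rejection sampling / filtering
            batch.append((prompt, answer, reward))
    return batch                      # verified samples to fine-tune on

random.seed(0)
kept = rlvr_step()
print(f"kept {len(kept)}/64 verified samples")
```

Running the model longer, or against more verifiers, just grows the pool of verified samples, which is why checkpoints can keep improving without a new base model.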
I feel like they're actually dropping slower. Chinese models are dropping right before lunar new year as seems to be an emerging tradition.
A couple of western models have dropped around the same time too, but I don't think the "strides on benchmarks" are that impressive when you consider how many tokens are being spent to make those "improvements". E.g. Gemini 3.1 Pro's ARC-AGI-2 score went from 33.6% to 77.1%, buuut its "cost per task" also increased by 4.2x. It seems to be the same story for most of these benchmark improvements, and similar for the Claude model improvements.
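To make that concrete, here's the arithmetic on the two numbers cited above (assuming, for simplicity, that cost scales linearly with tokens spent):

```python
# Score gain vs cost increase for Gemini 3.1 Pro on ARC-AGI-2,
# using the figures quoted in the comment above.
old_score, new_score = 33.6, 77.1   # percent
cost_multiplier = 4.2

score_multiplier = new_score / old_score
print(f"score x{score_multiplier:.2f}, cost x{cost_multiplier}")

# Score-per-dollar only improves if score grows faster than cost:
print("better score-per-dollar:", score_multiplier > cost_multiplier)
```

The score went up about 2.3x while cost went up 4.2x, so on a score-per-dollar basis the new checkpoint is actually worse.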
I'm not convinced there's been any substantial jump in capabilities. More likely these companies have scaled their datacenters to allow for more token usage
Not much to do with self-improvement as such. OpenAI has increased its pace; others are pretty much consistent. Google last year had three versions of gemini-2.5-pro, each within a month of the last. Anthropic released Claude 3 in March '24, Sonnet 3.5 in June '24, 3.5 (new) in October '24, then 3.7 in February '25, before moving to the 4 series in May '25. That was followed by Opus 4.1 in August, Sonnet 4.5 in October, Opus 4.5 in November, 4.6 in February, and Sonnet 4.6 in February as well. Yes, they released both within weeks of each other, but originally they would only have released them together; this staggered release is what creates the impression of fast releases. It's as much a function of training as of available compute, and they have ramped up in that regard.
With the advent of MoEs, efficiency gains became possible. However, MoEs still operate far from the balance and stability of dense models. My view is that most progress comes from router tuning based on good and bad outcomes, with only marginal gains in real intelligence.
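For context, the "router" here is the small gating network that picks which experts handle each token. A minimal top-k routing sketch (shapes, names, and the random weights are illustrative, not any particular model's architecture):

```python
import numpy as np

# Minimal top-k MoE router sketch: a learned gate picks k experts per token
# and mixes their outputs by the renormalized gate weights.
# All shapes and weights here are illustrative.

rng = np.random.default_rng(0)
d_model, n_experts, k = 16, 8, 2

W_gate = rng.normal(size=(d_model, n_experts))              # router weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x):
    logits = x @ W_gate                                     # (n_experts,)
    top = np.argsort(logits)[-k:]                           # top-k expert indices
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                                      # softmax over chosen k
    # "Router tuning" in the sense above would mean adjusting W_gate based on
    # good/bad outcomes, while leaving the experts themselves mostly as-is.
    return sum(g * (x @ experts[i]) for g, i in zip(gate, top))

y = moe_forward(rng.normal(size=d_model))
print(y.shape)  # (16,)
```

Because only W_gate decides which experts fire, relatively cheap changes to it can shift benchmark behavior without the experts getting any "smarter", which is the marginal-gains point.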
It's becoming impossible to keep up - in the last week or so we've had: Gemini 3 Deep Think, Gemini 3.1 Pro, Claude Sonnet 4.6, GPT-5.3-Codex Spark, GLM-5, Minimax-2.5, Step 3.5 Flash, Qwen 3.5 and Grok 4.20.
and I'm sure others I've missed...
And has anyone noticed that the pace has broken xAI, and they've just fallen behind? The frontier improvement release loop is now ant -> openai -> google.
xAI just released Grok 4.20 beta yesterday or the day before?
Musk said Grok 5 is currently being trained, and it has 7 trillion params (Grok 4 had 3)
My understanding is that all recent gains are from post training and no one (publicly) knows how much scaling pretraining will still help at this point.
Happy to learn more about this if anyone has more information.
You gain more benefit spending compute on post-training than on pre-training.
But scaling pre-training is still worth it if you can afford it.
That's what scaling compute depth to respond to the competition looks like: lighting those dollars on fire.
This is what competition looks like.
Going only on my historical experience, and not on Gemini 3.1 Pro specifically: I think we see benchmark chasing, then a grand release of a model that gets press attention...
Then, a few days later, the model/settings are degraded to save money. This gets repeated until the last day before the release of the next model.
If we are benchmaxing, this works well, because the model is only being tested early in its life cycle. By the middle of the cycle, people are testing other models. By the end, people are not testing them at all, and if they did, it would barely shake the last months of data.
I have a relatively consistent weekday task, involving new information, that sits at the edge of its intelligence. Interestingly, 3.0 Flash was good when it came out, took a nosedive a month back, and is now excellent; I actually can't fault it, it's so good.
Its performance in Antigravity has also actually improved since launch day, when it was giving non-stop TypeScript errors (not sure if that was Antigravity itself).