Benchmarks seem like a fool's errand at this point; models are being over-tuned to specific, already-published tests rather than being made to generalize.

Hugging Face has a leaderboard, and it seems dominated by fine-tunes of various common open-source models that nonetheless don't seem to see broader use:

https://huggingface.co/open-llm-leaderboard

There are quite a few benchmarks for which that's not the case:

- live benchmarks (livebench, livecodebench, matharena, SWE-rebench, etc)

- benchmarks that do not have a fixed structure, like games or human feedback benches (balrog, videogamebench, arena)

- (to some extent) benchmarks without existing/published answers (putnambench, frontiermath). You could argue that someone could hire people to solve those or pay off the benchmark developers, but it's much more complicated.

It's true that most benchmarks which don't try to tackle future contamination are much less useful. Unfortunately, HLE kind of ignored it (they plan to add a hidden set to test for contamination, but once the answers are out there, it's a lost game IMHO); I really liked the concept.

Edit: it is true that these benchmarks focus only on a fairly specific subset of model capabilities. For everything else, a vibe check is your best bet.

I agree with you.

Of course, some benchmarks are still valid and will remain valid. E.g., we can make models play chess against each other and score them on how well they do. But those benchmarks are in general fairly narrow. They don't really measure the "broader" intelligence we are after. And often, LLMs perform worse than specialized models; I don't think there is any LLM out there that can beat a traditional chess program (certainly not using the same computing power).
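To make the chess example concrete, here is a minimal sketch of "play them against each other and score them". It assumes the python-chess package, and `ask_model_for_move` is a hypothetical stand-in for the actual LLM call (here it just plays a random legal move so the script runs on its own):

```python
# Toy round-robin: two "models" play chess, and we tally the score.
# Requires python-chess (pip install chess). The model call is a stub.
import random
import chess

def ask_model_for_move(model_name: str, board: chess.Board) -> chess.Move:
    # Hypothetical: a real benchmark would prompt `model_name` with the
    # position (e.g. as FEN) and parse the reply, falling back to a legal move.
    return random.choice(list(board.legal_moves))

def play_game(white: str, black: str, max_fullmoves: int = 200) -> str:
    board = chess.Board()
    while not board.is_game_over() and board.fullmove_number < max_fullmoves:
        mover = white if board.turn == chess.WHITE else black
        board.push(ask_model_for_move(mover, board))
    return board.result(claim_draw=True)  # "1-0", "0-1", "1/2-1/2", or "*"

models = ["model_a", "model_b"]           # made-up names
score = {m: 0.0 for m in models}
for _ in range(4):
    result = play_game("model_a", "model_b")
    if result == "1-0":
        score["model_a"] += 1
    elif result == "0-1":
        score["model_b"] += 1
    else:                                 # draw or unfinished: split the point
        score["model_a"] += 0.5
        score["model_b"] += 0.5
print(score)
```

Such a setup is well-defined and hard to contaminate, but, as said above, it only measures chess.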

What is really bad are the QA benchmarks, which leak over time into the training data of the models. And one can suspect that even big labs have an economic incentive to score well on popular benchmarks, which causes them to manipulate their models way beyond what is reasonable.

And taking a bunch of flawed benchmarks, combining them into indexes, and saying this model is 2% better than that model is completely meaningless, but of course it's fun and draws a lot of attention.

So, yes, we are kind of left with vibe checks, but in theory, we could do more; take a bunch of models, double-blind, and have a big enough, representative group of human evaluators score them against each other on meaningful subjects.

Of course, done right, that would be really expensive. And those sponsoring might not like the result.
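For the scoring half of such a setup, one common option (and roughly what arena-style leaderboards compute) is to fit a Bradley-Terry model over the blinded pairwise votes. A minimal sketch, with made-up model names and votes:

```python
# Fit Bradley-Terry strengths from double-blind pairwise human votes.
# Everything below (model names, votes) is made up for illustration.
from collections import defaultdict

# Each vote is (winner, loser) from a blinded A/B comparison.
votes = [
    ("model_a", "model_b"),
    ("model_b", "model_c"),
    ("model_a", "model_c"),
    ("model_a", "model_b"),
]

models = {m for pair in votes for m in pair}
wins = defaultdict(lambda: defaultdict(int))
for w, l in votes:
    wins[w][l] += 1

# Iterative (minorization-maximization) fit of Bradley-Terry strengths.
strength = {m: 1.0 for m in models}
for _ in range(200):
    new = {}
    for i in models:
        total_wins = sum(wins[i][j] for j in models if j != i)
        denom = sum(
            (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
            for j in models if j != i
        )
        new[i] = total_wins / denom if denom > 0 else strength[i]
    norm = sum(new.values())
    strength = {m: s / norm for m, s in new.items()}  # normalize to sum to 1

for m, s in sorted(strength.items(), key=lambda kv: -kv[1]):
    print(f"{m}: {s:.3f}")
```

The math is the cheap part; the expense mentioned above is in recruiting representative evaluators, blinding, and choosing meaningful prompts.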

> But those benchmarks are in general fairly narrow. They don't really measure the "broader" intelligence we are after.

I think a general model that can

- finish nethack, doom, zelda and civilization,

- solve the hardest codeforces/atcoder problems,

- formally prove putnam solutions with high probability, without being given the answers,

- write a PR to close a random issue on github

is likely to have some broader intelligence. I may be mistaken, since there were tasks in the past that appeared to be unsolvable without human-level intelligence, but in fact weren't.

I agree that such benchmarks are limited to either environments with well-defined feedback and rules (games) or easily verifiable ones (code/math), but I wouldn't say that's super narrow, and there are no non-LLM models that perform significantly better on these (except in some games), though specialized LLMs do work better. Finding other examples is, I think, one of the important problems in AI metrology.

> So, yes, we are kind of left with vibe checks, but in theory, we could do more; take a bunch of models, double-blind, and have a big enough, representative group of human evaluators score them against each other on meaningful subjects.

You've invented the arena (which just raised quite a lot of money). One can argue about "representative," of course. However, I think the SNR in the arena is not too high now: it turns out that the average arena user is quite biased, most of their queries are trivial for LLMs, and for the non-trivial ones they cannot necessarily figure out which answer is better. MathArena goes in the opposite direction: narrow domain, but expert evaluation. You could imagine a bunch of small arenas, each with its own domain experts. I think it may happen eventually if the flow of money into AI continues.

A couple of things:

I wasn't trying to invent anything. Just describing what you would obviously have to do if you were to take a "scientific" or "objective" approach: Sound experiments, reproducible, free of financial incentives.

As far as I can tell, no one is doing that at a significant scale. Everything is buried in hype and marketing.

Now for that broad set of benchmarks (PRs to GitHub, Putnam, Zelda). There is something to that, but it depends on the model. A lot of what is out there are "mixtures of experts," either by implicit or explicit design. So there is a mechanism that looks at the problem and then picks a subsystem to delegate it to. Is it a game of chess? Boot up the chess program. Is it poetry? Boot up the poetry generator.

That sort of thing is not showing broad intelligence any more than a person who knows both a chess player and a poet has broad intelligence.

Deepseek is, as far as I can tell, the leading open-source model; and in some way, that makes it the leading model. I don't think you can fairly compare a model that you can run locally with something that is running behind a server-side API - because who knows what is really going on behind the API.

Deepseek being Chinese makes it political and even harder to have a sane conversation about; but I am sure that had it been China that did mostly closed models and the US that did open ones, we would hold that against them, big time.

> So there is a mechanism that looks at the problem and then picks a subsystem to delegate it to. Is it a game of chess? Boot up the chess program. Is it poetry? Boot up the poetry generator.

No, that's not actually a good description of the mixture-of-experts methodology. It was poorly named. There is no conscious division of the weights into "this subset is good for poetry, this one is best for programming, this one for math, this one for games, this one for language translation, etc."
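To illustrate the point, here is a toy sketch of the routing inside a single MoE layer: a learned gate scores every expert for every token and mixes the top-k expert feed-forward blocks, and nothing in the training objective assigns an expert to a human-legible topic like chess or poetry. All shapes and numbers below are made up.

```python
# Toy MoE layer: per-token top-k routing over parallel "expert" FFN blocks.
# The gate and experts are random here; in a real model they are learned jointly.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

W_gate = rng.normal(size=(d_model, n_experts))               # gating weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:                  # x: (n_tokens, d_model)
    logits = x @ W_gate                                      # gate score per token/expert
    top = np.argsort(logits, axis=-1)[:, -top_k:]            # top-k expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                                         # softmax over selected experts
        for weight, e in zip(w, top[t]):
            out[t] += weight * (x[t] @ experts[e])           # mix the chosen experts' outputs
    return out

tokens = rng.normal(size=(4, d_model))
print(moe_layer(tokens).shape)   # (4, 16): every token gets its own expert mix
```

The routing is per token and emerges from training; empirically, which tokens end up at which expert tends not to line up with clean topics.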

> I wasn't trying to invent anything. Just describing what you would obviously have to do if you were to take a "scientific" or "objective" approach: Sound experiments, reproducible, free of financial incentives.

But how is it different from what arena or matharena does?

> That sort of thing is not showing broad intelligence any more than a person who knows both a chess player and a poet has broad intelligence.

The claim is that these problems require somewhat broad intelligence by themselves, as opposed to specialization in one specific task while being unable to do anything else.

Right, all benchmarks collapse once you go beyond 32K tokens. I've rarely seen any benchmarks focusing on long context, which is where most programming needs are.

The only benchmarks that match my experience with different models are here https://livebench.ai/#/

livebench was good, but now it's a joke. It ranks Gemini Flash better at coding than Pro and Sonnet 3.7. And that's only the beginning of the weird results.

Flash is better than Pro at coding? Whoa... [makes a note to try a few things later today]

Out of curiosity, how did you gauge that?

I think your parent comment is citing that as an example of why livebench is no longer a good benchmark. That said, the new Flash is very good for what it is, and IMO after the Pro 05-06 nerfs the two models are much closer in performance for many tasks than they really should be — Pro should be / was way better (RIP 03-25 release). That livebench result may be wrong about the specific ranking, but I think it's right that Flash is in the same class of coding strength as Sonnet 3.7.

Thanks, that's very informative.

My ignorance is showing here: why is the Pro 05-06 a nerf?


> models are being over-tuned to specific, already-published tests rather than being made to generalize.

I think you just described SATs and other standardized tests

The SAT has a correlation with IQ of 0.82 to 0.86, and I do think IQ is very useful for judging intelligence.

https://gwern.net/doc/iq/high/smpy/2004-frey.pdf

It's a useful diagnostic when used in a battery of diagnostic tests of cognitive function, but to the point of this thread: it is notoriously not a good ranking mechanism.

Artificial Analysis is the only stable source. Don't look at others like HF Leaderboard.

https://artificialanalysis.ai/