I agree with your general gist. It comes down to “the best tool for the particular job”, keeping token spend and other costs in mind as well.

What I do know absolutely for sure is that LLM benchmarks are not to be trusted; they are just a minor indicator, and real-world usage is often very different.

I share this sense, but my immediate thought is that we need to improve the evaluations! Do you think this is impossible? That there is something ineffable that cannot be captured empirically? I kind of have an intuitive sense that this is the case, but at the same time I think it's unlikely to really be true.

We shouldn't just measure the power of the raw LLM; harnesses matter more and more.

It's like taking the engine out of each car, putting it on a test bed, running it, and then deciding whether the car is good or bad based on the graphs the test bed produced.

You might have the best engine in the world, but if you put it in a shit car, the result is still bad. The seats are squeaky plastic, the infotainment is touch-only and you can't put on your seatbelt without knocking down whatever is in the cupholder.

Aren't there benchmarks that measure at the harness level as well?

How would you benchmark "agent harness communicates with user clearly"? It's 100% a feels measurement.

Following the original comment's point: if every model requires a different prompting technique to maximize its output, how can a benchmark based on sending the same prompt to all models be accurate? We would have to create different prompts for each model, but then how reliable and unbiased can the benchmark be?

It is a fundamentally hard problem to solve.

I'm not GP, but yes, I think it's impossible.

Take AI out of the picture for a moment. What makes someone a good coder? What makes someone intelligent? How do you evaluate those skills?

Of course we have standardized tests, and they're useful, but they're also imperfect. And they become especially imperfect when people start training for the tests specifically—which is, essentially, benchmaxxing.

We have never been able to quantitatively measure most skills to a high degree of accuracy, despite centuries of trying. That's not going to change now.

(I don't mean to anthropomorphize the LLMs, but I do think they're like humans in this way.)

The reason we can’t capture it empirically is that nobody truly knows exactly what we are supposed to be using these tools for or how they are going to operate. We are still fitting square pegs into round holes with them. We are told to treat them like bespoke tools for coding, shopping, tech support, etc. But they are not actually purpose-built for any of these things.

When I use a calculator, I know exactly what it does and what it is supposed to do. It always gives me a verifiable, predictable result. If I input “8+8” 10,000x it will give me “16” 10,000x, outside of incredibly fringe edge cases/bugs. I can’t say the same for LLMs.
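To make the calculator comparison concrete, here is a rough sketch of how one could put a number on repeatability: ask the same thing many times and see how often the most common answer comes back. Everything named here is an assumption for illustration; `make_fake_model` is a hypothetical stand-in that replays canned answers, not a real LLM client.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Iterable


def agreement_rate(ask: Callable[[str], str], prompt: str, runs: int = 100) -> float:
    """Share of runs that return the single most common answer (1.0 = fully repeatable)."""
    answers = Counter(ask(prompt).strip() for _ in range(runs))
    return answers.most_common(1)[0][1] / runs


def calculator(prompt: str) -> str:
    """Deterministic baseline: only understands 'a+b'."""
    left, right = prompt.split("+")
    return str(int(left) + int(right))


def make_fake_model(canned_answers: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM call: replays canned answers in a loop."""
    replay = cycle(canned_answers)
    return lambda prompt: next(replay)


fake_llm = make_fake_model(["16", "16", "16", "sixteen"])
print(agreement_rate(calculator, "8+8", runs=1000))  # 1.0
print(agreement_rate(fake_llm, "8+8", runs=100))     # 0.75
```

Note that exact string matching is crude: it counts "sixteen" as a disagreement even though it is the right answer, which is already a small taste of why scoring free-form output is hard.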

Yes, how do we know Opus 4.8 hasn't been trained on the SWE-Bench examples?

With a squillion dollars at stake per bench point, someone will have figured out a plausibly deniable way to game these benchmarks.

Er, the SWE-bench examples are particularly bad in that respect, as they are just publicly available historical PRs. So if the models are trained on GitHub data, those PRs will be included.

So almost by design that particular benchmark is tainted, and it measures recall rather than reasoning.

Wow, that's worse than I thought, and it breaks the number one rule of machine learning: you don't train the model on your test dataset.
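For anyone who does have access to a training corpus, a crude way to check for this kind of leakage is word n-gram overlap between a benchmark item and the corpus. This is only a sketch under my own assumptions (the function names, the choice of `n = 8`, and the sample strings are all made up), and of course outsiders can't run it against a closed model's data, which is the whole problem.

```python
import re


def ngrams(text: str, n: int = 8) -> set:
    """Set of lowercase word n-grams; punctuation is ignored."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_item: str, corpus_docs: list, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur somewhere in the corpus."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Made-up example data, not taken from any real benchmark.
item = "fix off by one error in the pagination helper when the page size is zero"
leaky_corpus = ["Merged PR: " + item + ". Thanks!"]
clean_corpus = ["add dark mode toggle to the settings page and update the docs accordingly"]

print(contamination_rate(item, leaky_corpus))  # 1.0
print(contamination_rate(item, clean_corpus))  # 0.0
```

A high score only shows the text was seen verbatim; a paraphrased or reformatted copy would slip through, so a low score proves very little.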

What would it take to have trustworthy benchmarks? As with all "targets", they can be gamed - but I am curious about quantifiable quality metrics.

The issue with LLM benchmarks is similar to the one with car reviews. E.g., journalists almost always get the fully equipped model, so their review is honest but sort of rigged.

I haven’t seen the details of LLM benchmarks’ datasets, but I would suppose that the “questions” are public and therefore known in advance, so you can tune a model to them as much as you like.

One real benchmark is drawing a pelican - https://github.com/simonw/pelican-bicycle - which Simon Willison made for his own LLM tests.

If you really want to find a model that works for your specific purpose, I would recommend several rounds at arena.ai - the models are anonymous, which helps you pick one without confirmation bias.
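If you keep a tally of your own blind rounds, a simple Elo-style update is one way to turn "I preferred answer A" votes into a ranking. I don't know how arena.ai actually computes its leaderboard, so treat this as a generic sketch; the starting rating of 1000, the `k` factor of 32, and the model names are all my assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def record_vote(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Update ratings in place after one blind pairwise comparison."""
    r_win = ratings.get(winner, 1000.0)
    r_lose = ratings.get(loser, 1000.0)
    gain = k * (1.0 - expected_score(r_win, r_lose))
    ratings[winner] = r_win + gain   # an upset win moves the rating more
    ratings[loser] = r_lose - gain   # zero-sum: the loser gives up the same amount


# Hypothetical votes: (preferred model, other model), recorded before the names were revealed.
votes = [("model_a", "model_b"), ("model_a", "model_b"),
         ("model_b", "model_a"), ("model_a", "model_b")]

ratings = {}
for winner, loser in votes:
    record_vote(ratings, winner, loser)

print(sorted(ratings, key=ratings.get, reverse=True))  # ['model_a', 'model_b']
```

With only a handful of votes the numbers are mostly noise, and the result depends on the order of the votes, so this is more of a personal scratchpad than a benchmark.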

Some people: Claude is the best! Others, in reply: but Qwen is the best! Or… Codex is better! …

It all depends on the language (English, Dutch, French…) and the style of querying (caveman, specs, skills, goal, etc.).

Even with the same model, I get different answers to the same prompt when it is tweaked just a little.

So benchmarks are nice but mostly useless.

Without your use case, it is just a reference number indicating the approximate position of that model among the others. And for those who want to make money, it is a marketing tool to sell more, as every customer counts.

You can't measure "feels".

One good analogy is the MacBook vs generic Windows laptop debate online.

The engineer's mind just compares numbers: the Lingwoo laptop from Amazon has the biggest numbers for everything and the lowest price. Ergo, it is the best.

But the numbers don't measure the fact that the Lingwoo creaks and squeaks when you lift it due to the cheap plastic. It also runs at 100 °C when both CPU and GPU are fully utilised. The keyboard feels like a membrane keyboard from a milspec device from the 90s. Numbers also don't measure the fact that Lingwoo is an alphabet-soup whitelabel manufacturer that won't exist in any legal capacity in 6 months, so good luck with any warranty issues.

There will be an identical laptop called Chongwin being sold though. Completely different company, definitely.

--

The same applies to LLMs. You can run benchmarks that ask them to one-shot different kinds of gotcha questions (car wash, strawberry and other idiotic ones) or get them to write different kinds of programs.

But that doesn't measure the UX of doing so at all. How many times do you actually need any of those when you're actually working?

It's like unit testing an application. Every function can have 100% test coverage and the app can still be shit because there are things you can't unit test for.

> You can't measure "feels".

One can always measure whatever one wonders about. That doesn't mean the measure will be trustworthy, and anything built on it may be, at best, no worse than wet-finger judgement.

Feels are just opinions and taste. It's like art and music: you can't reduce either to a mathematical formula or an absolute test of what is good.

Even songs that break the "rules" of music can be subjectively good, either because they broke the rules or despite it.

Or with cars: a car that's beautiful to one person is the ugliest piece of trash on the street to another. Some people want a super soft ride where their espresso martini doesn't even vibrate when gunning it down a gravel road, and others want to feel every grain of sand on the asphalt in their buttocks. Neither is "correct" and there is no objective measurement for ride comfort.

Maybe someone can devise a distributed benchmarking system where multiple people collaborate on tests and also vet each other's tests and ratings without revealing them to the public.

I have my own "interview questions" for models where I give them a premade Git repo and a problem to solve. Then, I rate them like a teacher. I believe others do that as well, so we only need a reliable system to aggregate these results.
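As a rough idea of what the aggregation step could look like: every "teacher" grades on their own scale, so one option is to normalise each person's grades to z-scores before averaging per model. This is just a sketch under my own assumptions; the rater names, model names, and grades below are invented, and a real system would also need the vetting and secrecy parts.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict


def aggregate(grades: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """grades[rater][model] is a raw grade; returns the mean per-rater z-score for each model."""
    per_model = defaultdict(list)
    for rater, scores in grades.items():
        values = list(scores.values())
        mu = mean(values)
        sigma = pstdev(values)
        for model, raw in scores.items():
            # A rater who gives every model the same grade carries no ranking signal.
            z = 0.0 if sigma == 0 else (raw - mu) / sigma
            per_model[model].append(z)
    return {model: mean(zs) for model, zs in per_model.items()}


# Invented example: a harsh grader and a generous one who agree on the ordering.
grades = {
    "harsh_teacher": {"model_a": 2.0, "model_b": 4.0},
    "generous_teacher": {"model_a": 8.0, "model_b": 10.0},
}
print(aggregate(grades))  # {'model_a': -1.0, 'model_b': 1.0}
```

Averaging raw grades would let the generous grader dominate; normalising first means only each person's relative ordering and spacing counts. It does nothing about raters who test completely different things, which is probably the harder part.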

The problem with proprietary models behind APIs is that they could have saved your benchmark for future training though.

The only way to make it fair is to have the model provider give some benchmarking org the weights + inference engine, so that the model can be run in complete isolation and no information about the benchmark is leaked.

Though I guess for a 'random' person's benchmark that hides among all the other requests, it's probably ok.