I would have assumed anyone frequenting HN would have figured out by now that benchmarks are 100% bullshit. I guess I'd be wrong.

I think anyone frequenting HN and actually using these tools absolutely knows these benchmarks are 100% bullshit and the only real way to test these things is to just use them yourself.

Many small models are supposedly good for controlled tasks, but given a detailed prompt, I can't get any of them to follow simple instructions. They usually just regurgitate the examples in the system prompt. Useless.

So what do you propose? Gut feel, N=1 tests?

At the moment, the only way you can tell if the model is good for a particular task is by trying it at that task. Gut feel is how you pick the models to test first, and that is also based largely on past experience and educated guesses as to what strengths translate between tasks.
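
For what it's worth, "trying it at that task" doesn't have to be elaborate. A throwaway script like the one below already tells you more than a leaderboard; it assumes an OpenAI-compatible local endpoint (e.g. llama.cpp or Ollama), and the URL, model names, and test cases are placeholders for your own:

    # Minimal sketch of a "just try it on your task" eval.
    # Assumes an OpenAI-compatible endpoint; the URL, model names,
    # and cases below are placeholders, not real recommendations.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    # Hand-written cases from the task you actually care about, not a public benchmark.
    CASES = [
        {"prompt": "Extract the invoice total from: 'Total due: $1,284.50'. Reply with the number only.",
         "expect": "1284.50"},
        {"prompt": "Reply with only the ISO 8601 date for 'March 3rd, 2024'.",
         "expect": "2024-03-03"},
    ]

    def run(model: str) -> float:
        hits = 0
        for case in CASES:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": case["prompt"]}],
                temperature=0,
            )
            out = resp.choices[0].message.content or ""
            hits += case["expect"] in out  # crude containment check; swap in your own scoring
        return hits / len(CASES)

    for m in ["small-model-a", "small-model-b"]:  # placeholder model names
        print(m, run(m))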

You should also remember that there's no free lunch. If you see models below a certain size fail consistently, don't expect a model that is even smaller to somehow magically succeed, no matter how much pixie dust the developer advertises.

[deleted]

it currently beats depending on the benchmarks

I mean, people say the same thing in other domains.

If you asked "What's the best bicycle?", most enthusiasts would say the one you've actually tried, the one that works for your use case, etc.

Benchmarks should only be used to prune which models you try at the absolute highest level, because at the end of the day it's way too easy to hack them without breaking any rules (post-train on the public set, generate a ton of synthetic examples, train on those, repeat).
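
To make the "without breaking any rules" part concrete, the whole loop fits in a few lines. This is a purely hypothetical sketch: load_public_split, finetune, and generate_similar are empty stand-ins, not any real training or benchmark API.

    # Hypothetical sketch of the benchmark-gaming loop described above; the
    # helpers are made-up stand-ins, not any real library's functions.
    def load_public_split(name):          # stand-in: fetch the publicly released questions
        return ["public question 1", "public question 2"]

    def finetune(model, data):            # stand-in: post-train the model on data
        return model

    def generate_similar(model, data):    # stand-in: synthesize lookalike examples
        return ["synthetic question"] * len(data)

    def game_the_benchmark(model, rounds=3):
        public = load_public_split("some-benchmark")
        for _ in range(rounds):
            model = finetune(model, public)              # post-train on the public set
            synthetic = generate_similar(model, public)  # generate lookalike examples
            model = finetune(model, synthetic)           # train on those, repeat
        return model  # the benchmark score climbs; generalization to your task doesn't

None of that touches a private test set or violates any stated rule, which is exactly the problem.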