I think anyone who frequents HN and actually uses these tools knows these benchmarks are 100% bullshit, and that the only real way to test these things is to use them yourself.

Many small models are supposedly good for controlled tasks, but even with a detailed prompt I can't get any of them to follow simple instructions. They usually just regurgitate the examples from the system prompt. Useless.