These can be useful for labs training models, but I don't see them as particularly valuable for building AI systems. Real-world performance depends far more on how the system is built than on the underlying LLM.

Evaluating the system you build on relevant inputs is what matters most. Beyond that, it would be nice to see benchmarks that give guidance on how an LLM should be used as a system component, not just which one is "better" at something.

My thinking is this: the moment a benchmark is public, it's probably in the training set. The real evals are the ones you build for the problem you're trying to solve, on the data you're actually using.
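A task-specific eval like that doesn't need much machinery. Here's a minimal sketch of one, where every name (`run_eval`, `exact_match`, the toy cases) is hypothetical and the "system" is just any callable you plug in:

```python
def exact_match(pred, expected):
    # Simplest possible scorer; swap in whatever metric fits your task.
    return pred.strip().lower() == expected.strip().lower()

def run_eval(system, cases):
    """Score a system (any callable: prompt chain, agent, etc.) on your own labeled cases."""
    results = [exact_match(system(c["input"]), c["expected"]) for c in cases]
    return sum(results) / len(results)

# Cases drawn from *your* problem and *your* data, not a public benchmark.
cases = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

# Stand-in for a real system, just to show the harness runs.
toy_system = lambda q: {"2+2": "4", "capital of France": "paris"}[q]
print(run_eval(toy_system, cases))  # → 1.0
```

Because the cases stay private to your project, they can't leak into anyone's training set, which is exactly the point.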