How can the community tell if models overfit to these benchmarks?

By the composition of evals: a model that spikes on one benchmark but lags on related ones in the same suite is probably overfitting to it. Plus secondary metrics like parameter size and token cost.
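A rough sketch of that kind of check, assuming you only have a leaderboard of per-benchmark scores (all model names and numbers below are made up for illustration): rank each model within every benchmark, then flag pairs where a model ranks far better on one benchmark than its typical rank across the rest of the suite.

```python
from statistics import median

# Hypothetical leaderboard: benchmark -> {model: score}. Illustrative only.
scores = {
    "bench_a": {"model_x": 82.0, "model_y": 75.0, "model_z": 70.0},
    "bench_b": {"model_x": 61.0, "model_y": 74.0, "model_z": 69.0},
    "bench_c": {"model_x": 60.0, "model_y": 73.0, "model_z": 71.0},
}

def ranks_per_benchmark(scores):
    """Return benchmark -> {model: rank}, where rank 1 is the best score."""
    out = {}
    for bench, by_model in scores.items():
        ordered = sorted(by_model, key=by_model.get, reverse=True)
        out[bench] = {m: i + 1 for i, m in enumerate(ordered)}
    return out

def flag_possible_overfits(scores, gap=1):
    """Flag (model, benchmark) pairs where the model ranks much better than
    its median rank across the whole suite -- a hint, not proof, of overfitting."""
    ranks = ranks_per_benchmark(scores)
    models = {m for by_model in scores.values() for m in by_model}
    flags = []
    for m in models:
        model_ranks = [ranks[b][m] for b in scores]
        med = median(model_ranks)
        for b in scores:
            if med - ranks[b][m] >= gap:  # much better here than usual
                flags.append((m, b))
    return flags

print(flag_possible_overfits(scores))  # -> [('model_x', 'bench_a')]
```

The same comparison can be repeated with the secondary metrics: a score that only looks good before you account for parameter size or token cost tells its own story.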

Not perfect, but useful.