How can the community tell if models overfit to these benchmarks?
By looking at a composition of evals rather than any single one, plus secondary metrics like parameter count and token cost.
Not perfect, but useful.