We know OpenAI got caught obtaining benchmark data and tuning their models to it already. So the answer is a hard no. I imagine over time it gives a general view of the landscape and improvements, but take it with a large grain of salt.
Are you referring to FrontierMath?
We had access to the eval data (since we funded it), but we didn't train on the data or otherwise cheat. We didn't even look at the eval results until after the model had been trained and selected.
No one believes you.
If you don't believe me, that's fair enough. Some pieces of evidence that might update you or others:
- a member of the team who worked with this eval has left OpenAI and now works at a competitor; if we cheated, he would have every incentive to whistleblow
- cheating on evals is fairly easy to catch and risks destroying employee morale, customer trust, and investor appetite; even if you're evil, the cost-benefit doesn't really pencil out to cheat on a niche math eval
- Epoch made a private held-out set (albeit with a different difficulty); OpenAI performance on that set doesn't suggest any cheating/overfitting
- Gemini and Claude have since achieved similar scores, suggesting that scoring ~40% is not evidence of cheating with the private set
- The vast majority of evals are open-source (e.g., SWE-bench Pro Public), and OpenAI along with everyone else has access to their problems and the opportunity to cheat, so FrontierMath isn't even unique in that respect
The same thing happened with Meta researchers and Llama 4, and it shows what can go wrong when supposedly independent researchers begin to game AI benchmarks. [0]
You always have to question these benchmarks, especially when in-house researchers can game them if they want to.
Which is why the evaluation must be independent.
[0] https://gizmodo.com/meta-cheated-on-ai-benchmarks-and-its-a-...