They don't need the models to be released (in the sense of having a copy of the binary) to evaluate them. The costly model training is useless unless access is given to people who pay for it.
The models from OpenAI, Anthropic and other major companies are all available either for free or for a small fee ($200 for OpenAI Pro). Anyone who can pay this, can run private tests and compare scores. So the public does not need to rely on OpenAI's benchmark claims, which are based on its pre-release arrangements with testing companies.
> Anyone who can pay this, can run private tests and compare scores.
Yes, by uploading the tests to a server controlled by OpenAI/Anthropic/etc.
These servers are fielding millions of queries. The test questions are a small part of them. Further, the server doesn't know whether it got the right answer or not, so it can't even train on them. In the pre-release arrangement with the testing companies, by contrast, they potentially can, as they are given the scores.