Your “benchmark” is invalid. Penalizing the model because the hosting environment is being DDoSed by users a few hours after launch is utter nonsense.
I see that you tried to justify this lower in the thread, but no… it completely invalidates your benchmark. You are not testing the model. You are conflating the reliability of one specific host with the performance of the model, and then claiming you are benchmarking the model. Every major model is hosted by multiple different services.
In the real world, clients simply retry on a server error. The retry has no impact on response quality, and the workflow the model is being used in does not fail. If a workflow is so poorly coded that it lacks even basic retry logic, that workflow is doomed no matter which host you use. But again, reliability of the host is separate from the model.
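To make that concrete, here's a minimal retry sketch in Python using the requests library. The endpoint URL, payload shape, and "output" field are all made up for illustration; the point is that transient 5xx errors get absorbed by the retry loop and never touch the model's output:

```python
import time
import requests

# Hypothetical endpoint and payload, purely for illustration.
MODEL_URL = "https://example-host.com/v1/generate"

def call_model(prompt: str, max_retries: int = 5) -> str:
    delay = 1.0
    for _ in range(max_retries):
        try:
            resp = requests.post(MODEL_URL, json={"prompt": prompt}, timeout=60)
            if resp.status_code < 500:
                resp.raise_for_status()        # 4xx is a real client error: don't retry
                return resp.json()["output"]   # success: same quality as a first-try hit
        except (requests.ConnectionError, requests.Timeout):
            pass                               # transient network failure: retry like a 5xx
        time.sleep(delay)                      # back off before the next attempt
        delay *= 2
    raise RuntimeError(f"host still failing after {max_retries} attempts")
```

Any production workflow calling a hosted model does some version of this, which is exactly why host flakiness and model quality are separate axes.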
You can make your benchmark valid by keeping separate leaderboards for model quality and host reliability. I'm not saying to throw the whole thing away, but the claim as it stands doesn't hold.
And you're also making an unsourced claim that everyone else has already determined this model sucks? Nah. The first result from Artificial Analysis looks good: https://x.com/ArtificialAnlys/status/2047547434809880611
But I am still waiting to see the results from the full suite of AA benchmarks.