I take some issue with that testing methodology. It seems to me that you're conflating the model's performance with the reliability of whatever provider you're using to run the benchmark.
Many models, especially open-weight ones, are served by a variety of providers over their lifetime. Each provider has its own reliability statistics, which can drift over a model's lifetime and swing day to day, even hour to hour.
Not to mention that there are plenty of gateways that track provider uptime and can intelligently route to the one most likely to complete your request.
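That routing logic is simple to sketch. A minimal illustration, assuming the gateway keeps a rolling success rate per provider (all provider names and numbers here are hypothetical, not real measurements):

```python
import random

# Hypothetical recent success rates per provider (illustrative only).
provider_success_rates = {
    "provider-a": 0.999,
    "provider-b": 0.97,
    "provider-c": 0.85,
}

def pick_provider(rates):
    """Route to the provider with the highest recent success rate."""
    return max(rates, key=rates.get)

def pick_provider_weighted(rates, rng=random):
    """Alternative: sample providers in proportion to their success rate,
    so lower-ranked providers still see some traffic and their
    reliability stats stay fresh."""
    providers = list(rates)
    weights = [rates[p] for p in providers]
    return rng.choices(providers, weights=weights, k=1)[0]

print(pick_provider(provider_success_rates))
```

A real gateway would also decay old observations and retry on failure, but the point stands: the provider a benchmark happens to hit is a moving target, so reliability numbers say more about the serving path than the model.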