The problem with proprietary models behind APIs, though, is that the provider could have saved your benchmark prompts and used them for future training.
The only way to make it fair is to have the model provider give some benchmarking org the weights + inference engine, so that the model can be run in complete isolation and no information about the benchmark is leaked.
Though I guess for a 'random' person's benchmark that's hidden among all the other requests, it's probably ok.