This might be inherent to how the models are benchmarked.
Don’t some benchmarks give the model multiple attempts at a problem and keep only the successful result if one appears, ignoring the failure rate?
Good point. The metrics should report the mean success rate alongside the “any 1 of 10” and “all 10 of 10” success rates, so that we can estimate reliability from the last one.
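To make the three numbers concrete, here is a rough sketch of how they could be computed from raw attempt logs. The function name `summarize`, the input layout (one list of pass/fail booleans per problem), and the sample data are all made up for illustration; no particular benchmark is known to report results this way.

```python
from typing import Dict, List


def summarize(results: List[List[bool]]) -> Dict[str, float]:
    """Compute three success rates from per-problem attempt outcomes.

    results[i][j] is True if attempt j on problem i succeeded.
    (Hypothetical layout, assumed for this sketch.)
    """
    if not results or any(len(attempts) == 0 for attempts in results):
        raise ValueError("need at least one problem and one attempt per problem")

    n = len(results)
    # Mean: average per-attempt success rate, averaged over problems.
    mean_rate = sum(sum(a) / len(a) for a in results) / n
    # "Any": fraction of problems solved in at least one attempt.
    any_rate = sum(any(a) for a in results) / n
    # "All": fraction of problems solved in every attempt (reliability).
    all_rate = sum(all(a) for a in results) / n
    return {"mean": mean_rate, "any": any_rate, "all": all_rate}


# Invented data: 4 problems, 10 attempts each.
sample = [
    [True] * 10,               # solved every time
    [True] * 5 + [False] * 5,  # solved half the time
    [False] * 10,              # never solved
    [True] + [False] * 9,      # solved once
]
stats = summarize(sample)
print(stats)
```

On this invented sample the “any 1 of 10” rate is 3 of 4 problems while the “all 10 of 10” rate is only 1 of 4, which is exactly the gap that disappears when a benchmark keeps only the successful attempt.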