The paper says the professors have a median of 200 comparisons each. It also says they only used 2 models because using more would require more comparisons, and that they selected Google models because Google was branded/advertised as being education focused. When you see other models show up elsewhere, that's because they extended the main idea to other models, but with LLMs as judges instead of human professors.

Sure, but the biggest problem is they have no statistical significance. Variance is too high. How do you distinguish the signal from the noise? Confidence intervals aren't enough.

But is it a surprise law professors aren't great statisticians?

I disagree. 16 isn't necessarily the relevant N here; the number of responses is.

If you have 100 responses from 1 professor and the AI wins 75% of the time, that is very likely a true signal that the AI is better than this prof. It would be incorrect to generalize this to all profs, though.
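To put a rough number on the single-professor case: a minimal sketch of an exact one-sided binomial test, assuming the 100 comparisons are independent and that "no difference" means a 50% win rate. The function name binom_tail is just made up for this illustration, and neither assumption comes from the paper.

```python
from math import comb

def binom_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Assumed null hypothesis: AI and professor are equally good (p = 0.5).
# Observed: AI wins 75 of 100 comparisons against one professor.
p_value = binom_tail(75, 100, 0.5)
print(f"P(at least 75 wins out of 100 | p = 0.5) = {p_value:.2e}")
```

The tail probability comes out far below any usual significance threshold, but it only says something about this one professor, not about professors in general.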

Further, if you sample 16 profs and the AI beats 10 of them, you can be fairly certain that the real percentage of profs it beats isn't 10%. Also, when estimating the probability that the AI beats a random prof, it's the relative estimation error that scales with 1/sqrt(N). If you have a coin and it lands heads up 16 times in a row, that tells you something quite robust about the coin.
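The same kind of check works for the 16-prof case. A minimal sketch, assuming the profs are an independent random sample, and reading the coin example as 16 heads in 16 tosses (an assumption; the comment doesn't say how many tosses there were). binom_tail is again a hypothetical helper name.

```python
from math import comb

def binom_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# If the AI truly beat only 10% of profs, how often would it beat
# 10 or more in a random sample of 16?
p_ten_of_sixteen = binom_tail(10, 16, 0.10)

# Assumed reading of the coin example: 16 heads in 16 tosses of a fair coin.
p_sixteen_heads = 0.5 ** 16

print(f"P(>= 10 of 16 | true rate 10%) = {p_ten_of_sixteen:.2e}")
print(f"P(16 heads in 16 tosses | fair coin) = {p_sixteen_heads:.2e}")
```

Both probabilities are tiny, so N = 16 is enough to rule out hypotheses that sit far from what was observed. It is not enough to pin the true rate down tightly.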

Reasonably estimating confidence intervals at small N and high p is not trivial. But it can be done.

A good heuristic is "add 2 successes and 2 failures", which is due to Agresti & Coull.

See down the page here for source papers:

https://en.wikipedia.org/wiki/Binomial_proportion_confidence...
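The "add 2 successes and 2 failures" heuristic can be sketched as below, for an approximate 95% interval (z = 1.96, so z squared is roughly 4, which is where the 2 + 2 comes from). The 10-out-of-16 input is borrowed from the discussion above purely as an illustration, and the function name is made up.

```python
from math import sqrt

def agresti_coull_95(successes, n):
    """Approximate 95% interval via the 'add 2 successes and 2 failures' rule."""
    z = 1.96                          # normal quantile for a 95% interval
    n_adj = n + 4                     # two extra successes plus two extra failures
    p_adj = (successes + 2) / n_adj   # adjusted point estimate
    half_width = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clamp to [0, 1], since the normal approximation can overshoot.
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)

# Illustrative input: the AI beats 10 of 16 sampled profs.
low, high = agresti_coull_95(10, 16)
print(f"approx. 95% interval: {low:.3f} to {high:.3f}")
```

For 10 of 16 the interval is wide, roughly 0.39 to 0.81, which fits both points made above: 10% is clearly excluded, but the true rate is still poorly pinned down.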

I think it is more likely that they selected Gemini because the lead author is a fellow at an institute that receives a lot of its funding from Google.