It would unfortunately also need several runs of each to be reliable. There's nothing in TFA to indicate the results shown aren't to a large degree affected by random chance!
(I do think from personal benchmarks that Gemini 3 is better for the reasons stated by the author, but a single run from each is not strong evidence.)
TFA says multiple times that the results are affected by random chance.
Yes, but recognising that is only the first step. Quantifying the variance is the next step, and that is what I find missing from the article.
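For what it's worth, here is a rough sketch of what quantifying the variance could look like, in Python. The function name `summarize_runs` and the scores are made up purely for illustration; none of the numbers come from TFA or from any real benchmark. It assumes each run of a model can be reduced to a single numeric score.

```python
import math
import statistics


def summarize_runs(scores):
    """Summarize repeated runs of one model on the same task.

    Returns (mean, sample standard deviation, standard error of the mean).
    Needs at least two runs, otherwise the spread cannot be estimated.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation (n - 1)
    stderr = stdev / math.sqrt(len(scores))
    return mean, stdev, stderr


# Hypothetical placeholder scores, NOT real results for any model.
runs_a = [7.0, 8.0, 9.0]
runs_b = [6.0, 8.0, 10.0]

for name, runs in (("model A", runs_a), ("model B", runs_b)):
    mean, stdev, stderr = summarize_runs(runs)
    print(f"{name}: mean={mean:.2f} stdev={stdev:.2f} stderr={stderr:.2f}")
```

If the gap between the two means is small compared with the standard errors, a single run from each tells you very little. With only a handful of runs these estimates are themselves noisy, so this is a sanity check rather than a proper significance test.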