I still feel that varying the prompt text, the number of tries, and the strictness, combined with only showing the result you liked most, dilutes most of the value in these tests. It would be better if there were one prompt that 8/10 human editors understood and implemented correctly, and then every model got 5 generation attempts with that exact prompt on different seeds, roughly like the sketch below. If it were about "who can create the best image with a given model" then I'd see it more, but most of it seems aimed at preventing that sort of thing, and it ends up in an awkward middle zone.
E.g. Gemini 2.5 Flash is given extreme leeway in how much it edits the image and changes the style in "Girl with Pearl Earring", while OpenAI gpt-image-1 does a (comparatively) much better job yet is still declared a failure after 8 attempts, having been given fewer attempts than Seedream 4 (which passed) and less than half the attempts of OmniGen2 (which still looks much farther off in comparison).
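Concretely, I'm picturing something like this rough sketch (the model names and the generate_image()/judge() calls are placeholders I made up, not any real API):

    # Purely illustrative sketch of the fixed-prompt idea above; the model
    # names, generate_image() and judge() are placeholders, not a real API.
    FIXED_PROMPT = "<the one prompt that 8/10 human editors got right>"
    SEEDS = [1, 2, 3, 4, 5]            # the same five seeds for every model
    MODELS = ["model_a", "model_b"]    # placeholder model identifiers

    def evaluate(generate_image, judge):
        """Give every model the identical prompt and seeds, then apply the
        same pass/fail rubric to each output instead of hand-picking one."""
        results = {}
        for model in MODELS:
            passes = sum(
                1
                for seed in SEEDS
                if judge(generate_image(model, FIXED_PROMPT, seed=seed))
            )
            results[model] = passes / len(SEEDS)  # pass rate out of 5
        return results

That way every model gets the same prompt, the same number of tries, and the same judging, and the number you report is a pass rate rather than a hand-picked best result.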
A "worst image" instead of best image competition may be easy to implement and quite indicative of which one has less frustration experience.
OP here. That's kind of the idea of listing the number of attempts alongside failures/successes. It's a loose metric for how "compliant" a model is - e.g. how much work you have to put in to get a nominally successful result.
The OpenAI gpt-image-1 example was supposed to be labeled as being from the "You Only Move Twice" test.