This is so much more useful than synthetic benchmarks. The most important column here isn't pass/fail; it's attempts. In production, a model that gets it right in 2 attempts is 10x more valuable than one that needs 20 iterations of prompt engineering, because attempts are a direct measure of cost and predictability.

Seedream 4 won on points, but Gemini seemed more steerable and required less fighting on many of the tasks.