I set it up as a joke, to make fun of all of the other benchmarks. To my surprise it ended up being a good measure of the quality of the model for other tasks (up to a certain point at least), though I've never seen a convincing argument as to why.

I gave a talk about it last year: https://simonwillison.net/2025/Jun/6/six-months-in-llms/

It should not be treated as a serious benchmark.

What it has going for it is human interpretability.

Anyone can look and decide if it’s a good picture or not. A numeric benchmark, by contrast, doesn’t tell you much if you aren’t already familiar with that benchmark and how it’s constructed.