A "worst image" competition, instead of a "best image" one, might be easy to implement and quite indicative of which model gives the less frustrating experience.

OP here. That's kind of the idea behind listing the number of attempts alongside failures/successes. It's a loose metric for how "compliant" a model is - i.e. how much work you have to put in to get a nominally successful result.
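
If it helps, here is a rough sketch of how attempt counts and pass/fail results could be boiled down to a couple of numbers per model. Everything in it is an assumption for illustration: the `PromptResult` record, the `compliance_summary` helper, and the choice of "attempts per success" as the summary figure are all hypothetical, not how the comparison was actually scored.

```python
from dataclasses import dataclass


@dataclass
class PromptResult:
    """Hypothetical record for one prompt tried against one model."""
    attempts: int   # how many generations were tried for this prompt
    success: bool   # whether any attempt was judged a nominal success


def compliance_summary(results):
    """Summarize how much work a model needed across a set of prompts."""
    total_attempts = sum(r.attempts for r in results)
    successes = sum(1 for r in results if r.success)
    return {
        # Fraction of prompts that eventually produced an acceptable image.
        "success_rate": successes / len(results) if results else 0.0,
        # Average attempts spent per successful prompt; failed prompts still
        # count toward the attempts, so a stubborn model scores worse.
        "attempts_per_success": (
            total_attempts / successes if successes else float("inf")
        ),
    }


# Made-up example numbers, not taken from any real comparison.
example = [PromptResult(1, True), PromptResult(4, True), PromptResult(8, False)]
summary = compliance_summary(example)
```

Lower `attempts_per_success` would loosely mean a more compliant model, though it says nothing about how good the successful images look.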