It's generally one-shot-only - whatever comes out the first time is what I go with.
I've been contemplating a fairer version where each model gets 3-5 attempts and can then select which rendered image is "best".
Try llm-consortium with --judging-method rank
I think it would make the results way better and more representative of model abilities.
It would... but the test is inherently silly, so I'm still not sure it's worth investing that extra effort in it.