I should have been clearer. Those are NOT the direct prompts; they are the starter prompts. In fact, that's why the attempt numbers change: we adapt the exact prompts depending on the model.

I understood that much, at least from the description you added on the Kontext result. I agree that you should provide more information here, though, especially around "we adapt the exact prompts depending on the model", since your strategy here could also reflect model strengths and weaknesses.

Good point! Perhaps I should add the "final model-specific prompt" for each model, or place them in an errata section.

By the way, this is what I got from Kontext after just a couple of tries: https://i.imgur.com/J4LwkVI.png

Prompt: "Keeping the glass and the hand behind the glass the same, please change only the three brown candies in the glass into green, yellow, red, and orange candies. Make no other changes. Change the reflection to remove the brown candy too." Seed was 1070229954903864, but your setup is probably too different for that to help.

It seems like Gemini 2.5 Flash was the only model that successfully removed the reflections. It should get some points for that!
