There is definitely room for improvement: https://gist.github.com/simonw/88eecc65698a725d8a9c1c918478a...
Especially when it comes to detailed outputs or non-standard prompts.
I do believe it will get even better. I'm not sure it will happen within a year, but I wouldn't be incredibly surprised if it did.
Yep. “Where’s Waldo” has been a classic challenge for generative models for a while because it requires understanding the entire concept (there’s only one Waldo), while also holding up to scrutiny when you examine any individual, ordinary figure.
I experimented with procedurally generating Waldo-style scavenger images using Flux models, with rather disappointing (if unsurprising) results.
That's a good example, actually.
If you asked me what I expected, since this one has "thinking", it's that it would have thought to generate the image without Waldo first, then insert Waldo somewhere into that image as an "edit".
I wonder if at this point you could just ask the agent to iteratively refine the image in smaller portions.
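For what it's worth, the "smaller portions" idea could be sketched roughly like this. It only does the bookkeeping: split the canvas into overlapping tiles and hand each one to an edit step. Everything here is an assumption for illustration: `tile_boxes`, `refine_in_tiles`, and especially the `refine_tile` callback are hypothetical names, and the callback stands in for whatever edit call the agent or model would actually expose.

```python
from typing import Callable, List, Tuple, TypeVar

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
Image = TypeVar("Image")         # whatever image object the caller uses


def _starts(length: int, tile: int, step: int) -> List[int]:
    """Start offsets along one axis so that tiles cover [0, length)."""
    starts = []
    pos = 0
    while True:
        starts.append(pos)
        if pos + tile >= length:  # this tile already reaches the edge
            break
        pos += step
    return starts


def tile_boxes(width: int, height: int, tile: int, overlap: int = 0) -> List[Box]:
    """Split a width x height canvas into tiles, row by row.

    Neighbouring tiles share `overlap` pixels so an edit step has some
    context across seams. Tiles at the right/bottom edge are clipped.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("need tile > 0 and 0 <= overlap < tile")
    step = tile - overlap
    return [
        (left, top, min(left + tile, width), min(top + tile, height))
        for top in _starts(height, tile, step)
        for left in _starts(width, tile, step)
    ]


def refine_in_tiles(
    image: Image,
    width: int,
    height: int,
    refine_tile: Callable[[Image, Box], Image],
    tile: int = 512,
    overlap: int = 64,
) -> Image:
    """Apply a hypothetical per-region edit to every tile in turn.

    `refine_tile(image, box)` is a placeholder for a real edit call; it
    must return the updated image so later tiles see earlier edits.
    """
    for box in tile_boxes(width, height, tile, overlap):
        image = refine_tile(image, box)
    return image
```

Whether a model keeps the scene globally consistent (one Waldo, no duplicates) when edited tile by tile is exactly the open question, so this is the plumbing only, not a claim that the approach works.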