Even the original Stable Diffusion app had image-to-image. It just didn't work as well. I'm not sure why this is supposed to be novel.
It’s obviously not a new model capability. But using this well-known, existing capability to solve this particular issue is only obvious after the fact.
It’s a useful trick to have in one’s toolbox, and I’m grateful to the author for sharing it.
It's not novel in the sense that nobody knew about img2img. It's novel in the sense that nobody thought of using img2img to solve this problem in this way.
It's novel if you've never played with img2img, especially its various (text+img)2img forms, or never tried editing images by text prompt in recent multimodal LLMs.
That said, I spent plenty of time doing both, and yet it would probably take me a while to arrive at this approach. For some reason, the "draw a sketch, have a model flesh it out" approach got bucketed with Stable Diffusion in my mind, and multimodal LLMs with "take detailed content, make targeted edits to it". So I'm glad the OP posted it.
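For anyone who hasn't tried it, here's roughly what the "draw a sketch, have a model flesh it out" flow looks like with Hugging Face's diffusers img2img pipeline. Just a sketch of the idea; the model ID, prompt, and strength value are illustrative, not what the OP used:

```python
# Rough sketch of the "sketch in, detailed image out" flow using the
# diffusers img2img pipeline. Model ID and parameter values are illustrative.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load a rough hand-drawn sketch and resize it to the model's native resolution.
sketch = Image.open("rough_sketch.png").convert("RGB").resize((512, 512))

# strength controls how far the model may drift from the input image:
# low values stay close to the sketch, high values redraw more freely.
result = pipe(
    prompt="a detailed watercolor landscape, soft lighting",
    image=sketch,
    strength=0.6,
    guidance_scale=7.5,
).images[0]

result.save("fleshed_out.png")
```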
Ok, it might just be me then. I view Nvidia's DLSS as a similar thing. There was even a meme that video games would in the future only output basic geometry, with an AI layer transforming it into stunning graphics.