I did an inpainting project for a client a few years ago. They were trying to inpaint banner ads for concert promoters, and find a way to make it easy to produce a bunch of different-sized ads for a variety of placements. I was tasked with inpainting Xmas-themed ads for a few major singers.
The weirdest thing was when the inpainting tool added strange people to an image. This singer was all decked out in tinsel and red, and the inpainting model added a grumpy old man in a top hat. I don't recall clicking the "Add creepy old man" button.
At the time this was Stable Diffusion on the backend, run by a variety of model hosting services, Amazon being one. They all had different requirements for the input image and that made things really complex. For some the aspect ratio was impossible to meet, and it would fail if the banner was 200x60. For others, you had to resize it before input, which meant you were adding an image with poor resolution to start. Garbage in, garbage out.
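One possible workaround for the size and aspect-ratio limits is to pad the banner onto a larger canvas instead of resizing it, then crop the result back afterwards. Here is a minimal sketch of the arithmetic only. The limits (`min_side`, `max_aspect`, `multiple`) are made-up placeholders, not the requirements of any particular hosting service, and `padded_canvas` is a hypothetical helper name.

```python
from math import ceil


def padded_canvas(width, height, min_side=64, max_aspect=2.0, multiple=8):
    """Return ((canvas_w, canvas_h), (offset_x, offset_y)).

    The canvas is at least as large as the input and satisfies the
    hypothetical limits: each side >= min_side, long/short <= max_aspect,
    and both sides divisible by `multiple`.
    """
    def round_up(value):
        # Smallest multiple of `multiple` that is >= value.
        return ceil(value / multiple) * multiple

    canvas_w = round_up(max(width, min_side))
    canvas_h = round_up(max(height, min_side))

    # Grow the short side until the aspect ratio is within the limit.
    if canvas_w > max_aspect * canvas_h:
        canvas_h = round_up(canvas_w / max_aspect)
    elif canvas_h > max_aspect * canvas_w:
        canvas_w = round_up(canvas_h / max_aspect)

    # Center the original image on the canvas.
    offset = ((canvas_w - width) // 2, (canvas_h - height) // 2)
    return (canvas_w, canvas_h), offset


# The 200x60 banner from above, under the placeholder limits.
canvas, offset = padded_canvas(200, 60)
print(canvas, offset)
```

The padded area would have to be masked out or cropped away after the model runs, so this only avoids the upfront resize; it does not make the model any better at wide, short images.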
All of this to say, there was a lot of preproduction that went into it, and the client never ended up using my attempts.
> This singer was all decked out in tinsel and red, and the inpainting model added a grumpy old man in a top hat. I don't recall clicking the "Add creepy old man" button.
Obvious reference to the Dickens story A Christmas Carol. In the UK there's a bylaw that requires Christmassy events to hire a Scrooge-like figure to lurk in the background so people keep their enthusiasm in check.
> At the time this was Stable Diffusion on the backend
The community-made models (merges, fine-tunes, etc.) of that era are all completely overtrained and optimized for portraits and frontal shots. They would try to make a person out of anything. Inpainting faces is already a chore, even with a lot of tooling around that, but inpainting anything else is almost impossible. These models are also especially bad at fitting objects naturally into scenes. You can make a crappy necklace or belt work, but introducing a new object into a scene just fails in endless ways.
They also work much better at 512x512, and any larger deviation from that resolution introduces more problems.
Considering you wanted to inpaint banner ads, they would probably get distorted heavily. Those models can't deal with fonts and are bad at pixel-perfect transfers. The only viable way to do this, at that time, would be to manually insert the banner ads and fix the seams with AI. That requires some artistic skill, of course.
Your attempt was bold, but with the expectation of just supplying two images and letting the models do it, it was impossible.
> For others, you had to resize it before input, which meant you were adding an image with poor resolution to start.
That's because small models like SD (Stable Diffusion) are trained on very specific resolutions. It's the fancier models that are trained on higher-quality or more diverse sets of resolutions. If you use a higher-quality model to generate lower-resolution images, what's actually happening is you're trimming a much bigger image and getting a chunk of it as output. At least that's how it feels based on my many hours of experimenting. If I use major models and try to center a thing, I never see it in the center. :) My GPU can only handle so much.
So traditionally, the way you’d do this (and why some UIs like automatic1111 let you configure inpainting so flexibly) is that you didn’t have to shrink the entire image.
The general idea was: you mask the area you want changed, and the model inpaints that region at full resolution. The advantage of masking, compared to plain img2img, is that you’re not sending the entire picture to the model.
With the classic setups like SD 1.5 and SDXL, you’d effectively inpaint at full resolution: take the masked area from a larger image, scale just that region to the model’s native resolution (around 512x512 for SD 1.5, roughly 1 megapixel for SDXL), process it there, then scale it back and composite it into the original. This lets you add MORE detail.
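To make the crop, scale, and composite steps concrete, here is a rough sketch of just the geometry. It is not how automatic1111 actually implements it: `plan_masked_inpaint`, `padding`, and the default `native_side` are illustrative assumptions. Snapping to a multiple of 8 reflects the 8x downsampling of the SD latent space.

```python
def plan_masked_inpaint(image_size, mask_bbox, native_side=512,
                        padding=32, multiple=8):
    """Return (crop_box, work_size) for an 'inpaint only the mask' pass.

    crop_box  -- (left, top, right, bottom) in original-image pixels
    work_size -- (width, height) the crop is resized to before the model runs
    """
    img_w, img_h = image_size
    left, top, right, bottom = mask_bbox

    # Grow the mask's bounding box so the model sees some surrounding
    # context, clamped to the image edges.
    left = max(0, left - padding)
    top = max(0, top - padding)
    right = min(img_w, right + padding)
    bottom = min(img_h, bottom + padding)
    crop_w, crop_h = right - left, bottom - top

    # Scale so the longer side of the crop matches the model's native side.
    scale = native_side / max(crop_w, crop_h)

    def snap(value):
        # Nearest multiple of `multiple`, never smaller than one multiple.
        return max(multiple, round(value * scale / multiple) * multiple)

    return (left, top, right, bottom), (snap(crop_w), snap(crop_h))


# A 200x200 mask in the middle of a 2000x600 banner.
crop_box, work_size = plan_masked_inpaint((2000, 600), (900, 200, 1100, 400))
print(crop_box, work_size)
```

The model would only ever see the `work_size` crop. Its output gets resized back to the `crop_box` dimensions and pasted over the original through the mask, so the rest of the banner keeps its original pixels.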
Unfortunately if the OP is using hosted SD models, they might not have that granular control and thus would suffer pretty bad quality loss.
I realized I was speaking more in general, not just strictly about inpainting, but yeah, that makes sense. That said, I've also had inpainting limited by the image being too big for my GPU to handle. I may be using it incorrectly though. I haven't really experimented with much of that in a while, maybe when I get a newer gaming rig.
Yeah, the landscape also changes a lot. It’s just really hard to keep up with everything, especially if you’re using it casually, because some of the UI wrappers (the Gradio-based ones) have more obscure knobs and dials than a TI‑82 calculator.
This is the image I always think of when first introducing someone to ComfyUI or even Automatic1111.
https://imgur.com/a/G0Xlznj