Dataset.

To train these models you need inputs and expected outputs. For text-image pairs, vast amounts of data exist (billions of pairs). The models are trained to take text plus noise and output a denoised image.
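A minimal sketch of what one such training step can look like, in PyTorch. The `unet` and `text_encoder` arguments are hypothetical stand-ins for the real model components, and the noise schedule is deliberately simplified:

```python
import torch
import torch.nn.functional as F

# One denoising training step (rough sketch, not a real implementation).
# `unet` predicts the noise that was added; `text_encoder` embeds the caption.
def training_step(unet, text_encoder, image, caption_tokens, num_timesteps=1000):
    text_emb = text_encoder(caption_tokens)                   # condition on the text
    t = torch.randint(0, num_timesteps, (image.shape[0],))    # random noise level per sample
    alpha = (1.0 - t.float() / num_timesteps).view(-1, 1, 1, 1)  # toy linear schedule
    noise = torch.randn_like(image)                           # Gaussian noise to add
    noisy = alpha.sqrt() * image + (1 - alpha).sqrt() * noise  # forward diffusion
    pred_noise = unet(noisy, t, text_emb)                     # model predicts the added noise
    return F.mse_loss(pred_noise, noise)                      # denoising objective
```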

Datasets of sketch-image pairs are significantly smaller, but you can finetune an already-trained text->image model on the smaller dataset by conditioning it on a sketch alongside the noise (or on anything else, really). The quality of the finetuned model's output will depend heavily on the base text->image model. You only need a few thousand samples to create a decent (but not excellent) finetune.
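One common recipe for this kind of conditioning (similar in spirit to how the inpainting variants of Stable Diffusion extend the base model) is to concatenate the sketch's channels onto the noisy input, so the pretrained weights only have to learn to use the extra signal. A hedged sketch, reusing the step above; here `unet` is assumed to have had its first convolution widened to accept the extra channels:

```python
import torch
import torch.nn.functional as F

# Sketch-conditioned finetuning step (rough sketch). The conditioning image
# rides along as extra input channels; the denoising objective is unchanged.
def finetune_step(unet, text_encoder, image, sketch, caption_tokens, num_timesteps=1000):
    text_emb = text_encoder(caption_tokens)
    t = torch.randint(0, num_timesteps, (image.shape[0],))
    alpha = (1.0 - t.float() / num_timesteps).view(-1, 1, 1, 1)
    noise = torch.randn_like(image)
    noisy = alpha.sqrt() * image + (1 - alpha).sqrt() * noise
    model_in = torch.cat([noisy, sketch], dim=1)  # sketch appended as extra channels
    pred_noise = unet(model_in, t, text_emb)
    return F.mse_loss(pred_noise, noise)
```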

You can even do this without finetuning the base model at all, by training a separate network that applies on top of the base text->image model's weights. This gives you a model that can essentially wear many hats and do all kinds of image transformations without affecting the performance of the base model. These are called ControlNets and are popular with the Stable Diffusion family of models, but the general technique can be applied to almost any model.
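With the diffusers library this looks roughly like the following: a scribble ControlNet is loaded on top of an unmodified Stable Diffusion checkpoint. The checkpoint IDs are public ones at the time of writing, but treat the exact names as an assumption, and the input filename is hypothetical:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Load a sketch/scribble ControlNet and attach it to a frozen base model.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The scribble ControlNet expects a scribble-style control image
# (white strokes on black). "my_sketch.png" is a hypothetical input.
sketch = load_image("my_sketch.png")
image = pipe("a watercolor house by a lake", image=sketch).images[0]
image.save("result.png")
```

Swapping in a different ControlNet (depth, pose, edges) changes the transformation without touching the base model's weights, which is the "many hats" property described above.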

These datasets would definitely contain a lot of text->sketch pairs as well. I wonder if it's possible to extrapolate from text->sketch and text->image pairs to improve sketch->image capabilities. The models must already be doing something like this implicitly.