Dataset.

To train these models you need inputs and expected outputs. For text-image pairs, vast amounts of data exist (billions of pairs). The models are trained to take text plus noise and output a denoised image.
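A minimal sketch of what one such training step can look like, in PyTorch. The `unet` and `text_encoder` arguments are hypothetical stand-ins for the real model components, and the noise schedule is deliberately simplified:

```python
import torch
import torch.nn.functional as F

# One denoising training step (rough sketch, not a real implementation).
# `unet` predicts the noise that was added; `text_encoder` embeds the caption.
def training_step(unet, text_encoder, image, caption_tokens, num_timesteps=1000):
    text_emb = text_encoder(caption_tokens)                   # condition on the text
    t = torch.randint(0, num_timesteps, (image.shape[0],))    # random noise level per sample
    alpha = (1.0 - t.float() / num_timesteps).view(-1, 1, 1, 1)  # toy linear schedule
    noise = torch.randn_like(image)                           # Gaussian noise to add
    noisy = alpha.sqrt() * image + (1 - alpha).sqrt() * noise  # forward diffusion
    pred_noise = unet(noisy, t, text_emb)                     # model predicts the added noise
    return F.mse_loss(pred_noise, noise)                      # denoising objective
```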

Datasets of sketch-image pairs are significantly smaller, but you can finetune an already-trained text->image model on the smaller dataset by conditioning it on a sketch alongside the noise (or on anything else, really). The quality of the finetuned model's output will depend heavily on the base text->image model. You only need a few thousand samples to create a decent (but not excellent) finetune.
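One common recipe for this kind of conditioning (similar in spirit to how the inpainting variants of Stable Diffusion extend the base model) is to concatenate the sketch's channels onto the noisy input, so the pretrained weights only have to learn to use the extra signal. A hedged sketch, reusing the step above; here `unet` is assumed to have had its first convolution widened to accept the extra channels:

```python
import torch
import torch.nn.functional as F

# Sketch-conditioned finetuning step (rough sketch). The conditioning image
# rides along as extra input channels; the denoising objective is unchanged.
def finetune_step(unet, text_encoder, image, sketch, caption_tokens, num_timesteps=1000):
    text_emb = text_encoder(caption_tokens)
    t = torch.randint(0, num_timesteps, (image.shape[0],))
    alpha = (1.0 - t.float() / num_timesteps).view(-1, 1, 1, 1)
    noise = torch.randn_like(image)
    noisy = alpha.sqrt() * image + (1 - alpha).sqrt() * noise
    model_in = torch.cat([noisy, sketch], dim=1)  # sketch appended as extra channels
    pred_noise = unet(model_in, t, text_emb)
    return F.mse_loss(pred_noise, noise)
```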

You can even do this without finetuning the base model at all, by training a separate network that applies on top of the base text->image model's weights. This gives you a model that can essentially wear many hats and do all kinds of image transformations without affecting the performance of the base model. These are called ControlNets and are popular with the Stable Diffusion family of models, but the general technique can be applied to almost any model.
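With the diffusers library this looks roughly like the following: a scribble ControlNet is loaded on top of an unmodified Stable Diffusion checkpoint. The checkpoint IDs are public ones at the time of writing, but treat the exact names as an assumption, and the input filename is hypothetical:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Load a sketch/scribble ControlNet and attach it to a frozen base model.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The scribble ControlNet expects a scribble-style control image
# (white strokes on black). "my_sketch.png" is a hypothetical input.
sketch = load_image("my_sketch.png")
image = pipe("a watercolor house by a lake", image=sketch).images[0]
image.save("result.png")
```

Swapping in a different ControlNet (depth, pose, edges) changes the transformation without touching the base model's weights, which is the "many hats" property described above.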

These datasets would definitely contain a lot of text->sketch pairs as well. I wonder if it's possible to extrapolate from text->sketch and text->image pairs to improve sketch->image capabilities. The models must already be doing something like this implicitly.