I am not really technical in this domain, but why is everything text-to-X?
Wouldn't it be possible to draw a rough sketch of a terrain, drop in a picture of the character, draw a 3D spline for the walk path, and, in a traditional keyframe-style editor, give certain points keyframe actions (like character A turns on his flashlight at frame 60) - in short, something that allows minute creative control just like current tools do?
Dataset.
To train these models you need inputs and expected outputs. For text-image pairs there exist vast amounts of data (in the billions). The models are trained on text + noise to output a denoised image.
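Roughly, a single training step looks like this (a minimal sketch, assuming a Stable-Diffusion-style latent model via the Hugging Face diffusers library; the repo id and tensor shapes are just illustrative and may need swapping for whatever checkpoint you actually use):

    import torch
    import torch.nn.functional as F
    from diffusers import UNet2DConditionModel, DDPMScheduler

    unet = UNet2DConditionModel.from_pretrained(
        "runwayml/stable-diffusion-v1-5", subfolder="unet")
    scheduler = DDPMScheduler.from_pretrained(
        "runwayml/stable-diffusion-v1-5", subfolder="scheduler")

    latents = torch.randn(1, 4, 64, 64)    # stand-in for an encoded training image
    text_emb = torch.randn(1, 77, 768)     # stand-in for CLIP text embeddings
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (1,))

    noisy = scheduler.add_noise(latents, noise, t)                # corrupt the image
    pred = unet(noisy, t, encoder_hidden_states=text_emb).sample  # predict the noise
    loss = F.mse_loss(pred, noise)                                # learn to denoise
    loss.backward()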
The datasets of sketch-image pairs are significantly smaller, but you can finetune an already trained text->image model on the smaller dataset by feeding it the sketch (or anything else, really) as an extra conditioning input alongside the text. The quality of the finetuned model's output will depend heavily on the base text->image model. You only need several thousand samples to create a decent (but not excellent) finetune.
You can even do it without finetuning the base model, by training a separate network that applies on top of the base text->image model's weights. This lets you have a model that can essentially wear many hats and do all kinds of image transformations without affecting the performance of the base model. These are called ControlNets and are popular with the Stable Diffusion family of models, but the general technique can be applied to almost any model.
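Using one of the public sketch/scribble ControlNets with diffusers looks roughly like this (a hedged sketch; the repo ids are commonly used public ones and the file names are made up):

    import torch
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
    from PIL import Image

    # separate ControlNet weights, applied on top of the frozen base model
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16)
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    sketch = Image.open("rough_terrain_sketch.png").convert("RGB")  # your drawing
    image = pipe(
        "rocky terrain at dusk, cinematic lighting",
        image=sketch,              # the ControlNet conditioning input
        num_inference_steps=30,
    ).images[0]
    image.save("terrain.png")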
These datasets would definitely have a lot of Text => Sketch pairs as well. I wonder if its possible to extrapolate from Text => Sketch and Text => Image pairs to improve Sketch => Image capabilities. The models must be doing some notion of it already.
Everything is text-to-X because it's less friction and therefore more fun. It's more a marketing thing.
There are many workflows for using generative AI to adhere to specific functional requirements (the entire ComfyUI ecosystem, which includes tools such as LoRAs/ControlNet/InstantID for persistence) and there are many startups which abstract out generative AI pipelines for specific use cases. Those aren't fun, though.
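For example, ComfyUI exposes an HTTP API, so a workflow built in the UI can be exported in API format and driven from a script (a rough sketch; assumes a local server on the default port and a "workflow_api.json" you exported yourself):

    import json
    import urllib.request

    with open("workflow_api.json") as f:
        workflow = json.load(f)

    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(
        "http://127.0.0.1:8188/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))  # returns a prompt id you can poll for results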
Huh "everything text-to-X"? Most video gen AI has image-to-video option too either as a start or end frame or just as a reference for subjects and environment to include in the video. Some of them even has video-to-video options too, to restyle the visuals or reuse motions from the reference video.
LLMs were entirely text not that long ago.
Multi modality is new; you won’t have to wait too long until they can do what you’re describing.
You can do image+text as well (although maybe the results are better if you do raw image to prompted image to video?)
Image-to-image and speech-to-speech exist; yes, almost everything is text-to-X, but there are exceptions.
I want ...-to-3D-scene. Then I can use Blender to render the resulting picture and/or vid. Be it "text-to-3D-scene" or "image-to-3D-scene".
And there's a near-infinite amount of data out there to train "image-to-3D-scene" models. You can literally take existing assets and render them from different angles, with different lighting, different backgrounds, etc.
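A few lines of Blender's Python API would already churn out the 2D views for those pairs (a rough sketch; the camera orbit, view count, and file paths are just illustrative):

    import math
    import bpy
    from mathutils import Vector

    scene = bpy.context.scene
    cam = scene.camera                  # assumes the .blend file already has a camera
    target = Vector((0.0, 0.0, 0.0))    # orbit around the scene origin
    radius, height = 10.0, 3.0

    for i in range(8):                  # 8 views around the scene
        angle = 2 * math.pi * i / 8
        cam.location = (radius * math.cos(angle), radius * math.sin(angle), height)
        direction = target - cam.location
        cam.rotation_euler = direction.to_track_quat('-Z', 'Y').to_euler()
        scene.render.filepath = f"//renders/view_{i:02d}.png"
        bpy.ops.render.render(write_still=True)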
I've seen a few inconclusive demos of "...-to-3D-scene", but this is 100% coming.
I can't wait to sketch out a very crude picture and have an AI generate me a 3D scene out of that.
> ... in short, something that allows minute creative control just like current tools do?
With 3D scenes generated by AI, one can decide to just render the scene as is (with proper lighting, by the way) or take all the creative control one wants.
I want this now. But I'll settle with waiting a bit.
P.S.: same for songs and sound FX, by the way... I want the AI to generate stuff I can import into an open-source DAW. And this is 100% coming too.