Yes. Adding alpha channels would be step one. Then perhaps incorporate the "element" concept: in VFX, an element is any identifiable visual component that can be composited. Then build a full scene-description prose format to hand to a video AI, written in high-level language where appropriate and element-specific where necessary. Base that scene description on the language filmmakers use directly, adopt their terminology wholesale, and track the industry's jargon within the models. That way anyone working in media will auto-magically know how to control them.
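To make that concrete, here is a minimal sketch of what an element-aware scene description might look like, using filmmakers' vocabulary for the high-level fields. Everything here is hypothetical (the ScenePrompt and Element structures, the field names); it is one way to shape the idea, not an existing format:

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    """One composite-capable visual element, delivered with an alpha channel."""
    name: str          # e.g. "courier_fg", the isolated foreground plate
    description: str   # element-specific prose for the model
    matte: str = "premultiplied alpha"  # what kind of matte to request

@dataclass
class ScenePrompt:
    """A scene description written in filmmakers' own terminology."""
    shot: str          # "medium close-up", "wide establishing", ...
    lens: str          # "50mm spherical", "35mm anamorphic", ...
    camera_move: str   # "slow dolly-in", "locked off", ...
    lighting: str      # "low-key, motivated practicals", ...
    action: str        # high-level prose describing the scene
    elements: list[Element] = field(default_factory=list)

prompt = ScenePrompt(
    shot="medium close-up",
    lens="50mm spherical",
    camera_move="locked off",
    lighting="soft key from camera left, cool rim",
    action="A courier hesitates at a rain-streaked doorway.",
    elements=[
        Element("courier_fg", "the courier, isolated on alpha"),
        Element("rain_fx", "foreground rain pass"),
    ],
)
```

The point of the structure is that the high-level fields read like a shot list while each element carries its own matte request, so the same prompt serves both the generalist director view and the compositor's element-wise view.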

We are at a point where the problem is no longer how to write the software but how to describe what we want to the software. Video and filmmaking are so open-ended that AI needs more information. Typically that information comes from the consistency a director and their team maintain throughout production. AI has neither the information needed for visual consistency nor the narrative, and the perspective on that narrative, that a human director and crew bring. In time, AI will develop large enough context windows, but will the hardware to run them be affordable? There is a huge amount of context in an entire script and in the worldview a film crew brings to any script, and for that reason I think many of the traditional film roles (VFX included) are not going to suddenly disappear. AI video does not replace their consistency at their budget, hands down.

When AI video can be just one part of the skill set, for example when its output is compatible with compositing and editing pipelines and it understands that terminology, it will be adopted more widely. Right now it is designed as an all-or-nothing offering.
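As a concrete example of what "compatible with compositing" means in practice: if the model emits elements with premultiplied alpha, they drop straight into the standard Porter-Duff "over" operation that every compositing package builds on. A minimal numpy sketch (the over function and the toy arrays are illustrative, not any particular tool's API):

```python
import numpy as np

def over(fg: np.ndarray, bg: np.ndarray) -> np.ndarray:
    """Porter-Duff 'over': composite a premultiplied-RGBA foreground onto a background.

    fg, bg: float arrays of shape (H, W, 4), premultiplied alpha, values in [0, 1].
    """
    a_fg = fg[..., 3:4]            # foreground alpha, kept as (H, W, 1) for broadcasting
    return fg + (1.0 - a_fg) * bg  # standard premultiplied over, applied to all channels

# Hypothetical usage: an AI-generated element with alpha drops straight into the comp.
fg = np.zeros((4, 4, 4))
fg[1:3, 1:3] = [0.8, 0.2, 0.1, 0.8]  # a premultiplied red patch with 80% alpha
bg = np.ones((4, 4, 4))              # solid white background
comp = over(fg, bg)
```

Today's models hand back an opaque finished frame, which is exactly why they can only be an all-or-nothing offering; an alpha-carrying element slots into an existing pipeline instead of replacing it.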