> Video generation models by definition are either predicting in noise or pixel space
I don't see that this follows "by definition" at all.
Just because your output is pixel values doesn't mean your internal world model is in pixel space.
You need to train a decoder either end to end or conditioned on latents.
In either case, the impressiveness of that decoder can be far removed from the effectiveness of your world model, or may involve no world model at all.
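To make the decoder/world-model separation concrete, here is a toy sketch (hypothetical linear models, not any real architecture): the "world model" predicts entirely in latent space, and a separate decoder only touches pixels at render time, so the two can have completely independent quality.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM, PIXEL_DIM = 4, 16

# Hypothetical latent dynamics ("world model"): z_{t+1} = A @ z_t.
A = rng.normal(size=(LATENT_DIM, LATENT_DIM)) * 0.5

# Hypothetical decoder: pixels = D @ z. Could be trained end to end
# or separately, conditioned on latents; either way it is a distinct module.
D = rng.normal(size=(PIXEL_DIM, LATENT_DIM))

def rollout(z0, steps):
    """Predict future latents with the dynamics model, then decode to pixels."""
    z, frames = z0, []
    for _ in range(steps):
        z = A @ z             # all prediction happens in latent space...
        frames.append(D @ z)  # ...pixels only appear at decode time
    return np.stack(frames)

frames = rollout(rng.normal(size=LATENT_DIM), steps=5)
print(frames.shape)  # 5 decoded "frames" of 16 "pixels" each
```

A photorealistic `D` would make the rollout look impressive even if `A` were physically nonsensical, which is the point being made above.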
> Making convincing videos of the world without having a world model would be like writing convincing essays about computing without understanding computing.
There are a few things to consider here:

- There are many aspects of these videos that are not convincing, indicating that videogen models do not grok the world the way a typical human does.
- A six-year-old child is almost certainly incapable of recreating pixel-level-fidelity video footage, yet understands the world extremely well... far beyond what current robotics is capable of.
The two facts above suggest that predicting noise (as with DDPM diffusion models), or predicting pixel-level (or even VAE-latent "pixel") information, is probably not the optimal path to world understanding, and probably not even a good path to good world models.
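For reference, "predicting noise" here means the standard DDPM training setup (Ho et al. 2020), sketched below with NumPy under a linear beta schedule: the regression target is literally the Gaussian noise added in pixel space, with no explicit object or physics representation required.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # standard linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product \bar{alpha}_t

def ddpm_training_pair(x0, t):
    """Return (noisy input x_t, regression target eps) for timestep t."""
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

x0 = rng.normal(size=(8, 8))  # stand-in for an image/video frame
x_t, target = ddpm_training_pair(x0, t=500)

# The loss would be mean((eps_theta(x_t, t) - target)**2):
# pure noise regression in pixel (or VAE-latent "pixel") space.
print(x_t.shape, target.shape)
```

Nothing in this objective asks the network to represent objects, agents, or dynamics; any world knowledge it acquires is incidental to denoising, which is the crux of the argument above.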