I feel like there's a bit of a disconnect between the cool video demos shown here and, say, the kind of world models someone like Yann LeCun is talking about.
A proper world model like JEPA should predict in latent space, where the representation of what is going on is highly abstract.
Video generation models by definition are predicting in either noise or pixel space (latent noise, if the diffusion model is diffusing in a variational autoencoder's latent space).
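To make the contrast concrete, here's a toy numpy sketch of where the two losses live. Nothing here is any lab's actual code: the "encoder" and "predictor" are random stand-ins, and the noise prediction is a perfect oracle, purely to show which space each objective operates in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a "frame" is a flat pixel vector; an "embedding" is much smaller.
frame_t = rng.normal(size=4096)
frame_next = rng.normal(size=4096)

# Diffusion-style video models are trained to predict added noise (or the
# clean pixels) in pixel/VAE-latent space -- the loss lives at pixel fidelity.
noise = rng.normal(size=4096)
noisy = frame_next + noise
predicted_noise = noisy - frame_next          # perfect oracle, for illustration only
pixel_loss = np.mean((predicted_noise - noise) ** 2)

# A JEPA-style objective instead compares embeddings: encode both frames,
# predict the next embedding from the current one, and take the loss there.
W_enc = rng.normal(size=(64, 4096)) / 64      # frozen random "encoder" (stand-in)
W_pred = np.eye(64)                           # trivial "predictor" (stand-in)
z_t = W_enc @ frame_t
z_next = W_enc @ frame_next
latent_loss = np.mean((W_pred @ z_t - z_next) ** 2)

# The latent loss never touches pixels, so a real JEPA is free to discard
# pixel-level detail that is irrelevant to predicting what happens next.
```

The point of the sketch is only the shape of the objective: the first loss is computed over 4096 pixel values, the second over 64 abstract coordinates.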
It seems like what this lab is doing is fairly vanilla, and I'm wondering whether they are doing any research into less demo-sexy joint-embedding predictive spaces.
There was a recent paper, LeJEPA, from LeCun and a coauthor that addresses many of the representation-collapse issues with the JEPA embedding models I just mentioned.
I'm waiting on the startup or research group that gives us an unsexy world model: instead of 1080p video of supermodels camping, a slideshow of something a six-year-old child would draw. That would be a more convincing demonstration of an effective world model.
> Video generation models by definition are either predicting in noise or pixel space
I don't see that this follows "by definition" at all.
Just because your output is pixel values doesn't mean your internal world model is in pixel space.
You need to train a decoder, either end to end or conditioned on latents. In either case, the impressiveness of that decoder can be far removed from the effectiveness of your world model, or can involve no world model at all.
Making convincing videos of the world without having a world model would be like writing convincing essays about computing without understanding computing.
There are a few things to consider here:

- There are many aspects of these videos that are not convincing, indicating that videogen models do not grok the world the way a typical human does.
- A six-year-old child is almost certainly incapable of recreating pixel-level-fidelity video footage, yet understands the world extremely well, far beyond what current robotics is capable of.
The two facts above suggest that predicting noise (as with DDPM diffusion models), or predicting pixel-level (or even VAE-latent "pixel") information, is probably not the optimal path to world understanding, and probably not even a good path to good world models.
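For reference, the "predicting noise" objective mentioned above is, in a DDPM, roughly the following. This is a toy numpy sketch under simplifying assumptions: `alpha_bar` is a single made-up schedule value rather than a full noise schedule, and `eps_theta` is a hypothetical oracle standing in for a learned denoising network.

```python
import numpy as np

rng = np.random.default_rng(1)

x0 = rng.normal(size=256)        # a "clean" pixel (or VAE-latent) vector
alpha_bar = 0.5                  # cumulative noise-schedule term at step t (made up)
eps = rng.normal(size=256)       # the Gaussian noise the model must recover

# DDPM forward process: corrupt x0 with noise at level alpha_bar.
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps

def eps_theta(x_t_in):
    # Hypothetical oracle denoiser, for illustration; a real one is a
    # learned U-Net or transformer that never sees x0 directly.
    return (x_t_in - np.sqrt(alpha_bar) * x0) / np.sqrt(1 - alpha_bar)

# Training loss: mean squared error between predicted and true noise,
# taken per pixel/latent dimension.
loss = np.mean((eps_theta(x_t) - eps) ** 2)
```

Note that the loss is defined dimension-by-dimension in pixel (or VAE-latent) space; nothing in the objective itself forces the network's internal features to form an abstract world state.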
Dreamer4 (https://danijar.com/project/dreamer4/) is a promising direction (by a frontier lab).