I really hate the world model terminology, but LeCun's actual low-level gripe with autoregressive LLMs as they stand now is that the loss function needs to reconstruct the entirety of the input. Anything less than pixel-perfect reconstruction on images is penalized. Token-by-token reconstruction is biased towards that same level of granularity.
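To make the pixel-perfect point concrete, here is a toy sketch in plain Python. The 1-D "image", the one-pixel shift and the `mse` helper are all my own illustrative assumptions, not anything from LeCun's papers. It compares a prediction that has the right texture but is off by one pixel against a prediction that is just flat gray.

```python
def mse(a, b):
    """Mean squared error between two equal-length pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Toy 1-D "image": fine black/white stripes (0, 1, 0, 1, ...).
image = [float(i % 2) for i in range(64)]

# Prediction A: the same stripes, shifted by one pixel (wraps around).
shifted = image[1:] + image[:1]

# Prediction B: a blurry guess, the mean gray value everywhere.
mean_value = sum(image) / len(image)
blurry = [mean_value] * len(image)

loss_shifted = mse(image, shifted)  # every pixel is wrong
loss_blurry = mse(image, blurry)    # every pixel is half wrong

print(loss_shifted, loss_blurry)
```

The shifted prediction is arguably the better description of the scene, yet the pixel loss rates it worse than the gray blur, because the loss only cares about exact per-pixel agreement.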

The density of information in the spatiotemporal world is extremely high, and a technique is needed to compress it effectively. JEPAs are a promising step in that direction, but if you're not reconstructing text or images, it's a bit harder for humans to immediately grok whether the model is learning something useful.

I think that very soon we will see JEPA-based language models, but their key domain may very well be robotics, where machines really need to experience and reason about the physical world differently than a purely text-based world.

Isn't the Sora video model a ViT with spatiotemporal inputs (so they've found a way to compress that down), but at the same time LeCun wouldn't consider that a world model?

Video generation models have to have decoder output heads that reproduce pixel-level frames. The loss function involves producing plausible image frames, which requires a lot of detailed reconstruction.

I assume that when you get out of bed in the morning, the first thing you do isn't to paint 1000 1080p pictures of what your breakfast looks like.

LeCun's models predict purely in representation space and output no pixel-scale detailed frames. Instead, you train a model to generate a lower-dimensional representation of the same thing from different views, penalizing it if the representation is different when looking at the same thing.
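A rough sketch of what "penalize in representation space" means, in plain Python. Loud caveat: `encode` here is a fixed block-average I made up as a stand-in for a learned encoder, and `render` is a hypothetical helper that builds toy views. As far as I understand it, a real JEPA learns the encoder, adds a predictor network, and needs some trick to stop every input collapsing to the same representation. None of that is shown here; this only illustrates where the loss is measured.

```python
def encode(pixels, block=8):
    """Hypothetical stand-in for a learned encoder: average each block of pixels."""
    return [sum(pixels[i:i + block]) / block for i in range(0, len(pixels), block)]

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def render(content, detail_sign, block=8):
    """Expand coarse content into pixels and add fine detail with the given sign."""
    pixels = []
    for value in content:
        for i in range(block):
            detail = 0.25 if i % 2 == 0 else -0.25
            pixels.append(value + detail_sign * detail)
    return pixels

scene = [0.0, 1.0, 0.0, 1.0]
view_a = render(scene, +1)                # same scene, one fine texture
view_b = render(scene, -1)                # same scene, opposite fine texture
other = render([1.0, 0.0, 1.0, 0.0], +1)  # a genuinely different scene

pixel_loss = mse(view_a, view_b)
same_scene_loss = mse(encode(view_a), encode(view_b))
different_scene_loss = mse(encode(view_a), encode(other))

print(pixel_loss, same_scene_loss, different_scene_loss)
```

The two views of the same scene disagree at every pixel, but their 4-number representations match, so the representation-space loss is zero. The different scene is still penalized, which is the behavior you want the trained model to end up with.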