> Visually, they are stunning.

The input images are stunning, model's result is another disappointing trip to uncanny valley. But we feel Ok as long as the sequence doesn't horribly contradict the original image or sound. That is the world model.

  > But we feel Ok as long as the sequence doesn't horribly contradict the original image or sound. 
Is the error I pointed out not "horribly contradicting"?

  > That is the world model.
I would say that if it is non-physical[0] then it's hard to call it a /world/ model. A world is consistent and has a set of rules that must be followed.

I've yet to see a claimed world model that actually captures this behavior. Yet it's something every game engine[1] gets very well. We'd call it a bad physics engine if they made the same mistakes we see even the most advanced "world models" do.

This is part of why I'm trying to explain that visual quality is actually orthogonal. Even old Atari games have consistent world models despite being pixelated. Or think about Mario on the original NES. Even the physics breaking in that game are more edge cases and not the norm. But here, things like the lion's tail is not consistent even to a 2D world. I've never bought the explanation that teleporting in front of and behind the leg is an artifact of embedding 3D into 2D[2] because the issue is actually the model not understanding collision and occlusion. It does not understand how the sections relate to one another in the image.

The major problem with these systems is that they just hope that the physics is recovered through enough examples of videos. Yet if one studied physics (beyond your basic college courses) you'd understand the naïveté of that. It took a long time to develop physics due to these specific limitations. These models don't even have the advantage of being able to interact with the environment. They have no mechanisms to form beliefs and certainly no means to test them. It's essentially impossible to develop physics through observation alone

[0] with respect to the physics of the world being simulated. I want you distinguish real world physics from /a physics/

[1] a game physics engine is a world model. Which, as in stressing in [0], does not necessarily need follow real world physics. Mistakes happen of course but things are generally consistent.

[2] no video and almost no game is purely 2D. They tend to have backgrounds which places some layering but we'll say 2D for convenience and since we have a shared understanding

> The major problem with these systems is that they just hope that the physics is recovered through enough examples of videos. Yet if one studied physics (beyond your basic college courses) you'd understand the naïveté of that. It took a long time to develop physics due to these specific limitations. These models don't even have the advantage of being able to interact with the environment. They have no mechanisms to form beliefs and certainly no means to test them. It's essentially impossible to develop physics through observation alone

Sounds like these world models are speed running from Platonic ideals to empiricism.

[dead]

>A world is consistent and has a set of rules that must be followed.

Large language models are mostly consistent, but they have mistakes even in grammar too, from time to time. And it's usually called a "hallucination". Can't we say physics errors are a kind of "hallucination" too, in a world model? I guess the question is, what hallucination rate are we willing to tolerate.

It's not about making no mistakes, it's about the categorical type of mistakes.

Let's consider language as a world, in some abstract sense. Lies may (or may not) be consistent here. Do they make sense linguistically? But then think about the category of errors where they start mixing languages and sound entirely nonsensical. That's rare with current LLMs in standard usage but you can still get them to have full on meltdowns.

This is the class of mistakes these models are making, not the failing to recite truth class of mistakes.

(Not a perfect translation but I hope this explanation helps)

> Yet if one studied physics (beyond your basic college courses) you'd understand the naïveté of that.

I studied enough physics to get a mech. eng. diploma. And I still understand the naivete. Observational physics can be derived with ml, and I have derived them, but not with neural nets. Or if you do it with neural nets, you can't alpha zero it, you need to cheat.