I think the issue is that "world models" are poorly defined.

With this kind of image gen, you can sorta plan robot interactions, but it's super slow. I need to find the paper that deepmind produced, but basically they took the current camera input, used a text prompt like "robot arm picks up the ball", generated a video of the arm motion, and then the robot arm moved to match the video.

The problem is that it's not really a world model, it's just image gen. It's not like the model outputs a simulation that you can interact with (without generating more video). It's not like it creates a bunch of rough geometry that you can then run physics on (i.e. you imagine a setup, draw it out, and then run calcs on it).

There is lots of work on making splats editable and semantically labeled, but again, it's not like you can run physics on them, so simulation is still very expensive. Also, the properties depend on running the "world model" rather than querying the output at a point in time.

  > poorly defined.
Poorly defined is not the same as undefined. There are bounds, and we have a decent understanding of what the term means. Not having all the details worked out is not the same as having no definition. Though that lack of precision is being used to get away with more slop.

  > I need to find the paper that deepmind produced
I've seen that paper and its results up close, and I've even personally talked with people who worked on it. The model very frequently "forgets" what is outside its view, and it very frequently performs physically inconsistent actions. When you evaluate those models, don't just try standard things; do weird things. For example, keep trying to extend the grabber arm: it shouldn't jump to other parts of the screen.

  > The problem is that its not really a world model, its just image gen.
Yes, that was my point. Since you agree, I'm not sure why you're disagreeing.

I don't think I'm disagreeing, just adding more colour.

> It very frequently "forgets" what is outside its view

This matches the observations we made when we were testing it. My former lab was late to pivoting to robotics, so we were surveying the current state of play to see what machine perception work was out there for robotics.

Have you looked at Titans and MIRAS, where they use an online/updating associative memory that happens to be read out via next-token prediction?

https://research.google/blog/titans-miras-helping-ai-have-lo...

https://arxiv.org/abs/2501.00663

https://arxiv.org/pdf/2504.13173

Much research is going into these directions, but I'm more interested in mind-wandering tangents, involving both attentional control and additional mechanisms (memory retrieval, self-referential processing).
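To make the "online/updating associative memory" idea concrete: the Titans paper updates a learned memory at test time by gradient descent on a surprise (reconstruction) loss. Here's a toy sketch of that idea with a plain linear key-value memory in numpy; the actual paper uses an MLP memory with momentum and learned gates, so the learning rate, decay, and linear form here are all my simplifications, not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Toy linear associative memory M that maps keys -> values.
# "Surprise" = how badly the memory currently predicts v from k;
# the memory is corrected in proportion to that surprise, with a
# small decay acting as forgetting.
M = np.zeros((d, d))
lr, decay = 0.5, 0.01  # hand-picked toy values, not from the paper

def write(M, k, v):
    err = M @ k - v  # gradient of 0.5 * ||M @ k - v||^2 w.r.t. (M @ k)
    return (1 - decay) * M - lr * np.outer(err, k)

def read(M, k):
    return M @ k

k = rng.standard_normal(d)
k /= np.linalg.norm(k)  # unit key keeps the toy update stable
v = rng.standard_normal(d)

for _ in range(50):
    M = write(M, k, v)

# After repeated surprising writes, reading with k recovers (roughly) v.
print(np.allclose(read(M, k), v, atol=0.1))
```

The point of the sketch is just the shape of the mechanism: memory is a parametric function updated online by a surprise signal, and "recall" is a forward pass, which is why it can be read out through next-token prediction.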

Memory in world models is interesting. But I think the main issue is that it holds everything in pixel space (it doesn't, strictly, but it feels like that) rather than concept space. That's why it's hard for it to synthesise consistently.

However, I'm not really qualified to make that assertion.

Ah, thanks for the clarification. It can be hard to interpret on these forums sometimes.