I don't think I'm disagreeing, just adding more colour.
> It very frequently "forgets" what is outside its view
This matches what we observed when we were testing it. My former lab was late in pivoting to robotics, so we were surveying the current state of play in machine perception for robotics.
Have you looked at Titans and MIRAS, where they use an online, updating associative memory that happens to be read out via next-token prediction? (Rough sketch of my reading of the mechanism after the links below.)
https://research.google/blog/titans-miras-helping-ai-have-lo...
https://arxiv.org/abs/2501.00663
https://arxiv.org/pdf/2504.13173
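For concreteness, here's a toy sketch of how I understand the Titans-style update (heavily simplified from arXiv:2501.00663; the class, names, and hyperparameters are mine, not the paper's code): the memory is itself a small MLP whose weights are updated at test time by a gradient step on a "surprise" loss, with weight decay acting as a forgetting gate.

    import torch
    import torch.nn as nn

    class NeuralMemory(nn.Module):
        """Toy test-time-updated associative memory (illustrative only)."""
        def __init__(self, dim, lr=0.01, forget=0.05):
            super().__init__()
            # The memory M is itself a small MLP, not a KV cache.
            self.memory = nn.Sequential(
                nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
            self.to_k = nn.Linear(dim, dim, bias=False)  # key projection
            self.to_v = nn.Linear(dim, dim, bias=False)  # value projection
            self.lr, self.forget = lr, forget

        def write(self, x):
            # "Surprise" = how badly M currently maps keys to values;
            # one gradient step on it is the online memory update.
            # (The paper also adds momentum on this surprise term.)
            k, v = self.to_k(x), self.to_v(x)
            params = list(self.memory.parameters())
            loss = (self.memory(k) - v).pow(2).mean()
            grads = torch.autograd.grad(loss, params)
            with torch.no_grad():
                for p, g in zip(params, grads):
                    p.mul_(1.0 - self.forget)  # weight decay = forgetting gate
                    p.sub_(self.lr * g)        # write by gradient descent
            return loss.item()

        def read(self, x):
            return self.memory(self.to_k(x))  # recall via the learned map

    mem = NeuralMemory(dim=64)
    for _ in range(100):                 # streaming inputs at "test time"
        mem.write(torch.randn(8, 64))    # memory keeps adapting online
    out = mem.read(torch.randn(8, 64))   # read-out at inference

The interesting bit, to me, is that "writing" is literally optimisation at inference time, so the memory's capacity isn't bounded by a context window.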
Much research is going in these directions, but I'm more interested in mind-wandering tangents, involving both attentional control and additional mechanisms (memory retrieval, self-referential processing).
Memory in world models is interesting. But I think the main issue is that it's holding everything in pixel space (it's not, strictly, but it feels like that) rather than concept space. That's why it's hard for it to synthesise consistently.
However, I'm not really qualified to make that assertion.
Ah, thanks for the clarification. It can be hard to interpret on these forums sometimes.