Have you looked at Titans and MIRAS, which use an online, continually updated associative memory that happens to be read out via next-token prediction?
https://research.google/blog/titans-miras-helping-ai-have-lo...
https://arxiv.org/abs/2501.00663
https://arxiv.org/pdf/2504.13173
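For anyone who hasn't read them, here is a toy sketch of what I mean by an online associative memory. It is my own simplification (a linear memory updated with one gradient step per write), not the architecture from either paper, and the function names and the `lr` and `decay` parameters are made up for illustration.

```python
import random


def make_memory(dim):
    """A dim x dim linear memory, initialised to zeros."""
    return [[0.0] * dim for _ in range(dim)]


def read(memory, key):
    """Recall: multiply the memory matrix by the key vector."""
    return [sum(w * k for w, k in zip(row, key)) for row in memory]


def write(memory, key, value, lr=0.5, decay=0.0):
    """One online gradient step on 0.5 * ||M k - v||^2.

    `decay` is an optional forgetting factor (assumed here, 0 disables it).
    """
    error = [p - v for p, v in zip(read(memory, key), value)]
    for i, row in enumerate(memory):
        for j in range(len(row)):
            row[j] = (1.0 - decay) * row[j] - lr * error[i] * key[j]


def unit(vec):
    """Scale a vector to unit length so the step size behaves predictably."""
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec]


rng = random.Random(0)
dim = 8
key = unit([rng.gauss(0, 1) for _ in range(dim)])
value = [rng.gauss(0, 1) for _ in range(dim)]

memory = make_memory(dim)
for _ in range(20):
    write(memory, key, value)  # memory keeps updating as data streams in

recalled = read(memory, key)
max_err = max(abs(r - v) for r, v in zip(recalled, value))
```

With a unit-length key and `lr=0.5`, each write halves the recall error for that key, so after 20 writes `recalled` is very close to `value`. The point is only that the memory is updated as data arrives rather than being fixed after training.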
A lot of research is going in these directions, but I'm more interested in mind-wandering tangents, involving both attentional control and additional mechanisms (memory retrieval, self-referential processing).
Memory in world models is interesting. But I think the main issue is that it's holding everything in pixel space (it's not, but it feels like that) rather than concept space. That's why it's hard for it to synthesise consistently.
However, I am not really qualified to make that assertion.