> The problem is going to be how to control those models to produce a universe that's temporally and spatially consistent.

Why not just have a simple, low-poly rasterizer and have AI fill in the details?

That's essentially how AMD FSR and NVIDIA DLSS work today, although they take fully rendered, lower-resolution frames as input.