Here's a recent paper showing that models trained to generate videos develop strong geometric representations and geometric understanding:

https://arxiv.org/abs/2512.19949