The novel aspect here seems to be producing 3D LiDAR output from 2D video via post-training. As far as I'm aware, no other video world model can do this.
IMO, access to DeepMind and Google infra is a hugely understated advantage Waymo has that no other competitor can replicate.
3D from moving 2D images has been a thing for decades.
This is 3D LiDAR output (multimodal) from 2D images.
LiDAR is the technology used to do the spatial capture; the output is just point clouds of surfaces. So they're generating surface point clouds from video.
It's not unheard of: there are a handful [0] of metric monodepth methods whose output is not unlike a really inaccurate 3D LiDAR, though theirs certainly looks SOTA.
[0] https://github.com/YvanYin/Metric3D
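For context on what "LiDAR-like output from video" amounts to: once you have a metric depth map (what methods like Metric3D predict), back-projecting it into a point cloud is just the pinhole camera model. A minimal sketch (the intrinsics values here are placeholders, not from any real sensor):

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project a metric depth map (H x W, meters) into an
    N x 3 point cloud using the pinhole camera model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]  # drop invalid (zero-depth) pixels

# Toy example: a flat wall 5 m away, placeholder intrinsics.
depth = np.full((4, 6), 5.0)
cloud = depth_to_pointcloud(depth, fx=500.0, fy=500.0, cx=3.0, cy=2.0)
```

The hard part is the metric depth itself; the geometry above is trivial, which is why depth accuracy (not the unprojection) is where these models differ.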