The point is that video generation is not the goal in itself. Just like classifying photos as cat vs dog wasn't the goal in 2013. I know that Sora 2 is not a world model.

But what's coming is: vision-language-action models and planning, spatial AI (SLAM with semantics, and 3D reconstruction with interactability and affordance detection), video diffusion models, photo-to-Gaussian-splats, video-to-3D (e.g. from Hunyuan), the whole DUSt3R/VGGT line of work, V-JEPA 2, etc. Or if you want product names: Gemini Robotics 1.5, Genie 3, etc. The field is progressing incredibly fast. Humanoid robots are progressing fast. Robotic hands with haptic sensors are more dexterous than ever. It's starting to work. We are only seeing the first glimpses, of course.

It's largely irrelevant in terms of intelligence. What you're describing is throwing out 2-D topological integrations (what we do to achieve optic-flow, ultra-fast reaction times in motion) and vicarious trial and error, and brute-force imposing a machine wax fruit of motion dexterity. It's simply not analog to events the way we experience them; it's been cooked up in cog-sci as imitation, but it's not even that. The more we understand the brain's architecture and process, the less relevant this gets, as it's not for legitimate long-term bio ware. There are no world models; the idea is oxymoronic, as the topological bypasses this in scale invariance. It's all a dead end, this binary, since eventually analog will rule this with minimal energy and software, and use entirely different software. Think of any industry that arrived too early: AI is irrelevant; the first step was reinventing software. It took the least efficient compute principle and drove it to irrelevance using machine vision as an endgame. The lack of redundancies is the tell.

I wonder what this fascination with human-shaped robots is, when spider-shaped robots could be more dexterous and productive.

(Unless it's sci-fi and porn that are mainly pushing for human-shaped robots.)

The built environment fits the human form factor well. Imitation learning and intuitive teleoperation are also easier. But it won't be the only form factor. The quadruped form (like Spot) is also popular, as are drones, etc.