That's really interesting. What if they run a RAG-style search for videos related to the prompt, then condition generation on what they retrieve? That might explain fidelity like this.
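
To make the retrieval idea concrete, here's a toy sketch of the "search on the prompt, then condition on the hits" step. Everything in it is made up for illustration: the clip library, the bag-of-words scoring, and the names `retrieve` and `build_conditioning` are all hypothetical, and I have no idea whether the real system does anything like this.

```python
import math
from collections import Counter

# Hypothetical clip library (id -> caption). Purely illustrative data.
CLIPS = {
    "clip_a": "mario kart 64 grand prix race on the n64",
    "clip_b": "cooking pasta in a small kitchen",
    "clip_c": "n64 boot screen and game menus",
}


def bow(text):
    """Bag-of-words vector: lowercase token -> count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(prompt, clips, k=2):
    """Return the ids of the k clips whose captions best match the prompt."""
    query = bow(prompt)
    ranked = sorted(
        clips,
        key=lambda clip_id: cosine(query, bow(clips[clip_id])),
        reverse=True,
    )
    return ranked[:k]


def build_conditioning(prompt, clips, k=2):
    """Bundle the prompt with retrieved clip ids.

    Assumption: a real generator would consume features extracted from
    the retrieved videos, not just their ids.
    """
    return {"prompt": prompt, "references": retrieve(prompt, clips, k)}


if __name__ == "__main__":
    print(build_conditioning("mario kart 64 on the n64 boot screen and menus", CLIPS))
```

A real version would presumably use learned embeddings rather than word overlap, but the shape is the same: the generator sees the prompt plus a handful of near-matching reference clips.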

An interesting counterexample is "a screen recording of the boot screen and menus for a user playing Mario Kart 64 on the N64, they play a grand prix and start to race" where the UI flow matches the real Mario Kart 64, but the UI itself is wrong: https://x.com/fofrAI/status/1973151142097154426

I like the player being in "1th" while being behind everyone else. Still crazy though.