This Sora 2 generation of Cyberpunk 2077 gameplay manages to reproduce the real game extremely closely, which is baffling: https://x.com/elder_plinius/status/1973124528680345871

> How the FUCK does Sora 2 have such a perfect memory of this Cyberpunk side mission that it knows the map location, biome/terrain, vehicle design, voices, and even the name of the gang you're fighting for, all without being prompted for any of those specifics??

> Sora basically got two details wrong, which is that the Basilisk tank doesn't have wheels (it hovers) and Panam is inside the tank rather than on the turret. I suppose there's a fair amount of video tutorials for this mission scattered around the internet, but still––it's a SIDE mission!

Everyone already assumed that Sora was trained on YouTube, but a prompt like "generate gameplay of Cyberpunk 2077 with the Basilisk Tank and Panam" would have produced incoherent slop in most other image/video models, not consistent, near-verbatim gameplay footage.

For reference, this is what you get when you give the same prompt to Veo 3 Fast (trained by the company that owns YouTube): https://x.com/minimaxir/status/1973192357559542169

> Everyone already assumed that Sora was trained on YouTube

Doesn't this already answer your question...? "Let's Play" type videos and streams have been a thing for years now, even for more obscure games. It very well could've been trained on Cyberpunk videos of that mission.

It's hard for me to believe that the model coherently memorized both the video and audio of a relatively obscure Let's Play, and that a simple prompt was enough to surface it (the term "Basilisk tank" would likely not appear in video metadata either). That's why the person who made that tweet, who has far more prompting experience than I do, was shocked.

It’s hard for you to believe, sure, and I recognize the context of who tweeted it.

I still maintain that’s the kernel it’s working from. It’s impressive, I’m just not really shocked by it as a concept.

That's really interesting. What if they run a RAG-style search for related videos from the prompt and condition the generation on those? That might explain fidelity like this.
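To make that hypothesis concrete, here's a minimal sketch of what a retrieval-conditioned pipeline could look like, with toy hashed embeddings standing in for a real encoder. Everything in it is hypothetical: the `embed` function, the `CLIP_INDEX`, and the `generator.sample` call at the end are placeholders, since nothing about Sora 2's internals is public.

```python
import hashlib

import numpy as np

# Toy stand-in for a learned text/video encoder (a CLIP-style model in a real
# system): hash each token to a seed and sum deterministic random vectors.
def embed(text: str, dim: int = 64) -> np.ndarray:
    vec = np.zeros(dim)
    for token in text.lower().split():
        seed = int(hashlib.md5(token.encode()).hexdigest()[:8], 16)
        vec += np.random.default_rng(seed).standard_normal(dim)
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Hypothetical index mapping clip transcripts/metadata to clip IDs. Nothing
# here reflects Sora's actual training data.
CLIP_INDEX = {
    "cyberpunk 2077 panam basilisk tank side mission let's play": "clip_0042",
    "mario kart 64 grand prix n64 full race gameplay": "clip_1337",
    "minecraft hardcore survival episode one": "clip_0007",
}

def retrieve(prompt: str, k: int = 1) -> list[str]:
    """Return the k clip IDs whose metadata embeds closest to the prompt."""
    q = embed(prompt)
    scored = sorted(CLIP_INDEX.items(),
                    key=lambda kv: float(q @ embed(kv[0])),
                    reverse=True)
    return [clip_id for _, clip_id in scored[:k]]

prompt = "generate gameplay of Cyberpunk 2077 with the Basilisk Tank and Panam"
refs = retrieve(prompt)
print(refs)  # the Cyberpunk clip should score highest here
# A generator would then condition on the retrieved clips, e.g. (hypothetical):
# video = generator.sample(prompt, reference_clips=refs)
```

Even a coarse retrieval step like this could explain why prompt terms absent from video metadata (like "Basilisk tank") still surface the right content, if the index were built over transcripts or visual embeddings rather than titles.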

An interesting counterexample is "a screen recording of the boot screen and menus for a user playing Mario Kart 64 on the N64, they play a grand prix and start to race" where the UI flow matches the real Mario Kart 64, but the UI itself is wrong: https://x.com/fofrAI/status/1973151142097154426

I like the player being in "1th" while being behind everyone else. Still crazy though.