>It’s fucking video made by a computer after you type a sentence

Geez. Breathe a little bit. It's always weird to see somebody who had zero involvement in the creation, engineering, or design of a product get so disproportionately defensive about it.

Text-to-video has been in development for close to two years now, going back to LTX, CogVideo, and AnimateDiff, so there's naturally going to be a little less breathless enthusiasm.

If you had experience with even the locally hostable stuff like Hunyuan, Wan 2.2, VACE, etc., you'd probably be less impressed as well. The video they demoed had more fast cuts than a Michael Bay movie, illustrating the exact problem that video models STILL suffer from: a failure to generate anything longer than 60 seconds. In fact, I didn't see a single shot longer than about 10 seconds. Maybe it's tailor-made for an ADHD audience that grew up on Vine videos.

On a more positive note, the physics have definitely improved, though you can tell in some of the shots that coherency degrades the longer a scene goes on (see the volleyball clip).

I'm assuming you've played more with AI video creation than I have.

Was there anything impressive here, or is this mostly "meh"? I didn't see it solve any of the known problems with AI video, but maybe it's solving something I didn't know was a problem?

That's really what I'm trying to figure out with this announcement. Seeing hundreds of comments about how impressive this is, with none really discussing why, leaves me wondering what part of the hype I'm missing.

Good question. The one thing Altman really seemed keen to play up was the whole "integrate yourself into the video" feature, which from what I watched is definitely a step beyond the more conventional image-to-video models.

Depressingly, that's probably a killer feature, since if there's one thing people want to see more of, it's themselves.

IMHO, the fact that they're positioning this as a sort of infinite doom-scrolling TikTok also lends support to the idea that their models are still only suitable for relatively short videos, since coherency probably falls off a cliff after 30-60 seconds.