The video-to-audio examples are really impressive! The video featuring the band showcases some of the obvious shortcomings of this method (humans have very precise expectations about the kinds of sounds five trombones will make), but the tennis example shows its strengths (decent timing of the hit sounds, eerily accurate acoustics for the large indoor space). I'm very excited to see how this improves a few more papers down the line!

That said, there were a lot of shortcomings:

- The woman playing what I think was an Erhu[1] seemed to be imitating traditional music played on that instrument, but really badly (it sounded much more like a human voice than the actual instrument does). Also, I'm not even sure whether the model could tell which instrument it was, or whether it was picking up on other cues from the video (which could be problematic, e.g. if it profiles people based on their race and attire).

- Most of the sound was noticeably delayed relative to the visual cues; I'm not sure why.

- The nature sounds were pretty muddy.

- (I realize this one is video-to-music, but) the video with pumping, upbeat music set to the text "Maddox White witnessed his father getting butchered by the Capo of the Italian mob" was almost comically out of touch with its source.

Nevertheless, it's an interesting demo, and it highlights yet another AI application in which I expect we'll see massive improvements over the next few years. So despite the shortcomings, I agree it's still quite impressive.

[1] https://en.wikipedia.org/wiki/Erhu