You should look at this Google paper (came out a few days ago):

https://video-zero-shot.github.io/