It's likely based on just the transcript. Even when it describes visual things, it's probably guessing them from the transcript text alone.
Maybe it's better now, but that was how it worked recently. To be convinced that it "watches" the video, I would need to see it refer to facts that can only be known from the video and can't be guessed from the audio.
You can try it with your own recorded video. I record myself doing exercises and Gemini gives me really good feedback on my form.