> No transcription, no frame captioning, no intermediate text.

If there is text in the video (like a caption or whatever), will the embedding capture that? I'd never thought about this before.

If the video has audio, does the embedding capture that too?

Yes to both. The embedding is computed over the raw video frames, so anything visible (text, signs, captions) gets captured in the vector. And Gemini Embedding 2 extracts the audio track and embeds it alongside the visual frames, so a query like 'someone yelling' should, in theory, match on audio. My dashcam footage doesn't have audio though, so I haven't tested that side yet.
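For anyone wondering what "match" means here, this is a rough sketch of the retrieval step only, assuming the query and the clips have already been embedded into the same vector space. The vectors, clip names, and the `rank_clips` helper are made up for illustration; real embeddings have far more dimensions, and nothing below calls the actual embedding API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_clips(query_vec, clip_vecs):
    """Return (clip_name, score) pairs, best match first."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in clip_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional stand-ins for real video embeddings (hypothetical values).
clip_vecs = {
    "clip_stop_sign.mp4": [0.9, 0.1, 0.0],
    "clip_yelling.mp4": [0.1, 0.2, 0.9],
    "clip_empty_road.mp4": [0.3, 0.9, 0.1],
}

# Toy stand-in for the embedding of a text query such as 'someone yelling'.
query_vec = [0.0, 0.1, 1.0]

ranked = rank_clips(query_vec, clip_vecs)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

If on-screen text or audio really does end up in the clip vector, it shifts that vector toward queries describing it, and that is the only reason a text query can surface it. There is no separate text or audio index in this picture.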