Totally valid concern. Right now the cost ($2.50/hr) and latency make continuous real-time indexing impractical, but that won't always be the case. This is one of the reasons I'd want to see open-weight local models for this, keeps the indexing on your own hardware with no footage leaving your machine. But you're right that the broader trajectory here is worth thinking carefully about.

It's 2.50 an hour because Google has margins. A nation state could do it at cost, and even if it's not a huge difference, the price of a year's worth of embeddings is just $21,900. That's a rounding error, especially considering it's a one time cost for footage.

Right? $2.50 an hour is trivial to a Government that can vote to invent a trillion dollars. Even just 1 million dollars is the cost of monitoring 45 real time feeds for a year. I'm sure just many very rich people would pay that for the safety of their compound.

How are you getting to $2.50/hr ? The price sheet says its 0.00079 per frame.

https://ai.google.dev/gemini-api/docs/pricing#gemini-embeddi...

From what I see the code downsamples video to 5 fps, so 1 hour of video is 3600 seconds * 5 fps = 18,000 frames. 18,000 frames * $0.00079/frame = $14.22. A couple dollars more with the overlap.

(The code also tries to skip "still" frames, but if your video is dynamic you're looking at the cost above.)

you're right that the code uses ffmpeg to downsample the chunks to 5fps before sending them, but that's only a local/bandwidth optimization, not what the api actually processes.

regardless of the file's frame rate, the gemini api natively extracts and tokenizes exactly 1 fps. the 5 fps downscaling just keeps the payload sizes small so the api requests are fast and don't timeout.

i'll update the readme to make this more clear. thanks for bringing this up.

Thanks for the details and correction.