Is there anything unique here happening for the video aspect or is it just taking snapshots over and over?
I’ve been looking for a good video summarizing / understanding model!
Nothing unique; it's just taking snapshots while processing the input. Even processing a single image adds ~0.5 s to TTFT on my machine, so for now it seems impossible to feed it live video and expect a real-time response.
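For the curious, a hedged sketch of what that snapshot approach amounts to (this is my assumption of the pipeline, not the model's actual code; `process_image` is a hypothetical stand-in for the multimodal call):

```python
# Assumed sketch: "video" input handled as periodic snapshots,
# each snapshot being a separate image call to the model.

def process_image(frame):
    # Stand-in for a multimodal model call; per the numbers above,
    # each call adds roughly 0.5 s to TTFT.
    return f"caption for frame {frame}"

def summarize_video(frames, fps=30, sample_every_s=2.0):
    step = int(fps * sample_every_s)   # frames skipped between snapshots
    sampled = frames[::step]           # e.g. one snapshot per 2 s of video
    # Latency scales linearly with the number of snapshots taken.
    return [process_image(f) for f in sampled]

# 10 s of 30 fps video -> 300 frames -> 5 snapshots -> ~2.5 s extra TTFT
captions = summarize_video(list(range(300)))
print(len(captions))
```

With per-image latency like that, even coarse sampling of a short clip stacks up quickly, which is why real-time use looks out of reach for now.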
Regarding the video capability, I haven't tested it myself, but here's a benchmark/comparison from Google [0]
[0] https://huggingface.co/blog/gemma4#video-understanding
I totally get that these are very hard problems to solve and that we're on the bleeding edge of what's possible, but I can't help but wonder when someone is going to crack real video understanding.
Sure, maybe it's still frame-by-frame, but so fast and so often that the model retains a rolling context of what's going on and can cleanly answer temporal questions: "how were packages delivered over the last hour?", etc.