I totally get these are very hard problems so solve and that we're on the bleeding edge of what's possible but I can't help and wonder when someone is going to crack real video understanding.
sure, maybe it's still frame-by-frame but so fast and so often that the model retains a rolling context of what's going on and can answer cleanly temporal questions.
"how packages were delivered over the last hour", etc.