I’d like to see embedding of actual video clips become practical in this type of workflow.
Frame-level embedding covers a lot, but it can miss many action-related searches.
Sure, I'm using Qwen2.5-VL (https://huggingface.co/collections/Qwen/qwen25-vl), which can help me understand actions like falling down, because I can give it, for example, 5 frames (downscaled to 720p) so it can work out what is happening in that part of the video.
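A minimal sketch of the frame-preparation step, assuming the frames are sampled evenly across the segment and that "720p" means scaling so the short side is at most 720 px. The function names are hypothetical, and decoding the frames and calling Qwen2.5-VL are not shown.

```python
def sample_frame_indices(start: int, end: int, n: int = 5) -> list[int]:
    """Return n evenly spaced frame indices in [start, end) (end exclusive)."""
    if n <= 0 or end <= start:
        return []
    if n == 1:
        # A single frame: take the middle of the segment.
        return [start + (end - start - 1) // 2]
    step = (end - 1 - start) / (n - 1)
    # First and last indices land exactly on the segment boundaries.
    return [start + round(i * step) for i in range(n)]


def downscale_to_720p(width: int, height: int) -> tuple[int, int]:
    """Scale so the short side is at most 720 px, keeping the aspect ratio."""
    short = min(width, height)
    if short <= 720:
        return width, height  # already small enough, leave untouched
    # Integer multiply before dividing to limit float error; round to even
    # dimensions, which many image/video pipelines prefer.
    new_w = round(width * 720 / short / 2) * 2
    new_h = round(height * 720 / short / 2) * 2
    return new_w, new_h


if __name__ == "__main__":
    # e.g. a 100-frame segment of a 1080p clip
    print(sample_frame_indices(0, 100, 5))
    print(downscale_to_720p(1920, 1080))
```

Sampling across the whole segment rather than taking consecutive frames gives the model a coarse view of motion over time, which is what makes an action like falling down recognizable where a single frame embedding might not be.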