What models did you use for the stages? I see Qwen2.5-VL-7B-Instruct mentioned as an advanced option, so I assume the default is Qwen2.5-VL-3B-Instruct. That's what I use for a lot of stuff too: it's incredibly good at "clean" OCR but, as you seem to indicate, not the best at "describing a scene".
EDITED: I didn't realize Whisper was a local model. I've never tried transcription before, so I had always figured it was a paid model from OpenAI. I'll have to check it out (although the runtime listed here is a bit daunting).
For that project, I don't see much degradation in embedding quality at resolutions far below 720p (all the way down to 240p), which speeds things up considerably. That said, I don't really do face or object detection, just scene embeddings. To me, any process that takes longer to process the video than to watch it is probably a no-go in general. Obviously that's a challenge for local-first analysis.
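To make the two rules of thumb above concrete, here's a rough sketch in plain Python. The function names are my own and purely illustrative, not from any project or library: one computes the frame size after downscaling to 240p while keeping the aspect ratio, the other checks whether a run was faster than just watching the video.

```python
def scaled_size(width, height, target_height=240):
    """Return (width, height) downscaled to target_height, keeping aspect ratio.

    Frames already at or below the target height are left unchanged.
    The width is rounded down to an even number, since many video
    encoders expect even dimensions (an assumption, adjust as needed).
    """
    if height <= target_height:
        return width, height
    new_width = round(width * target_height / height)
    new_width -= new_width % 2
    return new_width, target_height


def real_time_factor(processing_seconds, video_seconds):
    """Processing time divided by video duration; below 1.0 beats real time."""
    if video_seconds <= 0:
        raise ValueError("video_seconds must be positive")
    return processing_seconds / video_seconds


def faster_than_watching(processing_seconds, video_seconds):
    """True if the pipeline finished in less time than the video runs."""
    return real_time_factor(processing_seconds, video_seconds) < 1.0


if __name__ == "__main__":
    w, h = scaled_size(1280, 720)
    print(w, h)
    # Pixel reduction relative to the 720p source frame.
    print((1280 * 720) / (w * h))
    print(faster_than_watching(processing_seconds=90, video_seconds=60))
```

Going from 1280x720 to 240p is roughly 9x fewer pixels per frame, which is where most of the speedup would come from. How much of that shows up in wall-clock time depends on the model and the decode path.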