The model seems pretty shitty. Does it only look at the video frame by frame? Give it literally one second of video context and it would never make that mistake.