Yeah, sorry, that was unclear on my part. I chunk at the endpoint level; Whisper itself obviously processes 30s windows internally. The memory/latency thing I was referring to is more about processing longer files end to end through the pipeline, not a single Whisper pass. My FastAPI wrapper just splits the audio and runs the chunks sequentially, so total wall time scales linearly with file length. Nothing fancy.
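For the curious, it's basically this shape (a simplified sketch, not the production code; it assumes openai-whisper and pydub, and the chunk size, model size, and route name are just illustrative):

```python
import tempfile

import whisper
from fastapi import FastAPI, UploadFile
from pydub import AudioSegment

app = FastAPI()
model = whisper.load_model("base")  # illustrative; pick whatever size fits

CHUNK_MS = 30_000  # endpoint-level chunk length, made up for the sketch

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    audio = AudioSegment.from_file(file.file)
    texts = []
    # chunks run sequentially, so wall time scales linearly with file length
    for start in range(0, len(audio), CHUNK_MS):
        with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
            audio[start:start + CHUNK_MS].export(tmp.name, format="wav")
            texts.append(model.transcribe(tmp.name)["text"])
    return {"text": " ".join(texts)}
```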
Wondering the same. It can certainly run beyond 30 seconds, but at some point I'd expect the output to degrade.
Plus you could do actual batch inference instead. Or if you must carry the context forward, you could still do it linearly, but memory usage shouldn't just explode.
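E.g. something like this (untested sketch, assuming openai-whisper; the chunk filenames are made up): feed the tail of the previous chunk's text as the initial_prompt for the next one, which carries context forward while keeping memory flat.

```python
import whisper

model = whisper.load_model("base")

prev_tail = ""
texts = []
for path in ["chunk_000.wav", "chunk_001.wav"]:  # hypothetical pre-split chunks
    result = model.transcribe(path, initial_prompt=prev_tail)
    texts.append(result["text"])
    # keep only a bounded tail as context, so memory use stays constant
    prev_tail = result["text"][-200:]
print(" ".join(texts))
```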