How do the agents perform the transcription? I'm guessing just calling out to other tools like Whisper? Do all models/agents take the same approach or do they differ?
also as a parent, I love the bluey bench concept !
I am using Whisper transcription via the Groq API to transcribe the files in parallel. One caveat: I cut the transcription step out of the benchmark and had the models operate on a shared transcript folder, so the times you see are pure search-and-categorization times.
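For reference, the transcription step looked roughly like this. This is a minimal sketch assuming the Groq Python client's Whisper transcription endpoint; the model string, folder names, and worker count are illustrative, not the exact setup I ran:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from groq import Groq  # pip install groq

client = Groq()  # reads GROQ_API_KEY from the environment

def transcribe(path: Path) -> tuple[str, str]:
    """Send one audio file to Groq's Whisper endpoint and return (name, text)."""
    with path.open("rb") as f:
        result = client.audio.transcriptions.create(
            file=(path.name, f.read()),
            model="whisper-large-v3",  # illustrative model name
        )
    return path.stem, result.text

# Transcribe all episode files in parallel and write them to a shared folder
# that the agents later read from.
audio_files = sorted(Path("episodes").glob("*.mp3"))
out_dir = Path("transcripts")
out_dir.mkdir(exist_ok=True)

with ThreadPoolExecutor(max_workers=8) as pool:
    for name, text in pool.map(transcribe, audio_files):
        (out_dir / f"{name}.txt").write_text(text)
```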
Re: your question about the approach – they all attacked the problem in different ways, which I found fascinating.
Codex Spark was so fast because it noticed that Bluey announces the episode name within each episode ("This episode of Bluey is called ____."), so instead of doing a pure matching of transcript<->web description, it extracted the announced titles from the transcripts and matched only those against the episode descriptions.
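The shortcut amounts to something like the following sketch. The regex and folder layout are my guesses at the general idea, not Codex Spark's actual code:

```python
import re
from pathlib import Path

# Bluey episodes open with "This episode of Bluey is called <Title>."
TITLE_RE = re.compile(r"this episode of bluey is called ([^.?!]+)", re.IGNORECASE)

def extract_title(transcript: str) -> str | None:
    """Pull the announced title out of a transcript, if present."""
    match = TITLE_RE.search(transcript)
    return match.group(1).strip() if match else None

titles = {
    path.name: extract_title(path.read_text())
    for path in Path("transcripts").glob("*.txt")
}
# Each extracted title can then be matched against the official episode list
# instead of comparing full transcripts to web descriptions.
```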
The larger models were more careful and seemed to actually double-check their work by reading the full transcripts and matching them against the descriptions.

gpt-5.2 took a level of care that wasn't wrong, but was unnecessary.
Sonnet 4.5 (non-thinking) took the most frustrating approach. It tried to automate the pairing entirely, writing a script to match each extracted title against the official title via regex. So instead of just eyeballing the lists of extracted and official titles and matching them manually, it relied purely on the script's logging as its eyes. When the script failed to match all 52 episodes perfectly, it went into a six-iteration loop of writing increasingly convoluted regex until it reported 52 matches (which ended up pairing episodes incorrectly). It was frustrating behavior; I stopped the loop after four minutes.
In my mind, the "right way" was straightforward, but that wasn't borne out by how differently the LLMs behaved.
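If you did want to script it at all, a plain fuzzy match is about all it takes. This is a sketch of what I mean by straightforward, not what any of the models actually ran:

```python
from difflib import get_close_matches

def pair_titles(extracted: dict[str, str], official: list[str]) -> dict[str, str | None]:
    """Map each transcript file to the closest official episode title."""
    lookup = {title.lower(): title for title in official}
    pairs = {}
    for filename, title in extracted.items():
        hit = get_close_matches(title.lower(), list(lookup), n=1, cutoff=0.6)
        # None means no confident match -> eyeball that one manually.
        pairs[filename] = lookup[hit[0]] if hit else None
    return pairs
```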
Most frontier models are multi-modal and can handle audio or video files as input natively.
I'm experimenting right now with an English-to-Thai subtitle translator that feeds in the existing English subtitles as well as a mono (centre-weighted) audio track extracted with ffmpeg. This is needed because Thai has gendered particles -- word choice depends on the sex of the speaker, which is not recorded in the English text. The AIs can infer this to a degree, but they do better when given audio so that they can do speaker diarization.
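The audio-extraction step is just an ffmpeg pan filter. A sketch below; the channel weights are my own choice of "centre-weighted" for a 5.1 source, not a canonical recipe, and the file names are placeholders:

```python
import subprocess

def extract_centre_mono(video_path: str, out_path: str) -> None:
    """Downmix to mono, weighting the centre channel (where dialogue usually sits)."""
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", video_path,
            # Assumes a 5.1 source: favour the centre channel, keep some of L/R.
            # A stereo source would need different weights (e.g. 0.5*FL+0.5*FR).
            "-af", "pan=mono|c0=0.6*FC+0.2*FL+0.2*FR",
            "-vn", "-ac", "1", "-ar", "16000",
            out_path,
        ],
        check=True,
    )

extract_centre_mono("episode01.mkv", "episode01_mono.wav")
```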