I'm using Whisper transcription via the Groq API to transcribe the files in parallel. One caveat: for this comparison I cut out the transcription step and had the models operate on a shared transcript folder, so the times you see are pure search-and-categorization times.
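
For the curious, the transcription step was roughly shaped like this (the folder names, worker count, and exact Whisper model here are illustrative, not a record of what I ran):

```python
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from groq import Groq  # pip install groq

client = Groq()  # reads GROQ_API_KEY from the environment

def transcribe(audio_path: Path) -> None:
    """Transcribe one audio file and write the text to the shared folder."""
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=(audio_path.name, f.read()),
            model="whisper-large-v3",  # illustrative model choice
        )
    out_dir = Path("transcripts")
    out_dir.mkdir(exist_ok=True)
    (out_dir / (audio_path.stem + ".txt")).write_text(result.text)

# Fan the files out across a thread pool; the API calls are I/O-bound.
audio_files = sorted(Path("episodes").glob("*.mp3"))
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(transcribe, audio_files))
```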

Re: your question about the approach – they all tackled the problem in different ways, which I found fascinating.

Codex Spark was so fast because it noticed that Bluey announces the episode name within the episode itself ("This episode of Bluey is called ____."). So, instead of doing a full transcript<->web-description match, it extracted just the announced titles from the transcripts and matched those against the episode descriptions.
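
That shortcut boils down to a single extraction pass. Here's my reconstruction of the idea (not its actual code):

```python
import re
from pathlib import Path

# Bluey announces its own title, e.g. "This episode of Bluey is called Keepy Uppy."
TITLE_RE = re.compile(r"this episode of bluey is called ([^.\n]+)", re.IGNORECASE)

def announced_title(transcript: str) -> str | None:
    """Pull the self-announced episode title out of a transcript, if present."""
    m = TITLE_RE.search(transcript)
    return m.group(1).strip() if m else None

for path in sorted(Path("transcripts").glob("*.txt")):
    print(path.stem, "->", announced_title(path.read_text()))
```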

The larger models were more careful and seemed to actually double-check their work by reading the full transcripts and matching them against the descriptions.

gpt-5.2 applied a level of care that wasn't wrong, just unnecessary.

Sonnet 4.5 (non-thinking) took the most frustrating approach. It tried to automate the pairing with a script that matched the extracted titles to the official titles via regex, and instead of just eyeballing the two lists and matching them manually, it relied purely on the script's logging as its eyes. When the script failed to match all 52 episodes perfectly, it went into a six-iteration loop of writing increasingly convoluted regexes until it reported 52 matches (some of which were wrong). It was frustrating behavior; I stopped the loop after four minutes.
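
For contrast, if you did want to script the pairing, plain fuzzy matching (difflib here; a sketch, not what any of the models wrote) gets most of the way without the regex spiral, and leaves the unmatched leftovers for a human to eyeball:

```python
import difflib

def match_titles(extracted: list[str], official: list[str]) -> dict[str, str | None]:
    """Map each extracted title to the closest official title, or None if nothing is close."""
    pairs = {}
    for title in extracted:
        close = difflib.get_close_matches(title, official, n=1, cutoff=0.6)
        pairs[title] = close[0] if close else None  # None -> review by hand
    return pairs

print(match_titles(["Keepy Uppy", "The Magic Xylophone"],
                   ["Keepy Uppy", "Magic Xylophone", "Hospital"]))
```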

In my mind, the "right way" was straightforward, but that wasn't borne out by how differently the LLMs behaved.