Why not use an LLM with the speech-to-text output?