^Yep, unfortunately the best option right now seems to be piping the output into another LLM for cleanup, which we try to help you do in Whispering. Recent transcription models don't have very good built-in inference/cleanup, with Whisper offering only the very weak "prompt" parameter. This seems to be by design, to keep these models lean, specialized, and performant at their task.
By "try to help", do you mean that it currently does so, or that the functionality is on the way?
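For what it's worth, the cleanup pass described above can be sketched as a second request to a chat model. This is a minimal illustration, not Whispering's actual implementation; the prompt wording and model name are assumptions:

```python
# Sketch of piping a raw transcript through an LLM for cleanup.
# Prompt wording and model name are illustrative assumptions,
# not taken from Whispering.

def build_cleanup_messages(raw_transcript: str) -> list[dict]:
    """Construct a chat request asking an LLM to clean up a transcript."""
    system = (
        "You fix transcription errors: punctuation, casing, and filler "
        "words. Return only the corrected text."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": raw_transcript},
    ]


def cleanup(raw_transcript: str, client, model: str = "gpt-4o-mini") -> str:
    """Send the transcript to a chat-completions endpoint
    (e.g. an openai.OpenAI client) and return the cleaned text."""
    resp = client.chat.completions.create(
        model=model,
        messages=build_cleanup_messages(raw_transcript),
    )
    return resp.choices[0].message.content
```

The key point is that the transcription model and the cleanup model stay separate, so either can be swapped independently.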