What you're looking for is text-to-speech (TTS). Qwen has a dedicated model and library for it: Qwen3-TTS [1].
[1]: https://github.com/QwenLM/Qwen3-TTS