Any idea what factors play into latency in TTS models?

Mostly model size and input size. Models that use full self-attention are O(N^2) in the input length N, so latency grows faster than linearly on long inputs.
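
To make the O(N^2) part concrete, here's a rough sketch in plain Python. It assumes full (unmasked, non-sparse) self-attention, where every position is scored against every other position. The function names are made up for illustration and don't come from any particular TTS library.

```python
def attention_pair_count(seq_len: int) -> int:
    """Number of (query, key) pairs that full self-attention scores for seq_len positions."""
    return seq_len * seq_len


def naive_attention_scores(queries, keys):
    """Dot-product score for every (query, key) pair.

    Takes two lists of equal-length vectors and returns a
    len(queries) x len(keys) list of lists, so the work grows with N * N
    when both lists have N entries.
    """
    return [
        [sum(q_i * k_i for q_i, k_i in zip(q, k)) for k in keys]
        for q in queries
    ]


if __name__ == "__main__":
    # Doubling the input length quadruples the number of pairs to score.
    for n in (100, 200, 400):
        print(n, attention_pair_count(n))
```

So going from 100 to 200 input positions means 4x the attention scores, not 2x.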