image-to-image and speech-to-speech exist; yes, almost everything is text-to-something, but there are exceptions