Image diffusion models are already capable of this. There was a research paper, and I believe a model was released as well, which generates visual illusions where an image becomes something else when flipped.
Same idea here: the text needs to be diffused from two views (upright and rotated) until it looks the same from both but still matches the input. It might already exist.
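To make the "two views" idea concrete, here is a rough toy sketch, not the method from any particular paper. It assumes the usual trick of averaging the model's noise estimates across views at every step. `toy_noise_estimate`, `multi_view_step`, and `generate` are hypothetical names, and the "model" is just a stand-in that pulls the image toward a target glyph; a real diffusion model would go there.

```python
import numpy as np

def rot180(img):
    # 180-degree rotation; applying it twice gives back the original.
    return np.rot90(img, 2)

def toy_noise_estimate(x, target):
    # Hypothetical stand-in for a text-conditioned diffusion model:
    # it "predicts" the residual separating x from the target glyph.
    return x - target

def multi_view_step(x, target, step=0.2):
    # Each view is a (transform, inverse transform) pair.
    views = [(lambda a: a, lambda a: a), (rot180, rot180)]
    estimates = []
    for view, inverse in views:
        eps = toy_noise_estimate(view(x), target)
        # Map the estimate back to the upright frame before averaging.
        estimates.append(inverse(eps))
    return x - step * np.mean(estimates, axis=0)

def generate(target, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)  # start from pure noise
    for _ in range(steps):
        x = multi_view_step(x, target)
    return x

# Toy "glyph": an off-centre block on an 8x8 canvas.
target = np.zeros((8, 8))
target[1:4, 2:6] = 1.0
result = generate(target)
```

In this toy setup the result settles on a compromise between the glyph and its rotated copy, so it looks the same upright and upside down. With a real model the same averaging is what forces both views to satisfy the prompt.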
Edit: https://diffusionillusions.com/
Edit: Ambigram using Diffusion models https://raymond-yeh.com/AmbiGen/