Image diffusion models are already capable of this. There was a research paper, and I believe a model was released as well, which generates visual illusions where an image, when flipped, becomes something else.

Same idea here: the text would need to be diffused from two views at once until it looks the same in both but still matches the input. It might already exist.
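Rough sketch of what I mean by "two views", assuming the views are the upright image and its 180-degree rotation. This is one plausible way to combine them (average the per-view noise estimates at each denoising step), not necessarily how any particular project does it. `predict_noise` is a hypothetical stand-in for a real text-conditioned denoiser, and all the function names here are made up for illustration.

```python
import numpy as np

def identity(img):
    return img

def rot180(img):
    # Rotate a 2D image by 180 degrees; this view is its own inverse.
    return np.rot90(img, 2)

def combined_noise_estimate(x, predict_noise, views):
    """Average the noise estimates from several views of the same image.

    views is a list of (forward, inverse) pairs. Each estimate is computed
    on the transformed image, then mapped back to the upright frame so the
    estimates can be averaged pixel by pixel.
    """
    estimates = []
    for forward, inverse in views:
        eps = predict_noise(forward(x))   # hypothetical denoiser call
        estimates.append(inverse(eps))
    return np.mean(estimates, axis=0)

# The two views for an ambigram that reads the same upside down.
AMBIGRAM_VIEWS = [(identity, identity), (rot180, rot180)]
```

In a real sampler this combined estimate would replace the single noise prediction at every step, so the image is pulled toward something the model accepts as the input text in both orientations.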

Edit: https://diffusionillusions.com/

Edit: Ambigram generation using diffusion models: https://raymond-yeh.com/AmbiGen/