Why text? Why not encode the image into some latent space representation, so that it can survive a round trip more or less faithfully?

Because Imagen 3 is a text-to-image model, not an image-to-image model: the inputs have to be some form of text. Multimodal models such as 4o image generation or Gemini 2.0, which can take both text and image inputs, do encode image inputs into a latent space through a Vision Transformer, but not reversibly or losslessly.
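To make the lossiness concrete, here's a back-of-the-envelope sketch. The numbers are illustrative assumptions loosely based on a CLIP-style ViT with a pooled image embedding, not the internals of Imagen 3 or Gemini:

```python
# Back-of-the-envelope: why a ViT-style image encoding is not a faithful
# round-trip. Figures are illustrative (roughly CLIP ViT-L/14 with a
# pooled embedding), NOT the internals of Imagen 3 or Gemini.

image_size = 224                              # input resolution per side
channels = 3
pixel_values = image_size ** 2 * channels     # 150,528 raw numbers in

pooled_embed_dim = 768                        # one pooled image embedding out

print(f"pixels in:       {pixel_values:,}")
print(f"embedding out:   {pooled_embed_dim:,}")
print(f"compression:     ~{pixel_values // pooled_embed_dim}x")
# ~196x fewer numbers: many different images collapse onto (nearly) the
# same embedding, and there is no decoder that recovers the original pixels.
```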

Generative models, and diffusion models like Imagen 3 in particular, are typically easy to architect with several routes into the model's latent space. It isn't open source, so there might be an architectural reason I can't see, but I don't think the public interface to the model necessarily reflects its capabilities; it's uncommon for open-source image generation models not to support image-to-image, for example. However, there are definite legal reasons not to expose such a route in a public-facing model like Imagen 3.
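For what it's worth, here is roughly what image-to-image looks like with an open-source diffusion model via the diffusers library. The checkpoint ID and strength value are just illustrative, and this says nothing about how Imagen 3 is actually wired up:

```python
# Sketch: image-to-image with an open-source diffusion model (diffusers).
# The checkpoint and parameters are illustrative; this only demonstrates
# the commonly supported img2img pattern, not Imagen 3's internals.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # illustrative model ID
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("photo.png").convert("RGB").resize((512, 512))

# The input image is encoded into the model's latent space, partially
# noised (controlled by `strength`), then denoised under the text prompt.
result = pipe(
    prompt="the same scene at sunset, oil painting",
    image=init_image,
    strength=0.6,        # 0 = keep the original, 1 = ignore it entirely
    guidance_scale=7.5,
).images[0]

result.save("out.png")
```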

And Gemini gave the Yes-man treatment to my statement here :D "In summary: Your assessment aligns well with the technical realities of diffusion models and the practical, legal, and safety considerations large companies face when deploying powerful generative AI tools publicly. It's entirely feasible that Imagen 3's underlying architecture could support image inputs, but Google has chosen not to expose this capability publicly due to the associated risks and complexities."

There’s a thing called CLIP Vision (the image-encoder half of CLIP) that sort of does that, but it converts the image into conditioning space, the same space as the embeddings from a text prompt. I’d say it works… OK.
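A minimal sketch of that shared space using the open CLIP weights on Hugging Face (model ID illustrative): the image and the text both land in the same embedding space, so you can compare them directly, but you can't decode the image embedding back into pixels.

```python
# Sketch: CLIP maps both images and text into one shared embedding space.
# Model ID is illustrative; this shows the conditioning-space idea only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.png").convert("RGB")
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# Both embeddings live in the same space, so cosine similarity is meaningful.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)   # similarity of the image to each caption
```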

They don't want you to modify images you supply yourself.

Text might honestly be the best latent space representation.

A word tells a thousand pictures.