Whisk itself (https://labs.google/fx/tools/whisk) was released a few months ago, under the radar, as a demo for Imagen 3. It's actually fun to play with and surprisingly robust given its particular implementation.
It uses a prompt transmutation trick (the uploaded images are converted into textual descriptions; you can verify this by viewing the description of an uploaded image) plus the strength of Imagen 3's genuinely modern text encoder, which can adhere to those long transmuted descriptions for Subject/Scene/Style.
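Roughly, the pipeline looks like the sketch below. This is my guess at the mechanism, not Google's actual code: BLIP stands in for whatever captioner they use, and the final Imagen call is hypothetical.

    # Sketch of the "prompt transmutation" idea: caption each uploaded image,
    # then feed the combined captions to a text-to-image model.
    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    def transmute(path: str) -> str:
        """Convert an uploaded image into a textual description."""
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        out = captioner.generate(**inputs, max_new_tokens=60)
        return processor.decode(out[0], skip_special_tokens=True)

    subject = transmute("subject.png")
    scene = transmute("scene.png")
    style = transmute("style.png")
    prompt = f"{subject}, set in {scene}, rendered in the style of {style}"
    # prompt then goes to the text-to-image model (hypothetical call):
    # result = imagen.generate(prompt)

The nice part of this approach is that a good captioner plus a strong text encoder gets you most of the way to image conditioning without ever feeding pixels into the image model itself.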
> This tool isn’t available in your country yet
> Enter your email to be notified when it becomes available
(Submit)
> We can't collect your emails at the moment
GDPR ftw!
I'm not a lawyer, but I thought GDPR didn't prevent that. It adds a lot of restrictions on how they can use those emails and for how long, but it's not a complete ban on users explicitly sharing their email addresses.
If you read it very carefully, and then behave very carefully, you can comply with the law. Orrrrrr you can just not bother for your first pass: block the EU for now and release it there later, once you've cleaned things up.
Yup, that's what's been happening with many models: US first, and Europe a few months later, once they've double-checked everything and made sure the paperwork is in order.
Easily circumvented with a VPN though; most just limit by location, not by account data.
GDPR may not prevent it explicitly, but it has a chilling effect on many businesses around the world, small and large, which often results in longer launch delays to covered countries while armies of lawyers double- and triple-check everything in fear of large fines.
Why text? Why not encode the image into some latent-space representation, so that it can survive a round trip more or less faithfully?
Because Imagen 3 is a text-to-image model, not an image-to-image model, the inputs have to be some form of text. Multimodal models such as 4o image generation or Gemini 2.0, which can take both text and image inputs, do encode image inputs into a latent space through a Vision Transformer, but not reversibly or losslessly.
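For the curious, here's what that one-way encoding looks like with an open CLIP ViT (the actual encoders inside 4o or Gemini aren't public; this just illustrates the same idea):

    # A Vision Transformer maps pixels to a sequence of embedding vectors.
    # The mapping is lossy and one-way; there is no decoder back to pixels.
    import torch
    from PIL import Image
    from transformers import CLIPImageProcessor, CLIPVisionModel

    processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
    vit = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("input.png").convert("RGB")
    pixels = processor(images=image, return_tensors="pt").pixel_values

    with torch.no_grad():
        tokens = vit(pixel_values=pixels).last_hidden_state  # shape (1, 50, 768)
    # 50 patch tokens of 768 floats: far less information than the original
    # image, and nothing in this model maps them back to pixels.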
Generative models, particularly diffusion models like Imagen 3, are typically architected to support several input vectors into the model's latent space. It isn't open source, so there might be an architectural reason I can't see, but I don't think the public interface should be taken as evidence of the model's capabilities -- it is uncommon for image-to-image not to be supported in open-source image generation models, for example. However, there are definite legal reasons not to provide such a vector in a public-facing model like Imagen 3.
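For comparison, here's what that image-to-image vector looks like in an open-source model (Stable Diffusion via diffusers; Imagen 3's internals are not public, so this is illustrative only). The init image is VAE-encoded into latent space, partially re-noised, then denoised under the text prompt:

    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    init = Image.open("photo.png").convert("RGB").resize((512, 512))
    result = pipe(
        prompt="the same scene as a watercolor painting",
        image=init,
        strength=0.6,        # how much of the original latent gets re-noised
        guidance_scale=7.5,
    ).images[0]
    result.save("img2img.png")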
And Gemini gave the Yes-man treatment to my statement here :D "In summary: Your assessment aligns well with the technical realities of diffusion models and the practical, legal, and safety considerations large companies face when deploying powerful generative AI tools publicly. It's entirely feasible that Imagen 3's underlying architecture could support image inputs, but Google has chosen not to expose this capability publicly due to the associated risks and complexities."
There’s a thing called CLIP Vision that sort of does that, but it converts the image into conditioning space (the same space as the embeddings from a text prompt). I’d say it works… OK.
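Concretely, with the openly available CLIP weights (a minimal sketch; the file name and caption are placeholders), image and text land in the same embedding space, so you can compare them directly or feed the image embedding wherever text conditioning would normally go:

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("cat.png").convert("RGB")
    inputs = processor(text=["a photo of a cat"], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Both embeddings live in the same conditioning space:
    sim = torch.cosine_similarity(out.image_embeds, out.text_embeds)
    print(sim.item())  # high if the caption matches the image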
They don't want you to modify images you supply yourself.
Text might honestly be the best latent space representation.
A word tells a thousand pictures.
Seems to require a paid subscription to actually use it all the way through.