Whisk itself (https://labs.google/fx/tools/whisk) was released a few months ago, under the radar, as a demo for Imagen 3. It's actually fun to play with and surprisingly robust given its particular implementation.
It uses a prompt transmutation trick (the uploaded images are converted into textual descriptions; you can verify this by viewing the description of an uploaded image) plus the strength of Imagen 3's genuinely modern text encoder, which can adhere to those long transmuted descriptions for Subject/Scene/Style.
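Roughly, the pipeline looks like the sketch below. This is my guess at the mechanism, not Google's actual code: BLIP stands in for whatever captioner they use, and the final Imagen call is hypothetical.

    # Sketch of the "prompt transmutation" idea: caption each uploaded image,
    # then feed the combined captions to a text-to-image model.
    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    def transmute(path: str) -> str:
        """Convert an uploaded image into a textual description."""
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        out = captioner.generate(**inputs, max_new_tokens=60)
        return processor.decode(out[0], skip_special_tokens=True)

    subject = transmute("subject.png")
    scene = transmute("scene.png")
    style = transmute("style.png")
    prompt = f"{subject}, set in {scene}, rendered in the style of {style}"
    # prompt then goes to the text-to-image model (hypothetical call):
    # result = imagen.generate(prompt)

The nice part of this approach is that a good captioner plus a strong text encoder gets you most of the way to image conditioning without ever feeding pixels into the image model itself.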
> This tool isn’t available in your country yet
> Enter your email to be notified when it becomes available
(Submit)
> We can't collect your emails at the moment
GDPR ftw!
I'm not a lawyer, but I thought GDPR didn't prevent that. It adds a lot of restrictions on how they can use those emails and for how long, but it's not a complete ban on users explicitly sharing their email addresses.
If you read it very carefully, and then behave very carefully, you can comply with the law. Orrrrrr you can just not bother for your first pass: block the EU for now and release it there later, once you've cleaned things up.
Yup, that's what's been happening with many models: US first, and Europe a few months later, once they've double-checked everything and made sure the paperwork is in order.
Easily circumvented with a VPN though; most just limit by location, not by account data.
GDPR may not prevent it explicitly, but it has a chilling effect on many businesses around the world, small and large, which often results in longer launch delays to covered countries while armies of lawyers double- and triple-check everything in fear of large fines.
Why text? Why not encode the image into some latent-space representation, so that it can survive a round trip more or less faithfully?
Because Imagen 3 is a text-to-image model, not an image-to-image model, the inputs have to be some form of text. Multimodal models such as 4o image generation or Gemini 2.0, which can take both text and image inputs, do encode image inputs into a latent space through a Vision Transformer, but not reversibly or losslessly.
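For the curious, here's what that one-way encoding looks like with an open CLIP ViT (the actual encoders inside 4o or Gemini aren't public; this just illustrates the same idea):

    # A Vision Transformer maps pixels to a sequence of embedding vectors.
    # The mapping is lossy and one-way; there is no decoder back to pixels.
    import torch
    from PIL import Image
    from transformers import CLIPImageProcessor, CLIPVisionModel

    processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
    vit = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("input.png").convert("RGB")
    pixels = processor(images=image, return_tensors="pt").pixel_values

    with torch.no_grad():
        tokens = vit(pixel_values=pixels).last_hidden_state  # shape (1, 50, 768)
    # 50 patch tokens of 768 floats: far less information than the original
    # image, and nothing in this model maps them back to pixels.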
Generative models, particularly diffusion models like Imagen 3, are typically architected to support several input vectors into the model's latent space. It isn't open source, so there might be an architectural reason I can't see, but I don't think the public interface should be taken as evidence of the model's capabilities -- it is uncommon for image-to-image not to be supported in open-source image generation models, for example. However, there are definite legal reasons not to provide such a vector in a public-facing model like Imagen 3.
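For comparison, here's what that image-to-image vector looks like in an open-source model (Stable Diffusion via diffusers; Imagen 3's internals are not public, so this is illustrative only). The init image is VAE-encoded into latent space, partially re-noised, then denoised under the text prompt:

    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    init = Image.open("photo.png").convert("RGB").resize((512, 512))
    result = pipe(
        prompt="the same scene as a watercolor painting",
        image=init,
        strength=0.6,        # how much of the original latent gets re-noised
        guidance_scale=7.5,
    ).images[0]
    result.save("img2img.png")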
And Gemini gave the Yes-man treatment to my statement here :D "In summary: Your assessment aligns well with the technical realities of diffusion models and the practical, legal, and safety considerations large companies face when deploying powerful generative AI tools publicly. It's entirely feasible that Imagen 3's underlying architecture could support image inputs, but Google has chosen not to expose this capability publicly due to the associated risks and complexities."
There’s a thing called CLIP Vision that sort of does that, but it converts the image into conditioning space (the same space as the embeddings from a text prompt). I’d say it works… OK.
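Concretely, with the openly available CLIP weights (a minimal sketch; the file name and caption are placeholders), image and text land in the same embedding space, so you can compare them directly or feed the image embedding wherever text conditioning would normally go:

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("cat.png").convert("RGB")
    inputs = processor(text=["a photo of a cat"], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Both embeddings live in the same conditioning space:
    sim = torch.cosine_similarity(out.image_embeds, out.text_embeds)
    print(sim.item())  # high if the caption matches the image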
They don't want you to modify images you supply yourself.
Text might honestly be the best latent space representation.
A word tells a thousand pictures.
Seems to require a paid subscription to actually use it all the way through.