Hacker News

There’s a thing called CLIP Vision that sort of does that, but it converts the image into conditioning space (the same space as the embeddings from a text prompt). I’d say it works… OK.
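
To make "same space" a bit more concrete, here's a toy sketch. It is not CLIP itself: the random projection matrices stand in for learned weights, and every name and dimension in it (`make_projection`, `project`, `SHARED_DIM`, the 16- and 12-wide feature vectors) is made up for illustration. The only point is that image features and text features of different sizes both end up as vectors of the same length, so anything downstream that expects a text embedding could in principle be handed an image embedding instead.

```python
import math
import random

SHARED_DIM = 8  # hypothetical size of the shared "conditioning" space


def make_projection(in_dim, out_dim, seed):
    """Random in_dim x out_dim matrix; a stand-in for learned projection weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(out_dim)] for _ in range(in_dim)]


def project(features, matrix):
    """Linearly project a feature vector into the shared space and L2-normalize it."""
    out_dim = len(matrix[0])
    out = [sum(f * row[j] for f, row in zip(features, matrix)) for j in range(out_dim)]
    norm = math.sqrt(sum(x * x for x in out)) or 1.0
    return [x / norm for x in out]


def cosine(a, b):
    """Cosine similarity of two already-normalized vectors (just a dot product)."""
    return sum(x * y for x, y in zip(a, b))


# Pretend encoder outputs: the image side and the text side have different widths.
rng = random.Random(42)
image_features = [rng.random() for _ in range(16)]
text_features = [rng.random() for _ in range(12)]

# Each modality gets its own projection into the same SHARED_DIM-wide space.
image_embedding = project(image_features, make_projection(16, SHARED_DIM, seed=0))
text_embedding = project(text_features, make_projection(12, SHARED_DIM, seed=1))

# Same length, so the two are directly comparable and interchangeable as inputs.
similarity = cosine(image_embedding, text_embedding)
print(len(image_embedding), len(text_embedding), similarity)
```

With random weights the similarity number means nothing; in a trained model the two projections are learned so that matching images and captions land close together, which is presumably why feeding an image embedding in as conditioning works at all.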