Not OP, but CLIP from OpenAI (2021) seems pretty standard and gives great results, at least in English (not so good in rarer languages).
Essentially, CLIP lets you encode both text and images in the same vector space.
It is really easy and pretty fast to generate embeddings too. It took less than an hour on Google Colab.
I made a quick and dirty Flask app that lets me query my own collection of pictures and returns the most relevant ones, ranked by cosine similarity between the query embedding and each image embedding.
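The ranking step can be sketched roughly like this. It is a minimal illustration, not the actual app code: it assumes the embeddings have already been computed and stored as plain lists of floats, and the file names and 3-dimensional vectors are made up (real CLIP embeddings have hundreds of dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, image_vecs, k=3):
    """Return the k (name, score) pairs most similar to the query."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in image_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy data standing in for precomputed CLIP image embeddings.
image_vecs = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city_night.jpg": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a text query such as "sunny day at the sea".
query_vec = [0.8, 0.2, 0.1]

results = top_k(query_vec, image_vecs, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

In a real setup the text query and the images would both go through the same CLIP model, so their vectors are directly comparable. For a larger collection, a vectorized library or a vector index would likely be faster than this pure-Python loop.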
You can query pretty much anything with CLIP (metaphors, lighting, objects, time, location, etc.).
From what I understand, many photo apps offer CLIP embedding search these days, including Immich - https://meichthys.github.io/foss_photo_libraries/
Alternatives could be something like BLIP.