Does this use CLIP or something to get embeddings for each image and normal text embeddings for the text fields, and then feed the top N results to a VLM (LLM) to select the best answer(s)?
What's the advantage of this over using llamaindex?
Although, even asking that question, I'll be honest: the last time I used llamaindex, it seemed like everything had to be shoehorned in because using that library was a foregone conclusion, even though ChromaDB ended up doing just about all the work, since the built-in vector store llamaindex ships with has strangely bad performance at any scale.
I do like how simple the llamaindex DocumentStore (or whatever it's called) is: you can just point it at a directory. But it seems that when you use a specific vectordb you often can't do that.
I guess the other thing people do is put everything in postgres. Do people use pgvector to store image embeddings?
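(For concreteness, the kind of pgvector setup I mean; table names and the 512-dimension size are just illustrative, matching e.g. a CLIP ViT-B/32 encoder:)

```sql
-- Enable the extension (ships with the pgvector package)
CREATE EXTENSION IF NOT EXISTS vector;

-- One row per image; the embedding column holds the encoder output
CREATE TABLE images (
    id        bigserial PRIMARY KEY,
    path      text NOT NULL,
    embedding vector(512)
);

-- Approximate-nearest-neighbour index (HNSW, cosine distance)
CREATE INDEX ON images USING hnsw (embedding vector_cosine_ops);

-- Nearest neighbours to a query embedding parameter :q
SELECT path FROM images ORDER BY embedding <=> :q LIMIT 10;
```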
LlamaIndex relies heavily on RAG-style approaches, i.e., using items whose embedding vectors are close to the embedding vector of the question (what you describe). RAG-style approaches work great if the answer depends only on a small part of the data, e.g., if the right answer can be extracted from the top-N documents.
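In sketch form, the retrieval step looks like this (the `embed` function below is a toy stand-in for a real encoder such as CLIP, and the corpus is just a few strings):

```python
import math

def embed(text):
    # Toy stand-in for a real encoder (e.g., CLIP): a character-frequency
    # vector over a-z. A real embedding lives in a learned vector space.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_n(question, corpus, n=3):
    # RAG retrieval: rank items by similarity of their embedding
    # to the question's embedding, keep the closest n.
    q = embed(question)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:n]

# The top-n items would then go into the LLM/VLM prompt as context.
candidates = top_n("red car", ["a red car", "a blue boat", "beach at sunset"])
```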
It's less applicable if the answer cannot be extracted from a small data subset. E.g., you want to count the number of pictures showing red cars in your database (rather than retrieving a few pictures of red cars). Or, let's say you want to tag beach holiday pictures with all the people who appear in them. That's another scenario where you cannot easily work with RAG. ThalamusDB supports such scenarios, e.g., you could use the query below in ThalamusDB:
SELECT H.pic
FROM HolidayPictures H, ProfilePictures P
WHERE NLFILTER(H.pic, 'this is a picture of the beach')
  AND NLJOIN(H.pic, P.pic, 'the same person appears in both pictures');
ThalamusDB handles scenarios where the LLM has to look at large data sets and uses a few techniques to make that more efficient. E.g., see here (https://arxiv.org/abs/2510.08489) for the implementation of the semantic join algorithm.
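To make the cost model concrete: the naive version of a semantic join is a nested loop that invokes the model once per pair, which is exactly what the algorithm in the paper improves on. A toy sketch (not ThalamusDB's implementation; the tag sets fake what an LLM/VLM call would decide):

```python
def nl_join(left, right, predicate):
    """Naive semantic join: evaluate the predicate on every pair.

    In a real system `predicate` would be an LLM/VLM call, so the
    quadratic number of invocations is the dominant cost; the linked
    paper is about avoiding most of those calls.
    """
    return [(l, r) for l in left for r in right if predicate(l, r)]

# Stand-in for 'the same person appears in both pictures', faked
# here with person tags attached to each picture.
holiday = [("beach1.jpg", {"alice", "bob"}), ("beach2.jpg", {"carol"})]
profiles = [("alice.jpg", {"alice"}), ("dave.jpg", {"dave"})]

matches = nl_join(holiday, profiles, lambda h, p: bool(h[1] & p[1]))
```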
A few other things to consider:
1) ThalamusDB supports SQL with semantic operators. Lay users may prefer the natural language query interfaces offered by other frameworks. But people who are familiar with SQL might prefer writing SQL-style queries for maximum precision.
2) ThalamusDB offers various ways to restrict per-query processing overhead, e.g., time and token limits. If a limit is reached, it still returns a partial result (e.g., lower and upper bounds for query aggregates, or a subset of the result rows). Other frameworks do not return anything useful if query processing is interrupted before it completes.
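The bounds idea can be illustrated with a toy anytime count (again not ThalamusDB's actual code): every row the engine didn't get to before the budget ran out contributes 0 to the lower bound and 1 to the upper bound, so the true count is always inside the returned interval.

```python
def bounded_count(rows, predicate, budget):
    """Anytime count: stop after `budget` predicate evaluations and
    return (lower, upper) bounds on the true count."""
    lower = 0
    for i, row in enumerate(rows):
        if i >= budget:
            # Each unevaluated row may or may not match:
            # +0 to the lower bound, +1 to the upper bound.
            unprocessed = len(rows) - i
            return lower, lower + unprocessed
        if predicate(row):
            lower += 1
    return lower, lower  # fully processed: bounds coincide

cars = ["red", "blue", "red", "green", "red", "red"]
exact = bounded_count(cars, lambda c: c == "red", budget=10)    # (4, 4)
partial = bounded_count(cars, lambda c: c == "red", budget=3)   # (2, 5)
```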
We use a vector db (Qdrant) to store embeddings of images and text and built a search UI atop it.
Cool. And the other person implies that their queries can search across all rows if necessary? For example, if all images contain people and the question is which images show the same people. Or are you talking about a different project?
I think the previous post refers to a different project. But yes: ThalamusDB can process all rows if necessary, including matching all images that have the same persons in them.