> Embeddings are great at basic conceptual similarity, but in quality maximalist fields and use cases they fall apart very quickly.

This makes a lot of sense if you think about it. You want something as conceptually similar to the correct answer as possible. But with vector search, you are looking for something conceptually similar to some formulation of the question, which correlates loosely with the answer but is very much not the same thing.

There are ways to prepare data that get a closer approximation. For example, you can have an LLM generate, for each indexed block, the questions that block could answer, and index those instead. Then you are searching for material whose generated question is similar to the question being asked, which is a bit closer to what you want, but it's still an approximation.
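A minimal sketch of that question-indexing idea, with heavy assumptions: the "embedding" here is just a toy bag-of-words vector with cosine similarity (a real system would use a neural embedding model), and the generated questions are hard-coded where in practice an LLM would produce them.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(t.strip("?.,!") for t in text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Each indexed block stores questions it could answer (hypothetical data;
# in practice these would be LLM-generated at indexing time).
index = [
    {"text": "Mitochondria produce ATP via oxidative phosphorylation.",
     "questions": ["what organelle produces ATP",
                   "how is ATP made in the cell"]},
    {"text": "Photosynthesis converts sunlight into chemical energy.",
     "questions": ["where does photosynthesis happen",
                   "what converts sunlight to energy"]},
]

def search(query, index):
    q = embed(query)
    # Score each block by its best-matching generated question, so we
    # compare question-to-question rather than question-to-answer.
    return max(index,
               key=lambda b: max(cosine(q, embed(g))
                                 for g in b["questions"]))["text"]

print(search("what organelle produces ATP?", index))
```

The point is only the shape of the approach: similarity is computed against the generated questions, not against the answer text itself.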

But if you know ahead of time, from experience, which salient features of the dataset matter for the particular application, and can index those directly, it just makes sense that, while this is more labor-intensive than generalized vector search and may generalize less well outside that particular use case, it will often be more useful within the intended one.
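To make the contrast concrete, here is a sketch of indexing a salient feature directly. The domain is hypothetical (support documents keyed by error codes, with an illustrative code pattern): instead of embedding anything, we extract the feature we know matters and build an exact-match inverted index on it.

```python
import re

# Hypothetical corpus where the known-salient feature is an error code.
docs = [
    "Fix for E1042: restart the ingestion daemon and clear the queue.",
    "E2210 occurs when the TLS certificate has expired; renew it.",
    "General guidance on improving ingestion throughput.",
]

ERROR_CODE = re.compile(r"\bE\d{4}\b")

# Inverted index on the extracted feature, not on embeddings.
index = {}
for doc in docs:
    for code in ERROR_CODE.findall(doc):
        index.setdefault(code, []).append(doc)

def lookup(query):
    # Extract the same feature from the query and look it up exactly.
    return [d for c in ERROR_CODE.findall(query)
            for d in index.get(c, [])]

print(lookup("Users are reporting E1042 after last night's upgrade"))
```

This retrieves the E1042 document with none of the fuzziness of vector search, at the cost of hand-building the extractor and it being useless for queries that lack the feature.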