I work in this field, so I can answer.

Embeddings are great at basic conceptual similarity, but in quality maximalist fields and use cases they fall apart very quickly.

For example:

"I want you to find inconsistencies across N documents." An embedding has no concept of an inconsistency. A textual summary, or stuffing entire documents into context, can help with this.

"What was John's opinion on the European economy in 2025?" It will find a similarity to things involving the European economy, including lots of docs from 2024, 2023, etc. And because of chunking strategies, and because embeddings are heavily compressed representations of the data, you will absolutely get chunks from various documents that are not limited to 2025.
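The usual workaround is to keep the year as structured metadata and filter on it before ranking by similarity, since cosine similarity alone cannot enforce "2025 only". A minimal sketch, with made-up names (`Chunk`, `search`) and toy vectors standing in for real embeddings:

```python
# Hypothetical sketch: combine a hard metadata filter with vector ranking.
# `Chunk`, `search`, and the toy 2-d vectors are assumptions for illustration,
# not any particular library's API.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    doc_id: str
    year: int
    vector: list[float]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def search(chunks: list[Chunk], query_vec: list[float],
           year: int, k: int = 5) -> list[Chunk]:
    # Filter on the structured field first, then rank the survivors
    # by similarity. Pure vector search skips the first step and so
    # happily returns 2023/2024 chunks for a 2025 question.
    candidates = [c for c in chunks if c.year == year]
    return sorted(candidates,
                  key=lambda c: cosine(c.vector, query_vec),
                  reverse=True)[:k]
```

The point is only that the date constraint lives outside the embedding; the vector ranking runs on a pre-filtered candidate set.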

"Where are Sarah or John directly quoted in this folder full of legal documents?" Sarah and John might be referenced across many documents, but finding where they are directly quoted is nearly impossible even in a high-dimensional vector space.

Embeddings are awesome, and great for some things like product catalog lookups and other fun stuff, but for many industries the mathematical cosine-similarity approach is just not effective.

> Embeddings are great at basic conceptual similarity, but in quality maximalist fields and use cases they fall apart very quickly.

This makes a lot of sense if you think about it. You want something as conceptually similar to the correct answer as possible. But with vector search, you are looking for something conceptually similar to some formulation of the question, which has some loose correlation, but is very much not the same thing.

There are ways you can prepare the data to get a closer approximation (e.g. you can have an LLM formulate, for each indexed block, questions that block could answer, and index those; then you'll be searching for material that answers a question similar to the question being asked, which is a bit closer to what you want), but it's still an approximation.
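The shape of that preparation step can be sketched in a few lines. Here `generate_questions` stands in for the LLM call and the bag-of-words `embed` for a real embedding model; both are toy assumptions, only the indexing structure matters:

```python
# Toy sketch of question indexing: embed LLM-generated questions and map
# each back to its source block, so search compares question-to-question.
# The bag-of-words `embed` is a stand-in for a real embedding model.
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def build_index(blocks: dict[str, list[str]]) -> list[tuple[Counter, str]]:
    # blocks: block_id -> questions an LLM generated for that block.
    return [(embed(q), block_id)
            for block_id, questions in blocks.items()
            for q in questions]

def retrieve(index: list[tuple[Counter, str]], query: str) -> str:
    # Nearest generated question wins; return its source block.
    qv = embed(query)
    return max(index, key=lambda pair: cosine(pair[0], qv))[1]
```

Searching question-against-question lands closer to the target than question-against-passage, but as the comment above says, it is still an approximation.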

But if you know ahead of time, from experience, which salient features of the dataset are useful for the particular application, and can index those directly, it just makes sense that while this will be more labor intensive than generalized vector search, and may generalize less well outside that particular use case, it will also be more useful in the intended use case in many places.

Yes, sure, vector similarity has limits, but does this address PageIndex's approach to those limits? I mean, beyond the approach of "add structure with recursive LLM API calls, then show the LLM that structure to search". I don't see where PageIndex is doing more than this.