There are good reasons to do this. Embedding similarity is _not_ a reliable method of determining relevance.

I did some measurements and found you can't even really tell if two documents are "similar" or not. Here: https://joecooper.me/blog/redundancy/

One common approach is to mix techniques: take a large top-K from ANN over embeddings as a preliminary shortlist, then run a tuned LLM or cross-encoder to score relevance on just that shortlist.
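
To make the two-stage idea concrete, here's a minimal sketch using the sentence-transformers library. The model checkpoints, corpus, and query are placeholders of my own choosing, and a real deployment would use a proper ANN index (FAISS, HNSW, etc.) instead of the brute-force dot product shown here:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

# Placeholder corpus; in practice this is your document chunks.
corpus = [
    "API keys can be rotated from the security settings page.",
    "Our office dog is named Biscuit and enjoys long naps.",
    "Billing invoices are generated on the first of each month.",
]

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")            # cheap embedding model
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # slower, more reliable scorer

# Stage 1: embedding similarity as a blunt but cheap shortlist
# (brute-force here for brevity; swap in an ANN index for a large corpus).
doc_emb = bi_encoder.encode(corpus, normalize_embeddings=True)
query = "how do I rotate my API keys?"
q_emb = bi_encoder.encode(query, normalize_embeddings=True)
shortlist = np.argsort(-(doc_emb @ q_emb))[:50]

# Stage 2: cross-encoder reranks only the shortlist, judging query/document pairs jointly.
pairs = [(query, corpus[i]) for i in shortlist]
rerank_scores = cross_encoder.predict(pairs)
top_docs = [corpus[i] for i in shortlist[np.argsort(-rerank_scores)][:5]]
print(top_docs)
```

The point is just the shape of the pipeline: the expensive, more trustworthy relevance judgment only ever sees the top-K, so its cost stays bounded regardless of corpus size.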

I'll also link a paper along these lines that you might find fun: https://arxiv.org/pdf/2310.08319

At the end of the day you just want a way to shortlist and focus information that's computationally cheaper and more reliable than dumping your entire corpus into a very large context window.

So what we're doing is fitting the technique to the situation: RAM prices, GPU prices, dataset size, etc. The "ideal" setup will evolve as the cost structure and model quality evolve, and it will always depend on your activity.

But for sure, ANN-on-embeddings as your whole RAG pipeline is a very blunt instrument, and if you can afford to do better you can usually think of a way.

The "redundacy" experiment is very interesting! Strongly agree, we just need to do something better than "dumping your entire corpus into a very large context window", maybe using this table-of-contents methods would be very useful?
