There are good reasons to do this. Embedding similarity is _not_ a reliable way of determining relevance.
I did some measurements and found you can't even reliably tell whether two documents are "similar". Here: https://joecooper.me/blog/redundancy/
One common approach is to mix techniques: e.g. take a large top-K from ANN over embeddings as a preliminary shortlist, then run a tuned LLM or cross-encoder to score relevance.
I'll link these guys' paper here, which you might find fun: https://arxiv.org/pdf/2310.08319
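Here's roughly what that two-stage shape looks like, as a sketch (not from the paper; the sentence-transformers models and the brute-force shortlist are placeholder choices, and at scale you'd swap the shortlist for a real ANN index like FAISS or HNSW):

    from sentence_transformers import SentenceTransformer, CrossEncoder
    import numpy as np

    corpus = ["doc one ...", "doc two ...", "doc three ..."]  # your documents

    # Stage 1: cheap shortlist via embedding similarity.
    # Brute-force dot product here; use FAISS/HNSW for a large corpus.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_embs = embedder.encode(corpus, normalize_embeddings=True)

    def shortlist(query, k=50):
        q = embedder.encode([query], normalize_embeddings=True)[0]
        sims = doc_embs @ q  # cosine similarity, since vectors are normalized
        return np.argsort(-sims)[:k]

    # Stage 2: re-rank the shortlist with a cross-encoder, which reads the
    # query and document together and is a much sharper judge of relevance.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def retrieve(query, k=50, final_n=5):
        idx = shortlist(query, k)
        scores = reranker.predict([(query, corpus[i]) for i in idx])
        best = np.argsort(-scores)[:final_n]
        return [corpus[idx[i]] for i in best]

The expensive model only ever sees the k shortlisted candidates rather than the whole corpus, which is the whole point.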
At the end of the day you just want a way to shortlist and focus information that's computationally cheaper, and more reliable, than dumping your entire corpus into a very large context window.
So what we're doing is fitting the technique to the situation: RAM price, GPU price, dataset size, etc. The "ideal" setup will evolve as the cost structure and model quality evolve, and will always depend on your use case.
But for sure, ANN-on-embeddings as your whole RAG pipeline is a very blunt instrument, and if you can afford to do better you can usually think of a way.
The "redundacy" experiment is very interesting! Strongly agree, we just need to do something better than "dumping your entire corpus into a very large context window", maybe using this table-of-contents methods would be very useful?