I've found this hybrid approach pretty good for the majority of use cases: BM25 (or maybe SPLADE if you want a blend of BOW/keyword and learned sparse) + vectors + RRF + re-rank works pretty damn well.
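For anyone who hasn't wired this up before, here's a minimal sketch of the RRF fusion step in Python. The `bm25_ranked` and `vector_ranked` lists are just illustrative doc-ID lists, assumed to come back best-first from each retriever; the k=60 constant is the commonly used RRF default, not anything special to my setup:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse multiple best-first ranked lists with RRF: score = sum(1 / (k + rank))."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranked = ["doc3", "doc1", "doc7"]    # keyword/BOW hits, best first
vector_ranked = ["doc1", "doc9", "doc3"]  # dense-embedding hits, best first
fused = rrf_fuse([bm25_ranked, vector_ranked])
print(fused)  # head of the list blends both signals, ready for the re-rank stage
```

The nice thing about RRF is that it only looks at ranks, so you never have to normalize BM25 scores against cosine similarities.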
The trick that has elevated RAG, at least for my use cases, has been keeping multiple representations of each document, as well as sending multiple permutations of the input query. Do as much as you can inside the vector DB for speed: I'll sometimes fire 10-11 different batched calls at our vector DB, and they come back lightning quick. I'm also careful about which payload fields I actually pull back, so that if I do use an LLM to re-rank at the end, I'm not blowing up the context (sketch below).
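Roughly, the batched multi-query part looks like this. Note that `db.search()`, `hit.id`, and `hit.payload` are a hypothetical async vector-DB client interface standing in for whatever your DB exposes (Qdrant, Weaviate, etc.), and the 500-char snippet cap is just one way to keep the re-rank prompt lean:

```python
import asyncio

async def multi_query_search(db, query_variants, top_k=20):
    # Fire every query permutation concurrently in one batch;
    # db.search() is a hypothetical async client call.
    batches = await asyncio.gather(
        *(db.search(q, limit=top_k) for q in query_variants)
    )
    # Dedupe across batches, keeping the first (best-ranked) hit per doc.
    seen, merged = set(), []
    for hit in (h for batch in batches for h in batch):
        if hit.id not in seen:
            seen.add(hit.id)
            merged.append(hit)
    # Pull only lean payload fields so the LLM re-rank context stays small.
    return [
        {"id": h.id, "title": h.payload["title"], "snippet": h.payload["text"][:500]}
        for h in merged
    ]
```

The dedupe matters more than it looks: with 10+ query permutations, the same top documents show up repeatedly, and you don't want to pay for them twice in the re-rank prompt.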
TL;DR: Yes, you actually do have to put in significant work to build an efficient RAG pipeline, but that's fine and should probably be expected. And I don't think we're in a world yet where we can just assume that large context windows are viable for really precise work, or that the cost of those context windows will drop to zero anytime soon.