Agentic retrieval is really more a form of deep research (from a product standpoint there is very little difference). The key is that LLMs > rerankers, at least when you're not at webscale where the cost differential is prohibitive.
LLMs > rerankers. Yes! I don't like rerankers. They're slow, their context window is small (4096 tokens), and they're expensive... It's better when the LLM reads the whole file rather than a few top_chunks.
Rerankers are orders of magnitude faster and cheaper than LLMs. Typical out-of-the-box latency on a decent-sized cross-encoder (~4B) will be under 50ms on cheap GPUs like an A10G. You won't be able to run a fancy LLM on that hardware, and without tuning you're looking at hundreds of ms minimum.
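For concreteness, here's a minimal sketch of what cross-encoder reranking looks like with sentence-transformers. The model is a small MiniLM cross-encoder (much smaller than the ~4B one mentioned above) and the documents are made up, so treat it as an illustration rather than a benchmark.

```python
# Minimal cross-encoder reranking sketch with sentence-transformers.
# Model name, query, and documents are illustrative placeholders.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)

query = "how do I rotate api keys?"
docs = [
    "Rotating API keys: revoke the old key, then issue a new one from the console.",
    "Our office hours are Monday to Friday, 9am to 5pm.",
    "API rate limits are enforced per key and reset every minute.",
]

# Score each (query, document) pair in one batched forward pass.
scores = model.predict([(query, d) for d in docs])

# Sort documents by relevance score, highest first.
ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
for doc, score in ranked:
    print(f"{score:.3f}  {doc}")
```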
More importantly, it’s a lot easier to fine tune a reranker on behavior data than an LLM that makes dozens of irrelevant queries.
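As a rough illustration of what fine-tuning on behavior data can mean, here is a hedged sketch using the classic sentence-transformers CrossEncoder.fit loop with clicked vs. skipped results as labels. The model name, pairs, and hyperparameters are placeholders, and newer sentence-transformers releases also ship a separate CrossEncoderTrainer API.

```python
# Sketch: fine-tune a cross-encoder reranker on click/behavior data.
# All training pairs below are made-up placeholders.
from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

# label 1.0 = result the user clicked/engaged with, 0.0 = shown but skipped
train_samples = [
    InputExample(texts=["reset password", "How to reset your password"], label=1.0),
    InputExample(texts=["reset password", "Billing FAQ"], label=0.0),
]

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1)
loader = DataLoader(train_samples, shuffle=True, batch_size=16)

model.fit(train_dataloader=loader, epochs=1, warmup_steps=10)
model.save("reranker-behavior-tuned")
```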
This is worth emphasizing. At scale, and when you have the resources to really screw around with them to tune your pipeline, rerankers aren't bad; they're just much worse/harder to use out of the box. LLMs buy you easy robustness, baseline quality, and capabilities in exchange for cost and latency, which is a good tradeoff until you have strong PMF and you're trying to increase margins.
More than that, adding longer context isn't free in either time or money. So filling an LLM context with k=100 documents of mixed relevance may be slower than reranking and filling it with k=10 of high relevance.
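A toy sketch of that tradeoff: rerank 100 first-stage candidates down to 10 before building the prompt, instead of stuffing all 100 into the context. The scorer below is a word-overlap stand-in for a real cross-encoder, and the candidate documents are fabricated; the point is only the shape of the pipeline.

```python
# Sketch: rerank-then-prompt vs. dumping all candidates into the LLM context.
# `rerank` here is a toy stand-in scorer, not a real cross-encoder.

def rerank(query: str, docs: list[str], top_k: int = 10) -> list[str]:
    # Stand-in relevance score: word overlap with the query.
    scored = sorted(
        docs,
        key=lambda d: len(set(query.split()) & set(d.split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(query: str, docs: list[str]) -> str:
    context = "\n\n".join(f"[{i}] {d}" for i, d in enumerate(docs))
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"

candidates = [f"document {i} ..." for i in range(100)]  # k=100 from first-stage retrieval
query = "why might reranking be faster end to end?"

top_docs = rerank(query, candidates, top_k=10)           # keep k=10 of high relevance
prompt = build_prompt(query, top_docs)

# The prompt is now roughly 10x smaller, so prefill time and token cost drop,
# even after paying for the reranking pass.
print(len(build_prompt(query, candidates)), "->", len(prompt), "chars of context")
```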
Of course, the devil is in the details, and there are five dozen reasons why you might choose one approach over the other. But it is not clear that using a reranker is always slower.