I have two surprises for you:

1. Don't believe the RAG pundits. They have never implemented one.

I did, many times, and boy, are they hard. They have so many options, and those options decide between utterly crappy results and fantastic accuracy scores, all the way to a perfect 100% on facts.

In short: RAG is how you fill the context window. But then what?

2. How does a super-large context window solve your problem? Context windows aren't the problem; accurately matching the requirements is. What do you expect your inquiry to solve? Say you have the greatest context window ever, but then what? No prompt engineering is coming to save you if you don't know what you want.

RAG is, in very simple terms, a search engine. The context window was never the problem. Never. Filling the context window, that is, finding the relevant information, is one problem, but it is also only part of the solution.

What if your inquiry needs a combination of multiple sources to make sense? There is never a 1:1 match of information.

"How many cars from 1980 to 1985 and 1990 to 1997 had between 100 and 180PS without Diesel in the color blue that were approved for USA and Germany from Mercedes but only the E unit?"

Have fun, this is a simple request.
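
To see why, pull the question apart: it is not a semantic lookup at all, it is a count over a multi-constraint structured filter. Here is a minimal sketch of the decomposition in Python; the schema, the field names, and the sample rows are all hypothetical, made up only to show the shape of the problem:

```python
# Hypothetical structured decomposition of the car question.
# No top-k similarity search returns "a count of rows matching all of this".
filters = {
    "manufacturer": "Mercedes",
    "series": "E",  # assuming "the E unit" means the E series
    "year_ranges": [(1980, 1985), (1990, 1997)],
    "power_ps": (100, 180),
    "fuel_exclude": {"Diesel"},
    "color": "blue",
    "markets_required": {"USA", "Germany"},
}

car_database = [  # tiny made-up sample standing in for a real table
    {"manufacturer": "Mercedes", "series": "E", "year": 1992, "power_ps": 136,
     "fuel": "Petrol", "color": "blue", "approved_markets": ["USA", "Germany"]},
    {"manufacturer": "Mercedes", "series": "S", "year": 1983, "power_ps": 204,
     "fuel": "Diesel", "color": "silver", "approved_markets": ["Germany"]},
]

def matches(car: dict) -> bool:
    return (
        car["manufacturer"] == filters["manufacturer"]
        and car["series"] == filters["series"]
        and any(lo <= car["year"] <= hi for lo, hi in filters["year_ranges"])
        and filters["power_ps"][0] <= car["power_ps"] <= filters["power_ps"][1]
        and car["fuel"] not in filters["fuel_exclude"]
        and car["color"] == filters["color"]
        and filters["markets_required"] <= set(car["approved_markets"])
    )

print(sum(matches(car) for car in car_database))  # "how many" is an aggregate -> 1
```

Every one of those constraints has to be parsed out of the sentence and answered from structured data; a pile of retrieved chunks in the context window gives you none of that for free.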

> What if your inquiry needs a combination of multiple sources to make sense? There is never a 1:1 match of information.

I don't see the problem if you give the LLM the ability to generate multiple search queries at once. Even a simple vector search can return several results per query.
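
For the record, that pattern is easy to sketch. Nothing below is a real library API: `llm`, `embed`, and `index.search` are assumed interfaces standing in for whatever completion call and vector store you actually use.

```python
import json

def multi_query_retrieve(question, llm, embed, index, k=5):
    """Decompose one question into several search queries, run each
    against the vector index, and merge the hits, deduplicated."""
    prompt = (
        "Rewrite the question below as a JSON array of 3-5 short, "
        "self-contained search queries, one per sub-constraint.\n\n"
        f"Question: {question}"
    )
    queries = json.loads(llm(prompt))  # e.g. ["Mercedes E series 1990-1997", ...]
    seen, merged = set(), []
    for q in queries:
        for doc_id, score in index.search(embed(q), k=k):
            if doc_id not in seen:  # a chunk hit by several queries counts once
                seen.add(doc_id)
                merged.append((doc_id, score))
    return merged
```

Whether the merged chunks actually compose into an answer is, of course, exactly the grandparent's point.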

> "How many cars from 1980 to 1985 and 1990 to 1997 had between 100 and 180PS without Diesel in the color blue that were approved for USA and Germany from Mercedes but only the E unit?"

I'm a human, and I have a hard time parsing that query. Are you asking only about the Mercedes E-Class? And the number of cars, as in how many were sold?

It doesn't help that academia loooves ColBERT and will happily tell you how amazing it is -- and, look, for how tiny the models are, 20M params and super fast on a CPU, it genuinely is -- at seemingly everything, if only you:

- Chunk properly;

- Elide the "obviously useless files" that give mixed signals;

- Re-rank, and re-chunk whole files for the top-scoring matches;

- Throw in a little BM25, but with better stemming;

- Carry around a list of preferred files, and ideally also of terms, to help the re-ranking;

And so on. It works great when you're an academic benchmaxing your toy Master's project. Try building a scalable vector search that runs on any codebase, without knowing anything at all about it, and getting a decent signal out of it (the BM25 fusion step, at least, is sketched below).

Ha.
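
For what it's worth, the BM25-with-better-stemming part of that checklist fits in a screenful. A minimal sketch using the real `rank_bm25` package and NLTK's Snowball stemmer; `vector_scores` is assumed to come from whatever embedding search you already run, and the min-max fusion is one arbitrary choice among several:

```python
import numpy as np
from rank_bm25 import BM25Okapi                  # pip install rank-bm25
from nltk.stem.snowball import SnowballStemmer   # pip install nltk

stemmer = SnowballStemmer("english")

def tokenize(text: str) -> list[str]:
    # crude whitespace tokenizer plus the "better stemming" from the list
    return [stemmer.stem(t) for t in text.lower().split()]

def hybrid_scores(query: str, docs: list[str], vector_scores: np.ndarray,
                  alpha: float = 0.5) -> np.ndarray:
    """Fuse lexical BM25 scores with precomputed embedding scores.
    Both signals are min-max normalized so neither dominates by scale."""
    bm25 = BM25Okapi([tokenize(d) for d in docs])
    lexical = np.asarray(bm25.get_scores(tokenize(query)))

    def norm(x: np.ndarray) -> np.ndarray:
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    return alpha * norm(lexical) + (1 - alpha) * norm(vector_scores)
```

Ranking is then just an argsort over the fused scores, and the "preferred files and terms" trick is another additive term on top. None of which helps with the chunking, the eliding, or knowing the codebase, which is the hard part.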