Why RAG at all?
We concatenated all our docs and tutorials into a text file, piped it all into the AI right along with the question, and the answers are pretty great. Cost was, last I checked, roughly 50 cents per question, and it probably scales linearly with how much documentation you have. That feels expensive, but compared to a human writing an answer it's peanuts. Plus (assuming the customer can choose between the AI and a human), it's a great customer experience because the answer is there that much faster.
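For concreteness, a minimal sketch of that stuff-everything setup, assuming the Anthropic Python SDK, a docs/ folder of markdown files, and a placeholder model id (all illustrative, not necessarily what the poster actually runs):

    from pathlib import Path
    import anthropic

    # Concatenate every doc and tutorial into one big string.
    all_docs = "\n\n".join(p.read_text() for p in Path("docs").rglob("*.md"))

    client = anthropic.Anthropic()

    def answer(question: str) -> str:
        resp = client.messages.create(
            model="claude-3-5-sonnet-latest",  # placeholder model id
            max_tokens=1024,
            system="Answer support questions using only these docs:\n\n" + all_docs,
            messages=[{"role": "user", "content": question}],
        )
        return resp.content[0].text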
I feel like this is a no-brainer. Tbh with the context windows we have these days, I don't completely understand why RAG is a thing anymore for support tools.
This works as long as your docs are below the max context size (and even then, as you approach larger context sizes, quality degrades).
Re cost though, you can usually reduce it significantly with context caching here.
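For example, with Anthropic's prompt caching you mark the big static docs prefix as cacheable, so follow-up questions reuse it at a reduced per-token rate; a sketch along the lines of the snippet above (model id is still a placeholder):

    import anthropic

    client = anthropic.Anthropic()

    def answer_cached(question: str, all_docs: str) -> str:
        resp = client.messages.create(
            model="claude-3-5-sonnet-latest",  # placeholder model id
            max_tokens=1024,
            system=[{
                "type": "text",
                "text": "Answer support questions using only these docs:\n\n" + all_docs,
                # Cache the huge static prefix; subsequent questions hit the
                # cache instead of paying full input-token price every time.
                "cache_control": {"type": "ephemeral"},
            }],
            messages=[{"role": "user", "content": question}],
        )
        return resp.content[0].text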
However, in general, I’ve been positively surprised by how effective Claude Code is at grep’ing through huge codebases.
So my go-to would probably be a Claude Code-like agent in a loop, with a grep tool over your docs, and a system prompt containing a brief overview of your product plus short summaries of each docs page.
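Roughly like this; a sketch of that agent-in-a-loop idea using Anthropic's tool-use API, where the docs/ path, model id, and the product overview / page summaries are stand-ins:

    import subprocess
    import anthropic

    client = anthropic.Anthropic()

    TOOLS = [{
        "name": "grep_docs",
        "description": "Run grep -rniE over docs/ and return matching lines.",
        "input_schema": {
            "type": "object",
            "properties": {"pattern": {"type": "string"}},
            "required": ["pattern"],
        },
    }]

    SYSTEM = (
        "You answer support questions about <brief product overview here>.\n"
        "Docs pages: <one-line summary per page here>.\n"
        "Use grep_docs to find the relevant sections before answering."
    )

    def grep_docs(pattern: str) -> str:
        out = subprocess.run(["grep", "-rniE", pattern, "docs/"],
                             capture_output=True, text=True)
        return out.stdout[:10_000] or "no matches"  # cap the result size

    def answer(question: str) -> str:
        messages = [{"role": "user", "content": question}]
        while True:
            resp = client.messages.create(
                model="claude-3-5-sonnet-latest",  # placeholder model id
                max_tokens=1024, system=SYSTEM, tools=TOOLS, messages=messages,
            )
            if resp.stop_reason != "tool_use":
                return "".join(b.text for b in resp.content if b.type == "text")
            # Run each requested grep and feed the results back to the model.
            messages.append({"role": "assistant", "content": resp.content})
            messages.append({"role": "user", "content": [
                {"type": "tool_result", "tool_use_id": b.id,
                 "content": grep_docs(b.input["pattern"])}
                for b in resp.content if b.type == "tool_use"
            ]})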
Oh man, maybe this would cause people to write docs that are easy to grep through. Let’s start up that feedback loop immediately, please.
How will you grep synonyms or phrases with different word choices?
I’m hoping the documentation is structured such that Claude can easily come up with good grep regexes. If Claude can do it, I can probably do it only a little bit worse.
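The usual trick for word-choice variation is alternation in the pattern the model hands to the grep tool; a made-up example:

    # A pattern the agent might generate for "how do I get my money back?":
    pattern = r"refund|reimburs|money.?back|charge.?back"  # stems + synonyms
    # i.e. roughly: grep -rniE "refund|reimburs|money.?back|charge.?back" docs/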
What you describe sounds like poor man's RAG. Or lazy man's. You're just doing the augmentation at each prompt.
With RAG the cost per question would be low single-digit pennies.
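Rough back-of-envelope, assuming ~150k tokens of docs and an illustrative $3 per million input tokens:

    full_dump = 150_000 * 3 / 1_000_000  # ≈ $0.45 per question, every question
    rag_top_5 = 4_000 * 3 / 1_000_000    # ≈ $0.012 per question for ~5 retrieved chunks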
Accuracy still drops hard with context length, especially in more technical domains. Plus latency and cost.
That is not particularly cheap, especially since it scales linearly with doc size, and therefore grows over time as your docs do.
Additionally, the quality you get out of the context window decreases as you fill it: just because your model can handle 1M tokens doesn't mean that it WILL remember 1M tokens, it just means that it CAN.
RAG fixes this. In the simplest configuration, RAG can just be an index: the only context you give the LLM is the table of contents, and you let it search through the index.
Should it be a surprise that this is cheaper and more efficient? Loading everything into the context window is like a library having every book open at every page at the same time instead of using the Dewey Decimal system.
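A sketch of that simplest configuration, with a hypothetical three-page docs site; read_page would be exposed to the model as a tool and plugged into the same kind of loop as the grep sketch above:

    # The index (table of contents) is the only docs-related text that lives
    # permanently in the context window.
    INDEX = {
        "Installation": "docs/install.md",
        "Billing & refunds": "docs/billing.md",
        "API reference": "docs/api.md",
    }

    TOC_PROMPT = ("You answer support questions. Table of contents:\n"
                  + "\n".join(f"- {title}" for title in INDEX)
                  + "\nFetch only the pages you need with read_page, then answer.")

    def read_page(title: str) -> str:
        # Tool the model calls to pull one page into context on demand.
        return open(INDEX[title]).read()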
What you described is RAG. Inefficient RAG, but still RAG.
And it's inefficient in two ways:
- you're using extra tokens for every query, which adds up.
- you're making the LLM less precise by overloading it with potentially irrelevant extra info, making it harder for it to find the needle-in-a-haystack answer.
Filtering (e.g. embedding similarity and BM25) and re-ranking/pruning what you feed the model is an optimization: it saves tokens and processing time, and in an ideal world improves the answer too. Most LLMs are far more effective if the retrieved context is limited to what's relevant to the question.
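A sketch of that filter-then-rerank step, assuming pre-chunked docs, the rank_bm25 and sentence-transformers packages, and a deliberately naive score fusion:

    # pip install rank_bm25 sentence-transformers numpy
    import numpy as np
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer, CrossEncoder

    chunks = ["...pre-split doc chunks go here..."]  # placeholder corpus

    bm25 = BM25Okapi([c.lower().split() for c in chunks])
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def retrieve(question: str, k: int = 5) -> list[str]:
        # 1) cheap filters: lexical (BM25) + semantic (embedding similarity)
        lexical = bm25.get_scores(question.lower().split())
        semantic = chunk_vecs @ embedder.encode(question, normalize_embeddings=True)
        shortlist = np.argsort(lexical + semantic)[::-1][:20]  # naive score fusion
        # 2) re-rank the shortlist with a cross-encoder, keep the top k
        scores = reranker.predict([(question, chunks[i]) for i in shortlist])
        return [chunks[shortlist[i]] for i in np.argsort(scores)[::-1][:k]]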
I don't think it's RAG. RAG is specifically about separating the search space from the LLM's context window or training set, and giving the LLM tools to search it at inference time.
In this case their Retrieval stage is "SELECT *", basically, so sure I'm being loose with the terminology, but otherwise it's just a non-selective RAG. Okay ..AG.
RAG is selecting pertinent information to supply to the LLM with your query. In this case they decided that everything was pertinent, and the net result is just reduced efficiency. But if it works for them, eh.
I'm not sure we are talking about the same thing. The root comment talks about concatenating all doc files into a loong text string, and adding that as a system/user prompt to the LLM at inference time before the actual question.
You mention the retrieval stage being a SELECT *? I don't think there's any SQL involved here.
I was being rhetorical. The R in RAG is retrieval: filtering the augmentation data (the A) down to what's likely related to the query. Including everything is just a lazy form of RAG -- the rhetorical SELECT *.
>and adding that as a system/user prompt to the LLM at inference time
You understand this is all RAG is, right? RAG is any additional system to provide contextually relevant (and often more timely) supporting information to a baked model.
Anyways, I think this thread has reached a conclusion and there really isn't much more value in it. Cheers.
Because LLMs still suck at actually using all that context at once. And surely you can see for yourself that your solution doesn't scale. It's great that it works for your specific case, but I'm sure you can come up with a scenario where it's just not feasible.