That is not particularly cheap, especially since cost scales linearly with document size, and so does latency: the bigger the document, the longer every request takes.
Additionally, the quality of loading everything into the context window degrades as the window fills: just because your model can handle 1M tokens doesn't mean it WILL remember 1M tokens, it just means it CAN.
RAG fixes this. In the simplest configuration, RAG can just be an index: the only context you give the LLM is the table of contents, and you let it search through the index for what it needs.
Should it be a surprise that this is cheaper and more efficient? Loading the full context window is like a library keeping every book open to every page at the same time instead of using the Dewey Decimal system.
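Here is a minimal sketch of that simplest configuration in Python, assuming a plain keyword-overlap search standing in for whatever retrieval you would actually use (embeddings, BM25, and so on). All the names and the toy chunks are illustrative, not from any particular library.

```python
# Minimal sketch of "RAG as an index": chunk the document, index the chunks,
# and only hand the LLM the few chunks a search returns. Hypothetical code,
# keyword overlap stands in for a real retriever.
from collections import Counter

def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Index each chunk by its lowercase word counts."""
    return [(chunk, Counter(chunk.lower().split())) for chunk in chunks]

def search(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Score chunks by query-term overlap and return the top k matches."""
    terms = query.lower().split()
    scored = [(sum(counts[t] for t in terms), chunk) for chunk, counts in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored[:k] if score > 0]

# Toy corpus: in practice these would be chunks of your actual document.
chunks = [
    "The Dewey Decimal system assigns numeric classes to library books.",
    "Context window cost grows linearly with the number of tokens.",
    "Retrieval augmented generation fetches only the relevant passages.",
]
index = build_index(chunks)
print(search(index, "how does retrieval augmented generation work?"))
```

The key point is the last line: the prompt only ever sees the handful of chunks the search returns, never the whole document.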