This glosses over a fundamental scaling problem that undermines the entire argument. The author's main example is Claude Code searching through local codebases with grep and ripgrep; from that he extrapolates to claim RAG is dead for all document retrieval. That's a massive logical leap.
Grep works great when you have thousands of files on a local filesystem that you can scan in milliseconds. But most enterprise RAG use cases involve millions of documents across distributed systems. Even with 2M token context windows, you can't fit an entire enterprise knowledge base into context. The author acknowledges this briefly ("might still use hybrid search") but then continues arguing RAG is obsolete.
The bigger issue is semantic understanding. Grep does exact keyword matching. If a user searches for "revenue growth drivers" and the document discusses "factors contributing to increased sales," grep returns nothing. This is the vocabulary mismatch problem that embeddings actually solve. The author spent half the article complaining about RAG's limitations with this exact scenario (his $5.1B litigation example), then proposes grep as the solution, which would perform even worse.
Also, the claim that "agentic search" replaces RAG is misleading. Recent research shows agentic RAG systems embed agents INTO the RAG pipeline to improve retrieval, they don't replace chunking and embeddings. LlamaIndex's "agentic retrieval" still uses vector databases and hybrid search, just with smarter routing.
Context windows are impressive, but they're not magic. The article reads like someone who solved a specific problem (code search) and declared victory over a much broader domain.
I agree.
A great many pundits don't get that RAG means "a technique that enables large language models (LLMs) to retrieve and incorporate new information".
So RAG is a pattern that, as a principle, is applied to almost every process. Context windows? OK, I won't get into all the nitty-gritty details here (embedded systems, small storage devices, security, RAM defects, the cost and storage of contexts for different contexts, etc.), just a hint: the act of filling a context is what? Applied RAG.
RAG is not an architecture, it is a principle. A structured approach. There is a reason why many nowadays refer to RAG as a search engine.
For all our knowledge, there is only one entity with an infinite context window. We still call it God, not the cloud.
Indeed, the name is Retrieval Augmented Generation... so this is generation (synthesis of text) augmented by retrieval (of data from external systems). The goal is to augment the generation, not to improve retrieval.
The improvements needed for the retrieval part are then another topic.
Code is also unique in its suitability for agentic grep retrieval, especially when combined with a language server. Code enforces structure, semantics, and consistency in a way that is much easier to navigate than the complexities of natural language.
Yeah, RAG doesn't say what it's retrieving from; retrieving with grep is still RAG.
Agentic retrieval is really more a form of deep research (from a product standpoint there is very little difference). The key is that LLMs > rerankers, at least when you're not at webscale where the cost differential is prohibitive.
LLMs > rerankers. Yes! I don't like rerankers. They are slow, their context window is small (4096 tokens), they're expensive... It's better when the LLM reads the whole file rather than some top_chunks.
Rerankers are orders of magnitude faster and cheaper than LLMs. Typical latency out of the box on a decent-sized cross-encoder (~4B) will be under 50ms on cheap GPUs like an A10G. You won't be able to run a fancy LLM on that hardware, and without tuning you're looking at hundreds of ms minimum.
More importantly, it's a lot easier to fine-tune a reranker on behavior data than an LLM that makes dozens of irrelevant queries.
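For reference, wiring up a cross-encoder reranker is only a few lines. This is a sketch with an illustrative (and much smaller) model from sentence-transformers, not a tuned production setup:

    # Minimal cross-encoder reranking sketch; the model choice is illustrative.
    from sentence_transformers import CrossEncoder

    # A cross-encoder scores (query, passage) pairs jointly.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def rerank(query, passages, top_k=10):
        # Score every candidate passage against the query in one batch.
        scores = reranker.predict([(query, p) for p in passages])
        ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
        return [p for p, _ in ranked[:top_k]]

    # Usage: narrow 100 retrieved candidates down to the 10 most relevant
    # before they ever reach the LLM context.
    # top_docs = rerank("revenue growth drivers", candidate_chunks, top_k=10)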
This is worth emphasizing. At scale, and when you have the resources to really screw around with them to tune your pipeline, rerankers aren't bad, they're just much worse/harder to use out of the box. LLMs buy you easy robustness, baseline quality and capabilities in exchange for cost and latency, which is a good tradeoff until you have strong PMF and you're trying to increase margins.
More than that, adding longer context isn’t free either in time or money. So filling an LLM context with k=100 documents of mixed relevance may be slower than reranking and filling with k=10 of high relevance.
Of course, the devil is in the details and there’s five dozen reasons why you might choose one approach over the other. But it is not clear that using a reranker is always slower.
Is letting an agent use grep not a form of RAG? I know usually RAG is done with vector databases but grep is definitely a form of retrieval, and it’s augmenting the generation.
>The author spent half the article complaining about RAG's limitations with this exact scenario (his $5.1B litigation example), then proposes grep as the solution, which would perform even worse.
Yeah I found this very confusing. Sad to see such a poor quality article being promoted to this extent.
RAG doesn’t just mean word vectors but can include keyword search. Claude using grep is a form of RAG.
In practice this is not how the term is used.
It bugs me, because the acronym should encompass any form of retrieval - but in practice, people use RAG to specifically refer to embedding-vector-lookups, hence it making sense to say that it's "dying" now that other forms of retrieval are better.
This was essentially my response as well, but the other replies to you also have a point, and I think the key here is that the 'Retrieval' in RAG is very vague; depending on who you are and what you got into RAG for, the term means different things.
I am definitely more aligned with needing what I would rather call 'Deep Semantic Search and Generation': the ability to query text-chunk embeddings of... say 100k PDFs, using the semantics to search for closeness of the 'ideas', feed those into the context of the LLM, and then have the LLM generate a response to the prompt citing the source PDF(s) the closest-matched vectors came from...
That is the killer app of a 'deep research' assistant IMO and you don't get that via just grepping words and feeding related files into the context window.
The downside is: how do you generate embeddings of massive amounts of mixed-media files and store them in a database quickly and cheaply, compared to just grepping a few terms from those files? A CPU grep over text files in RAM is something like five orders of magnitude faster than an embedding model on a GPU generating semantic embeddings of the chunked file and then storing those for later.
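Roughly the pipeline I mean, as a sketch (the embedding model and the ask_llm call are placeholders, and the chunks are assumed to exist already):

    # Sketch of the "Deep Semantic Search and Generation" flow described above.
    # Assumptions: documents are pre-chunked; ask_llm is a placeholder for your LLM call.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    # Ingest (the expensive part): embed every chunk once and keep the source.
    chunks = [
        {"source": "report_a.pdf", "text": "Factors contributing to increased sales..."},
        {"source": "report_b.pdf", "text": "Supply chain risk and vendor dependencies..."},
    ]
    vectors = embedder.encode([c["text"] for c in chunks], normalize_embeddings=True)

    def search(query, k=5):
        # Cosine similarity == dot product on normalized vectors.
        q = embedder.encode([query], normalize_embeddings=True)[0]
        idx = np.argsort(vectors @ q)[::-1][:k]
        return [chunks[i] for i in idx]

    def answer(query):
        hits = search(query)
        context = "\n\n".join(f"[{h['source']}] {h['text']}" for h in hits)
        prompt = f"Answer using only the sources below and cite them.\n{context}\n\nQ: {query}"
        return ask_llm(prompt)  # placeholder for whatever LLM you use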
But couldn’t an LLM search for documents in that enterprise knowledge base just like humans do, using the same kind of queries and the same underlying search infrastructure?
I wouldn't say humans are efficient at that, so there's no reason to copy them, other than as a starting point.
Maybe not efficient, but if the LLMs can't even reach this benchmark then I'm not sure.
Yes but that would be worse than many RAG approaches, which were implemented precisely because there is no good way to cleanly search through a knowledge base for a million different reasons.
At that point, you are just doing Agentic RAG, or even just Query Review + RAG.
I mean, yeah, agentic RAG is the future. It's still RAG though.
Appreciate the feedback. I’m not saying grep replaces RAG. The shift is that bigger context windows let LLMs just read whole files, so you don’t need the whole chunk/embed pipeline anymore. Grep is just a quick way to filter down candidates.
From there the model can handle 100–200 full docs and jot notes into a markdown file to stay within context. That’s a very different workflow than classic RAG.
That's fair, but how do you grep down to the right 100-200 documents from millions without semantic understanding? If someone asks "What's our supply chain exposure?" grep won't find documents discussing "vendor dependencies" or "sourcing risks."
You could expand grep queries with synonyms, but now you're reimplementing query expansion, which is already part of modern RAG. And doing that intelligently means you're back to using embeddings anyway.
The workflow works great for codebases with consistent terminology. For enterprise knowledge bases with varied language and conceptual queries, grep alone can't get you to the right candidates.
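To be concrete about that expansion step, it looks something like this (the synonym table and rg flags are just illustrative), and the synonym table is exactly where embeddings sneak back in:

    # Sketch of naive query expansion for grep: hand-rolled synonyms, regex alternation.
    # The synonym table is the part that ends up wanting embeddings.
    import subprocess

    SYNONYMS = {
        "supply chain": ["vendor dependencies", "sourcing risks", "supplier exposure"],
    }

    def expanded_grep(term, root="."):
        pattern = "|".join([term] + SYNONYMS.get(term, []))
        out = subprocess.run(["rg", "-il", pattern, root], capture_output=True, text=True)
        return out.stdout.splitlines()

    # expanded_grep("supply chain") now also hits docs that say "sourcing risks",
    # but only for phrasings someone thought to list ahead of time.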
the agent greps for the obvious term or terms, reads the resulting documents, discovers new terms to grep for, and the process repeats until it's satisfied it has enough info to answer the question
> You could expand grep queries with synonyms, but now you're reimplementing query expansion, which is already part of modern RAG.
in this scenario "you" are not implementing anything - the agent will do this on its own
this is based on my experience using claude code in a codebase that definitely does not have consistent terminology
it doesn't always work but it seemed like you were thinking in terms of trying to get things right in a single grep when it's actually a series of greps that are informed by the results of previous ones
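roughly the loop I mean, as a sketch (ask_llm and the prompt format are placeholders, not how claude code actually does it):

    # Sketch of an iterative "grep, read, grep again" agent loop.
    import subprocess

    def ripgrep(term, root="."):
        # -i: case-insensitive, -l: list matching file paths only (ripgrep assumed installed)
        out = subprocess.run(["rg", "-il", term, root], capture_output=True, text=True)
        return out.stdout.splitlines()

    def agentic_search(question, max_rounds=10):
        notes = []
        terms = ask_llm(f"Give comma-separated grep terms for: {question}").split(",")
        for _ in range(max_rounds):
            for term in (t.strip() for t in terms if t.strip()):
                for path in ripgrep(term)[:5]:
                    notes.append(open(path, errors="ignore").read()[:4000])
            step = ask_llm(
                f"Question: {question}\nNotes so far: {notes}\n"
                "Reply 'ANSWER: ...' if you have enough info, "
                "otherwise reply with new comma-separated grep terms."
            )
            if step.startswith("ANSWER:"):
                return step
            terms = step.split(",")
        return ask_llm(f"Answer {question} as best you can from: {notes}")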
Classical search
Which is RAG. How you decide to take a set of documents too large for an LLM context window and narrow it down to a set that does fit is an implementation issue.
The chunk, embed, similarity search method was just a way to get a decent classical search pipeline up and running with not too much effort.
I think the most important insight from your article, which I also felt, is that agentic search is really different. The ability to retarget a search iteratively fixes both the issues of RAG and grep approaches - they don't need to be perfect from the start, they only need to get there after 2-10 iterations. This really changes the problem. LLMs have become so smart they can compensate for chunking and not knowing the right word.
But on top of this I would also use AI to create semantic maps, like hierarchical structure of content, and put that table of contents in the context, let the AI explore it. This helps with information spread across documents/chapters. It provides a directory to access anything without RAG, by simply following links in a tree. Deep Research agents build this kind of schema while they operate across sources.
To explore this I built a graph MCP memory system where the agent can search both by RAG and by text matching, and when it finds the top-k nodes it can expand out along links. Writing a node implies having the relevant nodes loaded up first, and, when generating the text, placing contextual links embedded [1] like this. So simply writing a node also connects it to the graph at all the right points. This structure fits better with the kind of iterative work LLMs do.
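Here's a toy version of the expand-by-links step (the [[link]] syntax and node format are simplified stand-ins, not what the MCP system actually stores):

    # Toy sketch of "find top-k nodes, then expand along embedded links".
    import re

    graph = {
        "rag-vs-grep": "Notes on retrieval... see [[rerankers]] and [[agentic-search]].",
        "rerankers": "Cross-encoders trade latency for precision... [[agentic-search]]",
        "agentic-search": "Iterative grep loops converge in a few rounds.",
    }

    def expand(node_ids, hops=1):
        seen, frontier = set(node_ids), list(node_ids)
        for _ in range(hops):
            next_frontier = []
            for nid in frontier:
                # Follow every [[link]] embedded in the node text.
                for target in re.findall(r"\[\[(.+?)\]\]", graph.get(nid, "")):
                    if target not in seen:
                        seen.add(target)
                        next_frontier.append(target)
            frontier = next_frontier
        return {nid: graph[nid] for nid in seen if nid in graph}

    # Usage: top-k search (vector or text) returns ["rag-vs-grep"];
    # one hop of expansion also pulls in "rerankers" and "agentic-search".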
I was previously working at https://autonomy.computer, and building out a platform for autonomous products (i.e., agents) there. I started to observe a similar opportunity. We had an actor-based approach to concurrency that meant it was super cheap performance-wise to spin up a new agent. _That_ in turn meant a lot of problems could suddenly become embarrassingly parallel, and that rather than pre-computing/caching a bunch of stuff into a RAG system you could process whatever you needed in a just-in-time approach. List all the documents you've got, spawn a few thousand agents and give each a single document to process, aggregate/filter the relevant answers when they come back.
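The fan-out itself is almost trivial to sketch (ask_llm is a placeholder, and a real actor runtime replaces the thread pool):

    # Sketch of the "spawn an agent per document, then aggregate" fan-out.
    from concurrent.futures import ThreadPoolExecutor

    def ask_doc(question, doc_path):
        text = open(doc_path, errors="ignore").read()[:100_000]
        return ask_llm(f"Document:\n{text}\n\nQuestion: {question}\nAnswer NONE if irrelevant.")

    def fan_out(question, doc_paths, workers=64):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            answers = pool.map(lambda p: ask_doc(question, p), doc_paths)
        relevant = [a for a in answers if not a.strip().startswith("NONE")]
        # Final aggregation pass over only the relevant per-document answers.
        return ask_llm("Combine these findings into one answer:\n" + "\n".join(relevant))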
Obviously that's not the optimal approach for every use case, but there's a lot where IMO it was better. In particular I was hoping to spend more time exploring it in an enterprise context where you've got complicated sharing and permission models to take into consideration. If you have agents simply passing through the permissions of the user executing the search, whatever you get back is automatically constrained to only the things they had access to in that moment. As opposed to other approaches where you're storing a representation of the data in one place, then trying to work out the intersection of permissions from one or more other systems, and sanitising the results on the way out. That always seemed messy and fraught with problems and the risk of leaking something you shouldn't.
I don't get it. Isn't grep RAG?
In RAG you operate on embeddings and perform vector search, so if you search for "fat lady" it might also retrieve text like "huge queen", because they're semantically similar. Grep, on the other hand, only matches exact strings, so it would not find it.
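To make that concrete (the model name is just an example):

    # Tiny illustration of exact matching vs. embedding similarity.
    from sentence_transformers import SentenceTransformer, util

    query, passage = "fat lady", "the huge queen entered the hall"

    # Exact-string ("grep-style") matching: no hit.
    print(query in passage)                      # False

    # Embedding similarity: the phrases land close together in vector space.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    q, p = model.encode([query, passage])
    print(float(util.cos_sim(q, p)))             # noticeably > 0, unlike the exact match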
R in RAG is for retrieval… of any kind. It doesn’t have to be vector search.
Sure, but vector search is the dominant form of RAG, the rest are niche. Saying "RAG doesn’t have to use vectors" is like saying "LLMs don't have to use transformers". Technically true, but irrelevant when 99% of what's in use today does.
How are they niche? The default mode of search for most dedicated RAG apps nowadays is hybrid search that blends classical BM25 search with some HNSW embedding search. That's already breaking the definition.
A search is a search. The architecture doesn't care if it's doing a vector search or a text search or a keyword search or a regex search; it's all the same. Deploying a RAG app means trying different search methods, or using multiple methods simultaneously or sequentially, to get the best performance for your corpus and use case.
Most hybrid stacks (BM25 + dense via HNSW/IVF) still rely on embeddings as a first-class signal. In practice the vector side carries recall on paraphrase, synonymy, and out-of-vocabulary terms, while BM25 stabilizes precision on exact terms and short documents. So my point still stands.
> The architecture doesn't care
The architecture does care because latency, recall shape, and failure modes differ.
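For what it's worth, the fusion step in those hybrid stacks can be as small as this: a reciprocal rank fusion sketch over two rank-ordered id lists (k=60 is the commonly used constant; the ids are made up):

    # Reciprocal rank fusion (RRF): merge BM25 and dense-vector result lists.
    def rrf(bm25_ranked, dense_ranked, k=60):
        scores = {}
        for ranked in (bm25_ranked, dense_ranked):
            for rank, doc_id in enumerate(ranked):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)

    # BM25 wins on exact terms, dense wins on paraphrase; fusion keeps both signals.
    print(rrf(["d3", "d1", "d7"], ["d1", "d9", "d3"]))   # d1 and d3 float to the top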
I don't know of any serious RAG deployments that don't use vectors. I'm referring to large scale systems, not hobby projects or small sites.
This isn't the case.
RAG means any kind of data lookup which improves LLM generation results. I work in this area and speak to tons of companies doing RAG and almost all these days have realised that hybrid approaches are way better than pure vector searches.
Standard understanding of RAG now is simply adding any data to the context to improve the result.
Not to mention, unless you want to ship entire containers, you are beholden to the unknown quirks of tools on whatever system your agent happens to execute on. It's like taking something already nondeterministic and extremely risky and ceding even more control—let's all embrace chaos.
Generative AI is here to stay, but I have a feeling we will look back on this period of time in software engineering as a sort of dark age of the discipline. We've seemingly decided to abandon almost every hard won insight and practice about building robust and secure computational systems overnight. It's pathetic that this industry so easily sold itself to the illogical sway of marketers and capital.
Mostly, I agree, except that the industry (from where I'm standing) has never done much else but sell itself to marketers and capital.
> It's pathetic that this industry so easily sold itself to the illogical sway of marketers and capital.
What are you implying? Capital has always owned the industry, except for some really small co-ops and FOSS communities.
Isn't grep + LLM a form of RAG anyway?
Yes, this guy's post came up on my LinkedIn. I think it's helpful to consider the source in these types of articles: it was written by a CEO at a fintech startup (and looks AI-generated too). It's obvious from reading it that he doesn't understand what he's talking about and has likely never built any kind of RAG or other retrieval system. He has very limited experience, basically a single project built around rudimentary ingestion of SEC filings; that's his entire breadth of technical experience on the subject. So take what you read with a grain of salt, and do your own research and testing.
It really depends on what you mean by RAG. If you take the acronym at face value yeah.
However, RAG has been used as a stand-in for a specific design pattern where you retrieve data at the start of a conversation or request and then inject it into the request. This simple pattern has benefits compared to just sending a prompt by itself.
The point the author is trying to make is that this pattern kind of sucks compared to Agentic Search, where instead of shoving a bunch of extra context in at the start you give the model the ability to pull context in as needed. By switching from a "push" to a "pull" pattern, we allow the model to augment and clarify the queries it's making as it goes through a task which in turn gives the model better data to work with (and thus better results).
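Schematically, the two patterns look like this (retrieve, ask_llm, and the tool wiring are placeholders, not any particular API):

    # Schematic contrast of "push" vs "pull" retrieval.

    # Classic RAG ("push"): retrieve once up front, inject, hope it was the right query.
    def answer_push(question):
        chunks = retrieve(question, k=10)
        return ask_llm(f"Context:\n{chunks}\n\nQuestion: {question}")

    # Agentic search ("pull"): the model decides when and what to search, and can retry.
    SEARCH_TOOL = {
        "name": "search",
        "description": "Search the knowledge base and return matching passages.",
        "parameters": {"type": "object", "properties": {"query": {"type": "string"}}},
    }

    def answer_pull(question):
        # ask_llm_with_tools is assumed to run the tool-call loop for us.
        return ask_llm_with_tools(question, tools=[SEARCH_TOOL], tool_impls={"search": retrieve})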
I guess, but with a very basic form of exact-match retrieval. Embedding-based RAG tries to augment the prompt with extra data that is semantically similar instead of just exactly the same.
Yeah 100%
Almost all tool calls would result in RAG.
"RAG is dead" just means that rolling your own search and manually injecting the results into context is dead (just use tools). It means the chunking techniques are dead.
Chunking is still relevant, because you want your tool calls to return results specific to the needs of the query.
If you want to know "how are tartans officially registered", you don't want to feed the entire 554 KB Wikipedia article on Tartan to your model, using 138,500 tokens, over 35% of GPT-5's context window, with significant monetary and latency cost. You want to feed it just the "Regulation>Registration" subsection and get an answer 1000x cheaper and faster.
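A rough sketch of that kind of section-targeted chunking (the heading regex and section names are illustrative):

    # Split an article on headings so a tool call can return just the relevant
    # subsection instead of the whole page.
    import re

    def split_sections(markdown_text):
        sections, title, buf = {}, "intro", []
        for line in markdown_text.splitlines():
            m = re.match(r"^#{1,4}\s+(.*)", line)
            if m:
                sections[title] = "\n".join(buf)
                title, buf = m.group(1).strip(), []
            else:
                buf.append(line)
        sections[title] = "\n".join(buf)
        return sections

    # sections = split_sections(open("tartan.md").read())
    # context = sections.get("Registration", "")   # a few KB instead of 554 KB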
But you could. For that example, you could just use a much cheaper model, since it's not that complicated a question, and pass the entire article. Just use Gemini Flash, for example. Models will only get cheaper and context windows will only get bigger.
I've seen it called "agentic search" while RAG seems to have become synonymous with semantic search via embeddings
That's a silly distinction to make, because there's nothing stopping you from giving an agent access to a semantic search.
If I make a semantic search over my organization's Policy As Code procedures or whatever and give it to Claude Code as an MCP, does Claude Code suddenly stop being agentic?
Well yeah, RAG just specifies retrieval-augmented, not that vector retrieval or decoder retrieval was used.
> Grep works great when you have thousands of files on a local filesystem that you can scan in milliseconds. But most enterprise RAG use cases involve millions of documents across distributed systems
Great point, but this grep-in-a-loop probably falls apart (i.e. becomes non-performant) at thousands of docs and tens of simultaneous users, not millions.
Why does grep in a loop fall apart? It’s expensive, sure, but LLM costs are trending toward zero. With Sonnet 4.5, we’ve seen models get better at parallelization and memory management (compacting conversations and highlighting findings).
If LLM costs are trending toward zero, please explain the $600B OpenAI deal with Oracle and the $100B deal with Nvidia.
And if you think those deals are bogus, like I do, you still need to explain surging electricity prices.
"LLM costs are trending toward zero". They will never be zero for the cutting edge. One could argue that costs are zero now via local models but enterprises will always want the cutting edge which is likely to come with a cost
They're not trending toward zero; they're just aggressively subsidized with oil money.
Cursor’s use of grep is bad. It finds definitions way slower and less accurately than I do using IDE indexing, which is frustratingly “right there.” Crazy that there’s not even LSP support in there.
Claude Code is better, but still frustrating.
What exactly is RAG? Is it a specific technology, or a technique?
I'm not a super smart AI person, but grepping through a codebase sounds exactly like what RAG is. Isn't tool use just (more sophisticated) RAG?
Yes, you are right. The OP has a weirdly narrow definition of what RAG is.
Only the most basic "hello world" type RAG systems rely exclusively on vector search. Everybody has been doing hybrid search or multiple simultaneous searches exposed through tools for quite some time now.
RAG is a technique: instead of string matching (like grep), it uses embeddings + vector search to retrieve semantically similar text (car ≈ automobile), then feeds that into the LLM. Tool use is broader; RAG is one pattern within it, but not the same as grep.
Yeah, 'RAG' is quite literally tool use, where the tool is more or less a vector search engine.
What was described as 'RAG' a year ago is now a 'knowledge search in vector db' MCP, with the actual tool and mechanism of knowledge retrieval being exactly the same.