Appreciate the feedback. I’m not saying grep replaces RAG. The shift is that bigger context windows let LLMs just read whole files, so you don’t need the whole chunk/embed pipeline anymore. Grep is just a quick way to filter down candidates.

From there the model can handle 100–200 full docs and jot notes into a markdown file to stay within context. That’s a very different workflow than classic RAG.
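A minimal sketch of that workflow, assuming a hypothetical `llm_summarize` call for the model's reading step:

```python
import subprocess

def grep_candidates(pattern: str, root: str, limit: int = 200) -> list[str]:
    """Cheap filter: grep millions of files down to a readable candidate set."""
    out = subprocess.run(["grep", "-rli", pattern, root],
                         capture_output=True, text=True)
    return out.stdout.splitlines()[:limit]

def read_and_take_notes(files: list[str], notes_path: str = "notes.md") -> None:
    """Read each candidate in full, but keep only compressed notes in a
    markdown scratch file so the running context stays small."""
    with open(notes_path, "a") as notes:
        for path in files:
            text = open(path, errors="ignore").read()
            notes.write(f"## {path}\n{llm_summarize(text)}\n\n")  # hypothetical LLM call
```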

That's fair, but how do you grep down to the right 100-200 documents from millions without semantic understanding? If someone asks "What's our supply chain exposure?" grep won't find documents discussing "vendor dependencies" or "sourcing risks."

You could expand grep queries with synonyms, but now you're reimplementing query expansion, which is already part of modern RAG. And doing that intelligently means you're back to using embeddings anyway.
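To be concrete, "doing that intelligently" ends up looking like embedding retrieval again. A sketch, where `embed` and the precomputed `vocab` of term vectors are assumptions, not anything grep gives you:

```python
import numpy as np

def expand_terms(terms: list[str], vocab: dict[str, np.ndarray],
                 embed, k: int = 5) -> list[str]:
    """Grow a grep term list with nearest neighbours in embedding space,
    e.g. "supply chain" pulling in "vendor dependencies", "sourcing risks"."""
    expanded = set(terms)
    for term in terms:
        q = embed(term)                  # stand-in embedding call
        q = q / np.linalg.norm(q)
        sims = {w: float(v @ q / np.linalg.norm(v)) for w, v in vocab.items()}
        expanded.update(sorted(sims, key=sims.get, reverse=True)[:k])
    return sorted(expanded)
```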

The workflow works great for codebases with consistent terminology. For enterprise knowledge bases with varied language and conceptual queries, grep alone can't get you to the right candidates.

the agent greps for the obvious term or terms, reads the resulting documents, discovers new terms to grep for, and the process repeats until it's satisfied it has enough info to answer the question

> You could expand grep queries with synonyms, but now you're reimplementing query expansion, which is already part of modern RAG.

in this scenario "you" are not implementing anything - the agent will do this on its own

this is based on my experience using claude code in a codebase that definitely does not have consistent terminology

it doesn't always work, but it seemed like you were thinking in terms of trying to get things right in a single grep, when it's actually a series of greps, each informed by the results of the previous ones
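roughly this loop, where each `llm_*` call is a stand-in for the agent's own reasoning:

```python
import subprocess

def grep_files(pattern: str, root: str, limit: int = 200) -> list[str]:
    # grep -rliE: recursive, filenames only, case-insensitive, alternation
    out = subprocess.run(["grep", "-rliE", pattern, root],
                         capture_output=True, text=True)
    return out.stdout.splitlines()[:limit]

def agentic_grep(question: str, root: str, max_rounds: int = 10) -> str:
    terms = llm_propose_terms(question)        # stand-in: first obvious terms
    notes: list[str] = []
    for _ in range(max_rounds):
        hits = grep_files("|".join(terms), root)
        notes += llm_read_and_note(hits)       # stand-in: read docs, jot notes
        if llm_is_satisfied(question, notes):  # stand-in: enough info yet?
            break
        terms = llm_new_terms(question, notes) # vocabulary discovered by reading
    return llm_answer(question, notes)         # stand-in: final answer
```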

Classical search

Which is RAG. How you take a set of documents too large for an LLM context window and narrow it down to a set that does fit is an implementation detail.

The chunk, embed, similarity-search method was just a way to get a decent classical search pipeline up and running without too much effort.
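A sketch of that pipeline, with `embed` as a placeholder for whatever embedding model you pick:

```python
import numpy as np

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Fixed-size overlapping chunks: the 'not too much effort' part."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def build_index(docs: list[str], embed) -> tuple[list[str], np.ndarray]:
    chunks = [c for d in docs for c in chunk(d)]
    vecs = np.stack([embed(c) for c in chunks])   # placeholder embedding call
    return chunks, vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def search(query: str, chunks: list[str], vecs: np.ndarray,
           embed, k: int = 5) -> list[str]:
    q = embed(query)
    q = q / np.linalg.norm(q)
    top = np.argsort(vecs @ q)[::-1][:k]          # cosine similarity search
    return [chunks[i] for i in top]
```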

I think the most important insight from your article, which I also felt, is that agentic search is really different. The ability to retarget a search iteratively fixes the issues of both the RAG and grep approaches: they don't need to be perfect from the start, they only need to get there after 2-10 iterations. That really changes the problem. LLMs have become smart enough to compensate for awkward chunking and for not knowing the right search term up front.

But on top of this I would also use AI to create semantic maps: a hierarchical structure of the content, with that table of contents placed in the context so the AI can explore it. This helps with information spread across documents/chapters. It provides a directory for accessing anything without RAG, simply by following links in a tree. Deep Research agents build this kind of schema while they operate across sources.
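A sketch of the shape of such a map (illustrative, not any particular tool's schema):

```python
from dataclasses import dataclass, field

@dataclass
class TocNode:
    title: str
    summary: str                       # AI-written one-line gist of this subtree
    doc_path: str | None = None        # leaves point at actual documents
    children: list["TocNode"] = field(default_factory=list)

def render_toc(node: TocNode, depth: int = 0) -> str:
    """Flatten the semantic map into a table of contents that sits in the
    model's context; the model navigates by asking for a node's subtree."""
    lines = ["  " * depth + f"- {node.title}: {node.summary}"]
    for child in node.children:
        lines.append(render_toc(child, depth + 1))
    return "\n".join(lines)
```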

To explore this I built a graph MCP memory system where the agent can search both by RAG and by text matching, and when it finds the top-k nodes it can expand outward along links. Writing a node implies first loading the relevant nodes, and then, while generating the text, placing contextual links embedded [1] like this. So simply writing a node also connects it to the graph at all the right points. This structure fits better with the kind of iterative work LLMs do.
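Roughly the shape of it (a sketch of the idea, not the actual MCP schema):

```python
import re
from dataclasses import dataclass, field

@dataclass
class Node:
    id: str
    text: str
    links: list[str] = field(default_factory=list)  # ids of referenced nodes

def write_node(node_id: str, text: str, loaded: dict[int, str]) -> Node:
    """The agent writes text containing markers like [1]; each marker is
    resolved against the nodes it currently has loaded, so writing a node
    wires it into the graph at the right points as a side effect."""
    refs = [loaded[int(m)] for m in re.findall(r"\[(\d+)\]", text)
            if int(m) in loaded]
    return Node(node_id, text, refs)

def expand(start: Node, graph: dict[str, Node], hops: int = 1) -> list[Node]:
    """From a top-k search hit, follow the embedded links outward."""
    seen, frontier = {start.id}, [start]
    for _ in range(hops):
        frontier = [graph[l] for n in frontier for l in n.links if l not in seen]
        seen.update(n.id for n in frontier)
    return [graph[i] for i in seen]
```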

I was previously working at https://autonomy.computer, building out a platform for autonomous products (i.e., agents), and I started to observe a similar opportunity. We had an actor-based approach to concurrency that made it super cheap, performance-wise, to spin up a new agent. _That_ in turn meant a lot of problems could suddenly become embarrassingly parallel, and that rather than pre-computing/caching a bunch of stuff into a RAG system you could process whatever you needed just in time: list all the documents you've got, spawn a few thousand agents, give each a single document to process, and aggregate/filter the relevant answers when they come back.
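In rough Python terms, with asyncio standing in for the actor runtime and `ask_agent` a placeholder for the per-document agent call:

```python
import asyncio

async def one_doc(question: str, path: str) -> str | None:
    """One agent, one document, processed just in time: no pre-built index."""
    reply = await ask_agent(question, document_path=path)  # placeholder agent call
    return reply or None

async def fan_out(question: str, paths: list[str]) -> list[str]:
    answers = await asyncio.gather(*(one_doc(question, p) for p in paths))
    return [a for a in answers if a]  # aggregate/filter the relevant ones

# asyncio.run(fan_out("What's our supply chain exposure?", all_paths))
```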

Obviously that's not the optimal approach for every use case, but there's a lot where IMO it was better. In particular I was hoping to spend more time exploring it in an enterprise context where you've got complicated sharing and permission models to take into consideration. If you have agents simply passing through the permission of the user executing the search whatever you get back is automatically constrained to only the things they had access to in that moment. As opposed to other approaches where you're storing a representation of data in one place, and then trying to work out the intersection of permissions from one of more other systems, and sanitise the results on the way out. Always seemed messy and fraught with problems and the risk of leaking something you shouldn't.