So if I understand this correctly it goes over every possible document with an LLM each time someone performs a search?

I might have misunderstood of course.

If so, then the use cases for this would be fairly limited since you'd have to deal with lots of latency and costs. In some cases (legal documents, medical records, etc.) it might be worth it, though.

An interesting alternative I've been meaning to try out is inverting this flow. Instead of using an LLM at time of searching to find relevant pieces to the query, you flip it around: at time of ingesting you let an LLM note all of the possible questions that you can answer with a given text and store those in an index. You could then use some traditional full-text search or other algorithms (BM25?) to search for relevant documents and pieces of text. You could even go for a hybrid approach with vectors on top of or next to this. Maybe vectors first and then more ranking with something more traditional.
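Sketching roughly what I mean (rank_bm25 is just one convenient BM25 implementation, and llm_generate_questions is a hypothetical helper, nothing from the linked project):

```python
# Minimal sketch of the inverted flow: LLM work happens at ingest time,
# query time is plain lexical search. Assumes the rank_bm25 package.
from rank_bm25 import BM25Okapi

def llm_generate_questions(text: str) -> list[str]:
    # Hypothetical: ask an LLM "what questions can this passage answer?"
    raise NotImplementedError("wire this to your LLM provider")

def build_index(chunks: list[str]):
    # Ingest time: store (generated question, source chunk) pairs.
    entries = [(q, c) for c in chunks for q in llm_generate_questions(c)]
    bm25 = BM25Okapi([q.lower().split() for q, _ in entries])
    return bm25, entries

def search(bm25, entries, query: str, k: int = 5) -> list[str]:
    # Query time: BM25 over the generated questions, no LLM call needed.
    scores = bm25.get_scores(query.lower().split())
    top = sorted(range(len(entries)), key=scores.__getitem__, reverse=True)[:k]
    return [entries[i][1] for i in top]
```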

What appeals to me with that setup is low latency and good debuggability of the results.

But as I said, maybe I've misunderstood the linked approach.

> An interesting alternative I've been meaning to try out is inverting this flow. Instead of using an LLM at time of searching to find relevant pieces to the query, you flip it around: at time of ingesting you let an LLM note all of the possible questions that you can answer with a given text and store those in an index.

You may already know of this one, but consider giving Google LangExtract a look. A lot of companies are doing what you described in production, too!

This is just a variation of index-time HyDE (Hypothetical Document Embeddings). I used a similar strategy when building the index and search engine for findsight.ai.
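For the curious, the index-time variant looks roughly like this; sentence-transformers is just an example encoder here, and the (question, chunk) pairs are whatever an LLM generated at ingest, not my actual findsight.ai pipeline:

```python
# Sketch: embed LLM-generated hypothetical questions instead of raw chunks,
# then match real queries against those question vectors.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def build_question_index(question_chunk_pairs):
    # question_chunk_pairs: list of (generated question, source chunk)
    vecs = encoder.encode([q for q, _ in question_chunk_pairs],
                          normalize_embeddings=True)
    return np.asarray(vecs), question_chunk_pairs

def search(query, vecs, pairs, k=5):
    qv = encoder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(vecs @ qv)[::-1][:k]   # cosine similarity (unit vectors)
    return [pairs[i][1] for i in top]
```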

I’ve been working on RAG systems a lot this year, and I think one thing people miss is that for internal RAG, efficiency/latency is often not the main concern. You want predictable, linear pricing of course, but sometimes you want to simply be able to get a predictably better response by throwing a bit more money/compute time at it.

It’s really hard to get to such a place with standard vector-based systems, even GraphRAG. Because it relies on pre-computed summaries of topic clusters, if one of those summaries is inaccurate, or none of them deals with your exact question, that will never change during query processing. Moreover, GraphRAG preprocessing is insanely expensive and precisely does not scale linearly with your dataset.

TL;DR: all the trade-offs in RAG system design are still being explored, but in practice I’ve found the main desired property to be “predictably better answers with predictably scaling cost”, and I can see how similar concerns got the OP to this design.

> Moreover, GraphRAG preprocessing is insanely expensive and precisely does not scale linearly with your dataset.

Sounds interesting. What exactly is the expensive computation?

On a separate note: I have a feeling RAG could benefit from a kind of “simultaneous vector search” across several different embedding spaces, sort of like AND in an SQL database. Do you agree?
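To make the question concrete, something like this toy sketch, where a document only counts as a hit if it scores well in every embedding space:

```python
# Toy sketch of "AND across embedding spaces": each document has a vector in
# several different spaces, and the combined score is its worst per-space score.
# Brute-force cosine here; a real system would use an ANN index per space.
import numpy as np

def and_search(query_vecs, doc_vecs_per_space, k=10):
    # query_vecs[s] and doc_vecs_per_space[s] are L2-normalized arrays for space s.
    per_space = [docs @ q for q, docs in zip(query_vecs, doc_vecs_per_space)]
    combined = np.minimum.reduce(per_space)   # "AND": take the minimum score
    return np.argsort(combined)[::-1][:k]
```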

GraphRAG does full entity extraction across the entire data set, then looks at every relation between those entities in the documents, then looks at every “community” of relations and generates narratives/descriptions for everything at all of those levels. That is… not linear scaling in relation to your data to say the least — and because questions will be answered on the basis of this preprocessing you don’t want to just use the stupidest/cheapest LLM available. It adds up pretty quickly — and most of the preprocessing turns out to be useless for questions you’ll ask. The OP’s approach is more expensive per query, but you’re more likely to get good results for that particular question.
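To make the cost structure concrete, a simplified sketch of the indexing stages (not the actual GraphRAG code; llm_complete is a stand-in for any LLM call):

```python
# Why GraphRAG-style indexing cost blows up, in rough outline.
from itertools import combinations

def llm_complete(prompt: str) -> str:
    # Hypothetical stand-in for a chat-completion call.
    raise NotImplementedError("wire this to your LLM provider")

def graphrag_style_index(chunks: list[str]):
    # 1. Entity extraction: one LLM call per chunk (linear in data size).
    entities = [llm_complete(f"Extract entities and claims:\n{c}") for c in chunks]

    # 2. Relations: candidate pairs grow roughly quadratically with the number
    #    of extracted entities, and each description is another LLM call.
    relations = [llm_complete(f"Describe how these relate:\n{a}\n{b}")
                 for a, b in combinations(entities, 2)]

    # 3. Communities: cluster the relation graph, then generate a narrative
    #    summary per community at every level of the hierarchy. Most of these
    #    narratives are never retrieved for any question you actually ask.
    ...
```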

Yes, in our use case it's been diagnosis of issues, and it draws on documents for that. The latency doesn't matter because it's all done before the diagnosis is raised to the customer.

> You want predictable, linear pricing of course, but sometimes you want to simply be able to get a predictably better response by throwing a bit more money/compute time at it.

Through more thorough ANN vector search / higher recall, or would it also require different preprocessing?

Honestly I don’t know the best answer, but my sense is there’s something important in the direction the OP is going: i.e. moving away from vector search or preprocessing towards dynamic exploration of the document space by an agent. Ultimately, if the content in one’s corpus develops in a linear manner (things build one after another), no vector search will ever work on its own, since you just get a (however exhaustive) list of every passage directly relevant to the question — but not how those relate to all the text before or after.

GraphRAG gets around this by preprocessing these “narrative” summaries of pretty much every combination of topics in a document: vector search then returns a combination of individual topic descriptions, relations between topic descriptions, raw excerpts from the data, and then such overarching “narratives.” This definitely works pretty well in general, but a lot of the narratives turn out to be pretty useless for the important questions, and the preprocessing is expensive.

I think the area that hasn’t been explored enough is generating these narratives dynamically, i.e. more or less as the OP does: having the agent simulate reading through every document with a question in mind and a log of possibly relevant findings. Obviously that’s expensive per query, but if you can get the right answer to an important question for less than the cost of a human’s time, it’s worth it. GraphRAG preprocessing costs a lot (its cost scales far worse than linearly with your data) and that cost doesn’t guarantee a good answer to any particular question.
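A rough sketch of what I mean by generating the narrative dynamically; llm_complete is a hypothetical helper, and this is my reading of the idea rather than the OP's actual code:

```python
# Dynamic "narrative": walk the corpus with the question in mind and keep a
# running log of findings, instead of pre-computing generic summaries.
def llm_complete(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM provider")

def investigate(question: str, documents: list[str]) -> str:
    notes = []  # running log of possibly relevant findings
    for doc in documents:
        prompt = (
            f"Question: {question}\n"
            "Notes so far:\n" + "\n".join(notes[-10:]) + "\n\n"
            f"Document:\n{doc}\n\n"
            "Add anything relevant to the question, or reply NOTHING."
        )
        note = llm_complete(prompt)
        if note.strip().upper() != "NOTHING":
            notes.append(note)
    # Linear in corpus size *per query*, but the narrative is built for this
    # exact question instead of being pre-computed for questions nobody asks.
    return llm_complete(f"Question: {question}\nNotes:\n" + "\n".join(notes))
```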

> An interesting alternative I've been meaning to try out is inverting this flow.

This is what I am doing with my AI Search Assistant feature, which I discuss in more detail via the link below:

https://github.com/gitsense/chat/blob/main/packages/chat/wid...

By default, I provide what I call a "Tiny Overview Analyzer". You can read the prompt for the Analyzer with the link below:

https://github.com/gitsense/chat/blob/main/packages/chat/wid...

In a nutshell, it generates a very short summary of every document along with keywords. The basic idea is to use BM25 ranking to identify the most relevant documents for the AI to review. For example, my use case is to understand how Aider, Claude Code, etc., store their conversations so that I can make them readable in my chat app. To answer this, I would ask 'How does Aider store conversations?' and the LLM would construct a deterministic keyword search using terms that would most likely identify how conversations are stored.
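Roughly, the BM25 side looks like this (a simplified sketch assuming the rank_bm25 package, not the actual GitSense code; the real prompts are in the links above):

```python
# Rank documents by BM25 over their tiny summaries plus keywords, using the
# deterministic keyword query the LLM constructs from the user's question.
from rank_bm25 import BM25Okapi

def build_overview_index(overviews: dict[str, dict]):
    # overviews: path -> {"summary": str, "keywords": [str, ...]}
    paths = list(overviews)
    docs = [(overviews[p]["summary"] + " " + " ".join(overviews[p]["keywords"]))
            .lower().split() for p in paths]
    return paths, BM25Okapi(docs)

def candidate_files(paths, bm25, keyword_query: str, k: int = 20):
    # e.g. keyword_query = "aider store conversations history format"
    scores = bm25.get_scores(keyword_query.lower().split())
    top = sorted(range(len(paths)), key=scores.__getitem__, reverse=True)[:k]
    return [paths[i] for i in top]
```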

Once I have the list of files, the LLM is asked again to review the summaries of all matches and suggest which documents should be loaded in full for further review. I've found this approach to be inconsistent, however. What I've found to work much better is just loading the "Tiny Overview" summaries into context and chatting with the LLM. For example, I would ask the same question: "Which files do you think can tell me how Aider stores conversations? Identify up to 20 files and create a context bundle for them so I can load them into context." For a thousand files, you can easily fit three-sentence summaries for each of them without overwhelming the LLM. Once I have my answer, I just need a few clicks to load the files into context, and then the LLM will have full access to the file content and can better answer my question.
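The summaries-in-context variant is even simpler in sketch form (llm_complete is a hypothetical helper):

```python
# Put every tiny overview in the prompt and let the model pick the files;
# ~1000 three-sentence summaries fit comfortably in a modern context window.
def llm_complete(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM provider")

def pick_files(question: str, overviews: dict[str, str], limit: int = 20) -> str:
    listing = "\n".join(f"{path}: {summary}" for path, summary in overviews.items())
    return llm_complete(
        f"{listing}\n\n"
        f"Question: {question}\n"
        f"Identify up to {limit} files most likely to answer this and list their paths."
    )
```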

I didn't look at the implementation, but it sounds similar to something I built two years ago: recursively summarize the documentation based on structure (domain/page/section) and then ask the model to walk the hierarchy based on the summaries.

My motivation back then was that I only had an 8k context length to work with, so I had to be very conservative about what I included. I still used vectors to narrow down the entry points and then used an LLM to drill down or pick the most relevant ones. The search threads were separate; each would summarize its response based on the tree path it took, and then the main thread would combine them.
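From memory, the walk looked roughly like this (simplified sketch; llm_complete stands in for the model call, and the node layout mirrors the domain/page/section structure):

```python
# Walk a tree of summaries: at each level, ask the model which child summaries
# look relevant, recurse into those, and return only the leaf sections reached.
def llm_complete(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM provider")

class Node:
    def __init__(self, summary: str, children=None, text: str = ""):
        self.summary, self.children, self.text = summary, children or [], text

def walk(question: str, node: Node, path=()):
    if not node.children:                 # leaf section: return its text
        return [(path, node.text)]
    choice = llm_complete(
        f"Question: {question}\n" +
        "\n".join(f"{i}: {c.summary}" for i, c in enumerate(node.children)) +
        "\nReply with the relevant indices, comma-separated."
    )
    picked = [int(i) for i in choice.split(",")
              if i.strip().isdigit() and int(i) < len(node.children)]
    results = []
    for i in picked:
        results += walk(question, node.children[i], path + (i,))
    return results
```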


> let an LLM note all of the possible questions that you can answer

What does this even mean? At what point do you know you have all of them?

Humans are quite ingenious at coming up with new, unique questions, in my observation, whereas LLMs have a hard time replicating that efficiently.

Cantor's diagonalization is trivial to show for questions: there are uncountably many.

You can use document search strategies (like SQL metadata search, semantic search, doc description search by LLM, etc.) to narrow down the doc candidates first.
