I’ve been working on RAG systems a lot this year, and I think one thing people miss is that for internal RAG, efficiency/latency is often not the main concern. You want predictable, linear pricing of course, but sometimes you want to simply be able to get a predictably better response by throwing a bit more money/compute time at it.

It’s really hard to get to such a place with standard vector-based systems, even GraphRAG. Because GraphRAG relies on pre-computed summaries of topic clusters, if one of those summaries is inaccurate, or none of them deals with your exact question, that will never change during query processing. Moreover, GraphRAG preprocessing is insanely expensive and precisely does not scale linearly with your dataset.

TL;DR: all the trade-offs in RAG system design are still being explored, but in practice I’ve found the main desired property to be “predictably better answer with predictably scaling cost”, and I can see how similar concerns led OP to this design.

> Moreover, GraphRAG preprocessing is insanely expensive and precisely does not scale linearly with your dataset.

Sounds interesting. What exactly is the expensive computation?

On a separate note: I have a feeling RAG could benefit from a kind of “simultaneous vector search” across several different embedding spaces, sort of like AND in an SQL database. Do you agree?
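Roughly what I have in mind, as a toy sketch (the cosine search and the intersect-on-chunk-id step are just my assumptions for how the “AND” would work):

```python
# Toy sketch of an "AND" across several embedding spaces: take the top-k in
# each space independently, then keep only chunk ids that appear in every set.
import numpy as np

def top_k_ids(query_vec, index_vecs, k=50):
    # cosine similarity of one query against an (n_chunks, dim) matrix
    sims = index_vecs @ query_vec / (
        np.linalg.norm(index_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return set(np.argsort(-sims)[:k].tolist())

def and_search(query_vecs, indexes, k=50):
    # query_vecs[i] and indexes[i] belong to embedding space i
    # (e.g. a general-purpose model and a domain-specific one)
    hits = [top_k_ids(q, idx, k) for q, idx in zip(query_vecs, indexes)]
    return set.intersection(*hits)    # chunks that match in *every* space
```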

GraphRAG does full entity extraction across the entire data set, then looks at every relation between those entities in the documents, then looks at every “community” in the resulting entity graph and generates narratives/descriptions at every one of those levels. That is… not linear scaling with your data, to say the least; and because questions will be answered on the basis of this preprocessing, you don’t want to just use the stupidest/cheapest LLM available. It adds up pretty quickly, and most of the preprocessing turns out to be useless for the questions you’ll actually ask. The OP’s approach is more expensive per query, but you’re more likely to get good results for that particular question.
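In code, the indexing passes look roughly like this. This is a hypothetical sketch, not the actual GraphRAG implementation; `llm` and `detect_communities` just stand in for the model calls and the graph-clustering step:

```python
# Hypothetical sketch of the GraphRAG-style indexing passes described above.
# llm() stands in for a model call that returns parsed structured output;
# detect_communities() stands in for a graph-clustering step (e.g. Leiden).
# Every loop body below is paid LLM work, at every level of the hierarchy.

def build_graph_index(chunks, llm, detect_communities):
    entities, relations = [], []
    for chunk in chunks:                              # pass 1: entity extraction
        entities += llm(f"List the entities mentioned in:\n{chunk}")
    for chunk in chunks:                              # pass 2: relation extraction
        relations += llm(f"Describe the relations between entities in:\n{chunk}")

    graph = (entities, relations)
    narratives = {}
    # pass 3: narrative summaries for every community, at every hierarchy level
    for level, communities in enumerate(detect_communities(graph)):
        for i, community in enumerate(communities):
            narratives[(level, i)] = llm(
                f"Write a narrative summary of this cluster:\n{community}")
    return graph, narratives
```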

Yes, in our use case it’s been diagnosis of issues, drawing on the relevant documents. The latency doesn’t matter because it’s all done before the diagnosis is raised to the customer.

> You want predictable, linear pricing of course, but sometimes you want to simply be able to get a predictably better response by throwing a bit more money/compute time at it.

Through more thorough ANN vector search / higher recall, or would it also require different preprocessing?

Honestly I don’t know the best answer, but my sense is there’s something important in the direction the OP is going: i.e. moving away from vector search and preprocessing towards dynamic exploration of the document space by an agent. Ultimately, if the content in one’s corpus develops in a linear manner (things build one after another), no vector search will ever work on its own, since you just get a list, however exhaustive, of every passage directly relevant to the question, but not how those passages relate to the text before or after them.

GraphRAG gets around this by precomputing these “narrative” summaries of pretty much every combination of topics in a document: vector search then returns a mix of individual topic descriptions, relations between topic descriptions, raw excerpts from the data, and those overarching “narratives.” This works pretty well in general, but a lot of the narratives turn out to be useless for the questions that matter, and the preprocessing is expensive, etc.
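The query side then looks roughly like this (again a sketch, not GraphRAG’s actual API; the per-level stores and their `search` method are my assumptions):

```python
# Hypothetical sketch of GraphRAG-style retrieval: one query embedding is
# matched against several pre-built index levels, and the hits are merged
# into a single context for the answering model.

def retrieve(query_vec, stores, k=5):
    # stores might look like: {"entity_descriptions": ..., "relation_descriptions": ...,
    #                          "raw_chunks": ..., "community_narratives": ...}
    context = []
    for level, store in stores.items():
        for hit in store.search(query_vec, k):        # nearest neighbours per level
            context.append((level, hit))
    return context                                    # the mixed bag fed to the LLM
```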

I think the area that hasn’t been explored enough is generating these narratives dynamically, i.e. more or less what the OP does: having the agent simulate reading through every document with the question in mind and a log of possibly relevant issues. Obviously that’s expensive per query, but if you can get the right answer to an important question for less than the cost of a human’s time, it’s worth it. GraphRAG preprocessing costs a lot (it scales superlinearly with data) and that cost doesn’t guarantee a good answer to any particular question.
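The dynamic version I mean is something like this (a sketch under obvious assumptions; `llm` is whatever model you call and the prompt format is made up):

```python
# Hypothetical sketch of query-time "narrative" building: instead of
# pre-computed community summaries, the agent walks the documents with the
# question in mind and keeps a running log of possibly relevant findings.

def read_with_question(documents, question, llm):
    log = []                                          # notes accumulated so far
    for doc in documents:
        notes_so_far = "\n".join(log) if log else "(none yet)"
        note = llm(
            f"Question: {question}\n"
            f"Notes so far:\n{notes_so_far}\n"
            f"Document:\n{doc}\n"
            "If this document adds anything relevant to the question, "
            "summarise what it adds; otherwise reply IRRELEVANT."
        )
        if "IRRELEVANT" not in note:
            log.append(note)
    # final pass: answer from the accumulated notes rather than from raw chunks
    notes = "\n".join(log)
    return llm(f"Question: {question}\nNotes:\n{notes}\n"
               "Answer the question using only these notes.")
```

The per-query cost then scales with how much of the corpus the agent reads, which is exactly the trade-off: you pay at question time rather than at indexing time, and only for the questions you actually ask.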