Full Disclosure: I built an indexing engine for Git and GitHub that can process repos at scale, so take my words with scepticism.

I think using MCP is an interesting idea, but the heavy lifting that actually produces the insights doesn't live in MCP. For fetch and search to work effectively, the MCP server needs quality context to know what to consider. I'm biased, and I really looked into chunking documents, but given how the LLM landscape is evolving, I don't think chunking makes a lot of sense any more (for code, at least).

I've committed to generating short and long overviews for directories and files: short overviews are two to three sentences, long overviews are two to three paragraphs. Given how effectively newer LLMs can process 100,000 tokens or less, you can feed the model the short overview for every file/directory and let it decide which files to sub-query, i.e. which long overviews to load into context for the sub-query.
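Roughly, that two-stage lookup looks like the sketch below. It's minimal and assumes a lot: `ask_llm` is a placeholder for whatever client you use, and the overview dicts are assumed to be keyed by path.

```python
import json

def ask_llm(prompt: str) -> str:
    """Placeholder for your LLM client (OpenAI, Anthropic, a local model, ...)."""
    raise NotImplementedError

def select_paths(short_overviews: dict[str, str], question: str) -> list[str]:
    # Stage 1: all short overviews fit in one context window;
    # the model picks the paths worth a closer look.
    catalog = "\n".join(f"{path}: {desc}" for path, desc in short_overviews.items())
    prompt = (
        f"File/directory overviews:\n{catalog}\n\n"
        f"Question: {question}\n"
        "Reply with a JSON array of the paths worth inspecting."
    )
    return json.loads(ask_llm(prompt))

def sub_query(short_overviews: dict[str, str],
              long_overviews: dict[str, str], question: str) -> str:
    # Stage 2: load only the selected long overviews into context.
    paths = select_paths(short_overviews, question)
    context = "\n\n".join(f"## {p}\n{long_overviews[p]}" for p in paths)
    return ask_llm(f"{context}\n\nQuestion: {question}")
```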

I also believe most projects in the future will start to produce READMEs for LLMs that are verbose and not easy for humans to grok, but rich in detail for the model. You may not want the LLM to generate code for you, but it can certainly help you navigate complex or unfamiliar code in a semantic manner, which can be a game changer for onboarding.

That sounds really interesting! What got us into this project is the problem of handing the LLM a large llms-full.txt file as context, for example. We wanted to give agents an easy way to get the documentation for every repo (be it llms.txt, README, etc.), but also to search chunks of it using semantic search. Happy to chat more if you like - sounds like we could benefit from bouncing ideas and notes.
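For the chunk search, think something along these lines - a minimal sketch, not our actual implementation; `embed()` stands in for whatever embedding model you use, and the chunker is deliberately naive:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for an embedding model (e.g. a sentence-transformers call)."""
    raise NotImplementedError

def chunk(doc: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Naive fixed-size character chunking with overlap."""
    step = size - overlap
    return [doc[i:i + size] for i in range(0, len(doc), step)]

def search(chunks: list[str], query: str, k: int = 5) -> list[str]:
    """Rank chunks by cosine similarity to the query embedding."""
    q = embed(query)
    q = q / np.linalg.norm(q)
    scored = []
    for c in chunks:
        v = embed(c)
        scored.append(float(np.dot(q, v / np.linalg.norm(v))))
    top = np.argsort(scored)[::-1][:k]
    return [chunks[i] for i in top]
```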