Skimmed the repo; this is basically the irreducible core of an agent: small loop, provider abstraction, tool dispatch, and chat gateways. The LOC reduction (99%, from 400k to 4k) mostly comes from leaving out RAG pipelines, planners, multi-agent orchestration, UIs, and production ops.
RAG seems odd when you can just have a coding agent manage memory by managing folders. Multi-agent also feels weird when you have subagents.
Yeah, vector-embedding-based RAG has fallen out of fashion somewhat.
It was great when LLMs had 4,000 or 8,000 token context windows and the biggest challenge was efficiently figuring out the most likely chunks of text to feed into that window to answer a question.
These days LLMs all have 100,000+ token context windows, which means you don't have to be nearly as selective. They're also exceptionally good at running search tools - give them grep or rg or even `select * from t where body like ...` and they'll almost certainly be able to find the information they need after a few loops.
Vector embeddings give you fuzzy search, so "dog" also matches "puppy" - but a good LLM with a search tool will search for "dog" and then try a second search for "puppy" if the first one doesn't return the results it needs.
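Concretely, the "search tool" here can be as simple as a thin wrapper around ripgrep (a sketch of my own, not anything from the repo); the retry-with-a-synonym behavior comes from the model, not the tool:

```python
import subprocess

def search_tool(pattern: str, path: str = ".") -> str:
    """Run ripgrep and return matching lines (empty string means no hits)."""
    result = subprocess.run(
        ["rg", "--line-number", "--max-count", "20", pattern, path],
        capture_output=True, text=True,
    )
    return result.stdout

# Exposed as a tool in the agent loop: if search_tool("dog") comes back
# empty, the model simply calls search_tool("puppy") on its next turn.
```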
The fundamental problem with RAG is that embeddings capture only surface-level features: "31+24" won't embed close to "55", while "not happy" will embed close to "happy". Another issue is that embedding similarity does not indicate logical dependency: you won't retrieve the callers of a function with RAG; you need an LLM or code analysis for that. The third issue is chunking: to embed you need to chunk, but chunking can exclude information that turns out to be essential.
The best way to search, I think, is a coding agent with grep and file-system access, because the agent can adapt and explore instead of trying to one-shot it.
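To make the surface-features point concrete, here's a rough illustration (assuming the sentence-transformers package and the all-MiniLM-L6-v2 model; exact scores will vary by model):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

pairs = [
    ("dog", "puppy"),        # related words: expect high cosine similarity
    ("not happy", "happy"),  # opposite meaning: typically still embeds close
    ("31+24", "55"),         # same value: typically embeds far apart
]
for a, b in pairs:
    emb = model.encode([a, b])
    print(f"{a!r} vs {b!r}: {float(util.cos_sim(emb[0], emb[1])):.2f}")
```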
I am making my own search tool based on the principle of LoD (level of detail): any large text input can be trimmed down to roughly 10KB with clever trimming. For example, you can trim the middle of a paragraph while keeping its start and end, or trim the middle of a large file. An agent can then zoom in and out of a large file: it skims the structure first, then drills into the relevant sections. I'm using it for analyzing logs, repos, zip files, long PDFs, and coding-agent sessions, which can run into megabytes. Depending on the content type, different kinds of compression can be applied for code and tree-structured data. There is also a "tall narrow cut" (like `cut -c -50` on a file).
The promise: input of any size fits into 10KB "glances", and the model can find things more efficiently this way without loading the whole thing.
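A minimal sketch of the trimming idea (the names and budgets here are just illustrative, not the actual tool):

```python
def lod_trim(text: str, budget: int = 10_000) -> str:
    """Keep the head and tail of the text, elide the middle."""
    if len(text) <= budget:
        return text
    half = budget // 2
    return text[:half] + "\n[... trimmed ...]\n" + text[-half:]

def narrow_cut(text: str, width: int = 50) -> str:
    """Tall narrow cut: keep only the first `width` chars of each line."""
    return "\n".join(line[:width] for line in text.splitlines())

# An agent first looks at lod_trim(big_file) to see its overall shape,
# then asks for an untrimmed slice of the section it cares about.
```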
Ok 2 hours later here is the release: https://github.com/horiacristescu/nub
This is a very cool idea. I've been dragging CC around very large code bases with a lot of docs and stuff. It does great, but it can be a swing and a miss. I have been wondering if there is a more efficient/effective way. This got me thinking. Thanks for sharing!
Context rot is still a problem though, so maybe vector search will stick around in some form. Perhaps we will end up with a tool called `vector grep` or `vg` that handles the vectorized search independent of the agent.
I've been leaning towards multi-agent because the subagent approach relies on the main agent having all the power and using it responsibly.
Totally useless indeed.
Interesting.
I guess RAG is faster? But I'm realizing I'm outdated now.
No, RAG is definitely preferable once your memory size grows above a few hundred lines of text (which you can just dump into the context for most current models), since you're no longer fighting context limits and needle-in-a-haystack LLM retrieval performance problems.
> once your memory size grows above a few hundred lines of text (which you can just dump into the context for most current models)
A few hundred lines of text is nothing for current LLMs.
You can dump the entire contents of The Great Gatsby into any of the frontier LLMs and it’s only around 70K tokens. This is less than 1/3 of common context window sizes. That’s even true for models I run locally on modest hardware now.
The days of chunking everything into paragraphs or pages and building complex workflows to store embeddings, search, and rerank in a big complex pipeline are going away for many common use cases. Having LLMs use simpler tools like grep based on an array of similar search terms and then evaluating what comes up is faster in many cases and doesn’t require elaborate pipelines built around specific context lengths.
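For example, the "array of similar search terms" approach can be a few lines around ripgrep (a hedged sketch, not any particular library's API):

```python
import subprocess

def multi_term_grep(terms: list[str], path: str = ".") -> dict[str, str]:
    """Union of ripgrep hits for a handful of related search terms."""
    hits = {}
    for term in terms:
        out = subprocess.run(
            ["rg", "--ignore-case", "--line-number", term, path],
            capture_output=True, text=True,
        ).stdout
        if out:
            hits[term] = out
    return hits

# e.g. multi_term_grep(["dog", "puppy", "canine"]) -- in practice the LLM
# proposes the term list and then evaluates the combined results.
```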
Yes, but how good will the recall performance be? Just because your prompt fits into context doesn't mean that the model won't be overwhelmed by it.
When I last tried this with some Gemini models, they couldn't reliably identify specific scenes in a 50K-word novel unless I trimmed the context down to a few thousand words.
> Having LLMs use simpler tools like grep based on an array of similar search terms and then evaluating what comes up is faster in many cases
Sure, but then you're dependent on (you or the model) picking the right phrases to search for. With embeddings, you get much better search performance.
> Yes, but how good will the recall performance be? Just because your prompt fits into context doesn't mean that the model won't be overwhelmed by it.
With current models it's very good.
Anthropic used a needle-in-haystack example with The Great Gatsby to demonstrate the performance of their large context windows all the way back in 2023: https://www.anthropic.com/news/100k-context-windows
Everything has become even better in the nearly 3 years since then.
> Sure, but then you're dependent on (you or the model) picking the right phrases to search for. With embeddings, you get much better search performance.
How are those embeddings generated?
You're dependent on the embedding model to generate embeddings the way you expect.
That doesn’t match my experience, both in test and actual usage scenarios.
Gemini 3 Pro fails to satisfy pretty straightforward semantic content lookup requests for PDFs longer than a hundred pages for me, for example.
> for PDFs longer than a hundred pages for me
Your original comment that I responded to said a "few hundred lines of text", not hundred page PDFs.
I think it still has a place if your agent is part of a bigger application that you are running and you want to quickly get something into your model's context for a quick turnaround.
Unless I'm misunderstanding what they are, planners seem kind of important.
As you mentioned, that depends on what you mean by planners.
An LLM will implicitly decompose a prompt into tasks and then sequentially execute them, calling the appropriate tools. The architecture diagram helpfully visualizes this [0].
Here, though, "planners" means autonomous planners that exist as higher-level infrastructure, handling external task decomposition, persistent state, tool scheduling, error recovery/replanning, and branching/search. Think of a prompt like "Scan repo for auth bugs, run tests, open PR with fixes, notify Slack" that just runs continuously 24/7; that would be beyond what nanobot could do. However, something like "find all the receipts in my emails for this year, then zip and email them to my accountant for my tax return" is something nanobot would do.
[0] https://github.com/HKUDS/nanobot/blob/main/nanobot_arch.png
Sure, instruction-tuned models implicitly plan, but they can easily lose the plot on long contexts. If you're going to have an agent running continuously and accumulating memory (parsing results from tool use, web fetches, previous history, etc.), then plan decomposition, persistence, and error recovery seem like a good idea, so you can start subagents with fresh contexts for task items and they stay on task or can recover without starting everything over again. It also seems better for cost, since input and output contexts are more bounded.
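Something as simple as this (a sketch of the pattern, not nanobot's actual design) already buys you persistence and a fresh per-task context:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TaskItem:
    description: str
    status: str = "pending"   # pending | done | failed
    attempts: int = 0

def run_plan(items, run_subagent, state_path="plan.json"):
    for item in items:
        if item.status == "done":
            continue
        item.attempts += 1
        try:
            run_subagent(item.description)   # subagent starts with a fresh context
            item.status = "done"
        except Exception:
            item.status = "failed"           # candidate for replanning/retry
        with open(state_path, "w") as f:     # persist after every step
            json.dump([asdict(i) for i in items], f, indent=2)
```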
I don’t know what these planners do, but I’ve had reasonably good luck asking a coding agent to write a design doc and then reviewing it a few times.
RAG is broken when you have too much data.
Specifically when the document number reaches around 10k+, a phenomenon called "Semantic Collapse" occurs.
https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Halluc...
So you're telling me rampancy ( https://www.halopedia.org/Rampancy ) is real.
> Specifically when the document number reaches around 10k+
Where are you getting this? I just read the paper and I'm not seeing it; interested to learn more.
The RAG setup the GP used suffered from semantic collapse.
Gemini with Google search is RAG using all public data, and it isn't broken.
It's not tool use with natural language search queries? That's what I'd expect.
It's RAG via tool use, where the storage and retrieval method is an implementation detail.
I'm not a huge fan of the term RAG though because if you squint almost all tool use could be considered RAG.
But if you stick with RAG being a form of "knowledge search" then I think Google search easily fits.
It is tool use with natural-language search queries, but one layer down they're run against a vector DB, very similar to RAG. Essentially, Google RankBrain is a distant ancestor of RAG, before compute and scaling.
Can't you make the thresholds higher?
Hmm... I guess not, you might want all that data.
Super interesting topic. Learning a lot.