Docs bots like these are deceptively hard to get right in production. Retrieval is super sensitive to how you chunk/parse documentation and how you end up structuring documentation in the first place (see frontpage post from a few weeks ago: https://news.ycombinator.com/item?id=44311217).
You want grounded RAG systems like Shopify's here to rely strongly on the underlying documents, but also still sprinkle a bit of the magic of the latent LLM knowledge too. The only way to get that balance right is evals. Lots of them. It gets even harder when you are dealing with GraphQL schema like Shopify has since most models struggle with that syntax moreso than REST APIs.
FYI I'm biased: Founder of kapa.ai here (we build docs AI assistants for +200 companies incl. Sentry, Grafana, Docker, the largest Apache projects etc).
Why do you say “deceptively hard” instead of “fundamentally impossible”? You can increase the probability it’ll give good answers, but you can never guarantee it. It’s then a question of what degree of wrongness is acceptable, and how you signal that. In this specific case, what it said sounds to me (as a Shopify non-user) entirely reasonable, it’s just wrong in a subtle but rather crucial way, which is also mildly tricky to test.
A human answering every question is also not guaranteed to give good answers; anyone that has communicated with customer service knows that. So calling it impossible may be correct, but not useful.
(We tend to have far fewer evals for such humans though.)
Is that not the very reason to have documentation in the first place, the fact that it is not a human?
A human will tell you “I am not sure, and will have to ask engineering and get back to you in a few days”. None of these LLMs do that yet, they’re biased towards giving some answer, any answer.
I agree with you, but man i can't help but feel humans are the same depending on the company. My wife was recently fighting with several layers of comcast support over cap changes they've recently made. Seemingly it's a data issue since it's something new that theoretically hasn't propagated through their entire support chain yet, but she encountered a half dozen confidently incorrect people which lacked the information/training to know that they're wrong. It was a very frustrating couple hours.
Generally i don't trust most low paid (at no fault of their own) customer service centers anymore than i do random LLMs. Historically their advice for most things is either very biased, incredibly wrong, or often both.
In the case of unhelpful human support, I can leverage my experience in communicating with another human to tell if I'm being understood or not. An LLM is much more trial-and-error: I can't model the theory-of-mind behind it's answers to tell if I'm just communicating poorly or whatever else may be being lost in translation, there is no mind at play.
That's fair, though with an LLM (at least one you're familiar with) you can shape it's behavior. Which is not too different compared to some black box script that i can't control or reason through with a human support. Granted the LLM will have the same stupid black box script, so in both cases it's weaponized stupidity against the consumer.
This is not really true. If you give a decent model docs in the prompt and tell them to answer based on the docs and say “I don’t know” if the answer isn’t there, they do it (most of the time).
> most of the time
This is doing some heavy lifting
I have never seen this in the wild. Have you?
Yes. All the time. I wrote a tool that does it!
https://crespo.business/posts/llm-only-rag/
I need a big ol' citation for this claim, bud, because it's an extraordinary one. LLMs have no concept of truth or theory of mind so any time one tells you "I don't know" all it tells you is that the source document had similar questions with the answer "I don't know" already in the training data.
If the training data is full of certain statements you'll get certain sounding statements coming out of the model, too, even for things that are only similar, and for answers that are total bullshit
Do you use LLMs often?
I get "I don't know" answers from Claude and ChatGPT all the time, especially now that they have thrown "reasoning" into the mix.
Saying that LLMs can't say "I don't know" feels like a 2023-2024 era complaint to me.
Ok, how? The other day Opus spent 35 of my dollars by throwing itself again and again at a problem it couldn't solve. How can I get it to instead say "I can't solve this, sorry, I give up"?
That sounds slightly different from "here is a question, say I don't know if you don't know the answer" - sounds to me like that was Opus running in a loop, presumably via Claude Code?
I did have one problem (involving SQLite triggers) that I bounced off various LLMs for genuinely a full year before finally getting to an understanding that it wasn't solvable! https://github.com/simonw/sqlite-chronicle/issues/7
It wasn't in a loop really, it was more "I have this issue" "OK I know exactly why, wait" $3 later "it's still there" "OK I know exactly why, it's a different reason, wait", repeat until $35 is gone and I quit.
I would have much appreciated if it could throw its hands up and say it doesn't know.
I solve this by in my prompt. I say if you can’t fix it in two tries look online on how to do it if you still can’t fix it after two tries pause and ask for my help. It works pretty well.
Won't that be cool, when LLM-based AIs ask you for help instead of the other way around
You’re right that some humans will, and most LLMs won’t. But humans can be just as confidently wrong. And we incentivize them to make decisions quickly, in a way that costs the company less money.
Documentation is the thing we created because humans are forgetful and misunderstand things. If the doc bot is to be held to a standard more like some random discord channel or community forum, it should be called something without “doc” in the name (which, fwiw, might just be a name the author of the post came up with, I dunno what Shopify calls it).
This is to move the goal posts /raise a different issue. We can engage with the new point, but this is to concede that Docs bots are not docs bots.
Why RAG at all?
We concatenated all our docs and tutorials into a text file, piped it all into the AI right along with the question, and the answers are pretty great. Cost was, last I checked, roughly 50c per question. Probably scales linearly with how much docs you have. This feels expensive but compared to a human writing an answer it's peanuts. Plus (assuming the customer can choose to use the AI or a human), it's great customer experience because the answer is there that much faster.
I feel like this is a no-brainer. Tbh with the context windows we have these days, I don't completely understand why RAG is a thing anymore for support tools.
This works as long as your docs are below the max context size (and even then, as you approach larger context sizes, quality degrades).
Re cost though, you can usually reduce the cost significantly with context caching here.
However, in general, I’ve been positively surprised with how effective Claude Code is at grep’ing through huge codebases.
Thus, I think just putting a Claude Code-like agent in a loop, with a grep tool on your docs, and a system prompt that contains just a brief overview of your product and brief summaries of all the docs pages, would likely be my go to.
Oh man, maybe this would cause people to write docs that are easy to grep through. Let’s start up that feedback loop immediately, please.
How will you grep synonyms or phrases with different word choices?
I’m hoping that the documentation will be structured in a way such that Claude can easily come up with good grep regexes. If Claude can do it, I can probably do it only a little bit worse.
What you describe sounds like poor man's RAG. Or lazy man's. You're just doing the augmentation at each prompt.
With RAG the cost per question would be low single-digit pennies.
Accuracy drops hard with context length still. Especially in more technical domains. Plus latency and cost.
That is not particularly cheap, especially since it scales linearly with doc size, and therefore time.
Additionally the quality of loading the context-window decreases linearly as well, just because your model can handle 1M tokens it doesn't mean that it WILL remember 1M tokens, it just means that it CAN
RAG fixes this, in the simplest configuration a RAG can be an index, and the only context you give the LLM is the table of contents, and you let it search through the index.
Should it be a surprise that this is cheaper and more efficient? Loading the context window is like a library having every book open at every page at the same time instead of using the dewey decimal system
What you described is RAG. Inefficient RAG, but still RAG.
And it's inefficient in two ways-
-you're using extra tokens for every query, which adds up.
-you're making the LLM less precise by overloading it with potentially irrelevant extra info making it harder for it to needle in a haystack the specific relevant answer.
Filtering (e.g. embedding similarity & BM25) and re-ranking/pruning what you provide to RAG is an optimization. It optimizes the tokens, the processing time, and optimizes the answer in an ideal world. Most LLMs are far more effective if your RAG is limited to what is relevant to the question.
I don't think it's RAG, RAG is specifically separating the search space from the LLM context-window or training set and giving the LLM tools to search in inference-time.
In this case their Retrieval stage is "SELECT *", basically, so sure I'm being loose with the terminology, but otherwise it's just a non-selective RAG. Okay ..AG.
RAG is selecting pertinent information to supply to the LLM with your query. In this case they decided that everything was pertinent, and the net result is just reduced efficiency. But if it works for them, eh.
I'm not sure we are talking about the same thing. The root comment talks about concatenating all doc files into a loong text string, and adding that as a system/user prompt to the LLM at inference time before the actual question.
You mention the retrieval stage being a SELECT *? I don't think there's any SQL involved here.
I was being rhetorical. The R in RAG is filtering augmentation data (the A) for things that might or might not be related to the query. Including everything is just a lazy form of RAG -- the rhetorical SELECT *.
>and adding that as a system/user prompt to the LLM at inference time
You understand this is all RAG is, right? RAG is any additional system to provide contextually relevant (and often more timely) supporting information to a baked model.
Anyways, I think this thread has reached a conclusion and there really isn't much more value in it. Cheers.
Because llms still suck at actually using all that context at once. And surely you can see yourself that your solution doesn't scale. It's great that it works for your specific case but I'm sure you can come up with a scenario where it's just not feasible.
Indeed. Dabbling in 'RAG' (which for better or worse has become a tag for anything context retrieval) for more complex documentation and more intricate questions, you will very quickly realize that you really need to go far beyond simple 'chunking', and end up with a subsystem that constructs more than one very intricate knowledge graphs for supporting different kinds of questions the users might ask. For example: a simple question such as "What exactly is an 'Essential Entity'? is better handled by Knowledge Representation A as opposed to "Can you provide a gap and risk analysis on my 2025 draft compliance statement (uploaded) in light of the current GDPR, NIS-2 and the AI Act?"
(My domain is regulatory compliance, so maybe this goes beyond pure documentation but I'm guessing pushed far enough the same complexities arise)
“It’s just a chat bot Michael, how much can it cost?”
A philosophy degree later…
I ended up just generating a summary of each of our 1k docs, using the summaries for retrieval, running a filter to confirm the doc is relevant, and finally using the actual doc to generate an answers.
This is sort of hilarious; to use an LLM as a good search interface first build.. a search engine.
I guess this is why Kagi Quick Answer has consistently been one of the best AI tools I use. The search is good, so their agent is getting the best context for the summaries. Makes sense.
It is building a system that amplifies the strengths of the LLM by feeding it the right knowledge in the right format at inference time. Context design is both a search (as a generic term for everything retrieval) and a representation problem.
Just dumping raw reams of text into the 'prompt' isn't the best way to great results. Now I am fully aware that anything I can do on my side of the API, the LLM provider can and eventually will do as well. After all, Search also evolved beyond 'pagerank' to thousands of specialized heuristic subsystems.