We're processing tenders for the construction industry - this comes with a 'free' bucket sort from the start, namely that people practically always operate only on a single tender.

Still, that single tender can be on the order of a billion tokens. Even if an LLM supported that insane context window, it's roughly 4GB that would need to be moved, and at current LLM prices, inference would cost thousands of dollars. I detailed this a bit more at https://www.tenderstrike.com/en/blog/billion-token-tender-ra...
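
Back-of-the-envelope, assuming roughly 4 bytes of raw text per token and an input price in the low dollars per million tokens (both figures are my own assumptions, not numbers from the linked post):

    # Rough back-of-the-envelope math (illustrative assumptions only).
    tokens = 1_000_000_000        # ~1 billion tokens in one large tender
    bytes_per_token = 4           # rough average for English text
    usd_per_m_input = 2.50        # assumed input price per 1M tokens

    payload_gb = tokens * bytes_per_token / 1e9
    cost_usd = tokens / 1e6 * usd_per_m_input
    print(f"~{payload_gb:.0f} GB to move, ~${cost_usd:,.0f} per full-context query")
    # -> ~4 GB to move, ~$2,500 per full-context query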

And that's just one (though granted, a very large) tender.

For the corpus of a larger company, you'd probably be looking at trillions of tokens.

While I agree that delivering tiny, chopped-up pieces of context to the LLM may no longer be a good strategy, sending thousands of ultimately irrelevant pages isn't one either, and embeddings definitely give you a much better search experience than classic BM25 text search alone.
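
To make the "embeddings plus BM25" point concrete, here's a rough hybrid-retrieval sketch; the libraries, model name, sample documents, and the 50/50 weighting are my own choices for illustration, not anything from the posts above:

    # Hybrid search sketch: BM25 for lexical recall, embeddings for semantic rerank.
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer, util

    docs = [
        "Section 3.2: concrete strength requirements for load-bearing walls ...",
        "Appendix B: electrical wiring standards ...",
        "General terms and conditions for bid submission ...",
    ]

    bm25 = BM25Okapi([d.lower().split() for d in docs])
    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_emb = model.encode(docs, convert_to_tensor=True)

    def search(query, k=2, alpha=0.5):
        # Blend normalized BM25 scores with embedding cosine similarity.
        lex = bm25.get_scores(query.lower().split())
        sem = util.cos_sim(model.encode(query, convert_to_tensor=True), doc_emb)[0]
        lex_max = max(float(max(lex)), 1e-9)
        blended = [alpha * (l / lex_max) + (1 - alpha) * float(s)
                   for l, s in zip(lex, sem)]
        return sorted(zip(blended, docs), reverse=True)[:k]

    print(search("required concrete strength"))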

I work at an AI startup, and we've explored a solution where we preprocess documents to produce a short summary of each one, then provide those summaries along with a tool-call instruction to the bot so it can decide which document is relevant. This seems to scale to a few hundred documents of 100k-1M tokens each, but beyond that we run into issues with context window size and context rot. I've thought about extending this into a tree-based structure, kind of like an LLM file system, but have other priorities at the moment.
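
A minimal sketch of that pattern, using OpenAI-style tool calling purely as an example backend; the document ids, summaries, and model name are made up, and the actual setup may look quite different:

    # Sketch of "summaries + tool call" document routing.
    import json
    from openai import OpenAI

    client = OpenAI()

    # Precomputed one-paragraph summaries, keyed by document id (made-up data).
    summaries = {
        "manual_pump_x200": "Installation and maintenance manual for the X200 pump series.",
        "manual_hvac_basics": "Commissioning guide for rooftop HVAC units.",
    }

    tools = [{
        "type": "function",
        "function": {
            "name": "open_document",
            "description": "Load the full text of a document into context.",
            "parameters": {
                "type": "object",
                "properties": {"doc_id": {"type": "string", "enum": list(summaries)}},
                "required": ["doc_id"],
            },
        },
    }]

    messages = [
        {"role": "system",
         "content": "Pick the document most relevant to the user's question.\n"
                    + "\n".join(f"{k}: {v}" for k, v in summaries.items())},
        {"role": "user", "content": "What torque spec applies to the X200 impeller bolts?"},
    ]

    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
    call = resp.choices[0].message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
    # e.g. open_document {'doc_id': 'manual_pump_x200'}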

Embeddings had some context-size limitations in our case - we were looking at large technical manuals. Gemini was the first to have a 1M context window, but for some reason its embedding window is tiny. I suspect the embeddings might start to break down when there's too much information.
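
To put numbers on that mismatch (the ~2k-token embedding input limit below is an assumed ballpark, not an official figure): a single large manual has to be split into hundreds of independently embedded chunks, so whatever document-level meaning survives depends entirely on how you aggregate them.

    # A 1M-token manual vs. an assumed ~2k-token embedding input limit.
    manual_tokens = 1_000_000
    embed_window = 2_048

    chunks_needed = -(-manual_tokens // embed_window)   # ceiling division
    print(chunks_needed)   # -> 489 separate chunk embeddings for one manual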

For anyone unfamiliar, construction tenders are part of the project bidding process: they appear to be a structured, formal way for contractors to submit bids on large projects.