One thing to keep in mind is that the core of an LLM is basically a (non-deterministic) stateless function that takes text as input and gives text as output.

The chat and session interfaces obscure this, making it look more stateful than it is. But they mainly just send the whole chat so far back to the LLM to get the next response. That's why context usage grows as a chat/session continues. It's also why the answers tend to get worse with longer contexts: you're giving the LLM a lot more to sift through.
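Here's a minimal sketch of what a chat interface is doing under the hood. The `call_llm` function is a hypothetical stand-in for whatever completion API you're using (an OpenAI- or Anthropic-style chat endpoint, for example); the point is that it's stateless, and all the "memory" lives in a list the client keeps and re-sends every turn.

```python
def call_llm(messages: list[dict]) -> str:
    """Hypothetical stateless completion call: messages in, text out."""
    raise NotImplementedError("replace with a real API call")


def chat_loop():
    history = []  # the only "state" lives here, on the client side
    while True:
        user_text = input("you> ")
        history.append({"role": "user", "content": user_text})

        # Every turn re-sends the ENTIRE history, so the request
        # (and the context the model has to sift through) grows each turn.
        reply = call_llm(history)

        history.append({"role": "assistant", "content": reply})
        print("llm>", reply)
```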

You can manage the context window manually instead. You'll potentially lose some efficiencies from prompt caching, but you can also keep your requests much smaller and more relevant, likely spending fewer tokens.
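One simple way to do that, as a sketch rather than the only approach, is a sliding window: keep the system prompt plus only the most recent turns, and drop the rest before each request. (Summarizing older turns or pulling in only the relevant snippets are other options.)

```python
def trim_history(history: list[dict], max_messages: int = 8) -> list[dict]:
    """Keep any system messages plus the last `max_messages` other messages."""
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    return system + rest[-max_messages:]


# In the chat loop above, instead of sending everything:
#     reply = call_llm(trim_history(history))
```

The trade-off is the one mentioned above: because the prefix of the request changes as old messages fall out of the window, the provider's prompt caching gets less effective, but each request stays small and focused.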