> It is much more token efficient
Is it? Aren't input tokens like 1000x cheaper than output tokens? That's why they can do this memory stuff in the first place.
What I mean is that you want the total number of tokens needed to convey the information to the LLM to be as small as possible. If you're having a back-and-forth discussion, the context will contain (perhaps incorrect) responses from the LLM, your corrections, and so on. All of that is wasteful and may even confuse the LLM. It's much better to pack all the information densely into the original message.
They're around 10x cheaper than output tokens, and about 100x if they're cached.
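A rough back-of-envelope sketch of why the multi-turn version still costs more despite cheap input tokens: every turn re-sends the growing history as input on top of the new output. The prices below are purely illustrative placeholders, not any provider's actual rates.

```python
# Illustrative (hypothetical) prices in USD per 1M tokens.
PRICES = {"output": 10.0, "input": 1.0, "cached_input": 0.10}

def cost(tokens: int, kind: str) -> float:
    """Cost in USD for `tokens` tokens of the given kind."""
    return tokens / 1_000_000 * PRICES[kind]

# One dense prompt: 2k input tokens, 500 output tokens.
dense = cost(2_000, "input") + cost(500, "output")

# A 4-turn discussion: the history (roughly 2k tokens per turn here)
# is re-sent as input each turn, and each turn also produces output.
chatty = sum(cost(2_000 * turn, "input") + cost(500, "output")
             for turn in range(1, 5))

print(f"dense prompt:  ${dense:.4f}")
print(f"4-turn thread: ${chatty:.4f}")
```

Even with input at a tenth of the output price, the re-sent history adds up, which is the "total tokens" point above rather than the per-token price.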