Yes, your last paragraph is absolutely the key to great output: instead of entering a discussion, refine the original prompt. It is much more token efficient, and gets rid of a lot of noise.
I often start out with “proceed by asking me 5 questions that reduce ambiguity” or something like that, and then refine the original prompt.
It seems like we’re all discovering similar patterns for how best to interact with LLMs.
We sure are. We are all discovering context rot on our own timelines. One thing that has really helped me when working with LLMs is to notice when the conversation begins looping on itself, then ask the model to summarize all pertinent information and turn it into a prompt for continuing in a new conversation. I review the prompt it gives me, edit it, and paste it into a fresh chat. This approach keeps context rot in check and gets me much better responses.
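If you're doing this against the API rather than a chat UI, the handoff step is roughly the following. This is just a minimal sketch with the Anthropic Python SDK; the model name and the example `history` are placeholders, not anything specific to my setup:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The existing conversation that has started looping on itself (made-up example).
history = [
    {"role": "user", "content": "Help me design a rate limiter for our API."},
    {"role": "assistant", "content": "Sure - a token bucket is a good fit..."},
    # ... many more turns ...
]

# Ask the model to produce the continuation prompt.
handoff_request = {
    "role": "user",
    "content": (
        "Summarize all pertinent information from this conversation and "
        "write it as a single self-contained prompt I can paste into a "
        "new conversation to continue this work."
    ),
}

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder model name
    max_tokens=2048,
    messages=history + [handoff_request],
)

# Review and edit this before pasting it into the fresh chat.
print(response.content[0].text)
```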
The trick to doing this well is to split the prompt into the parts that will change and the parts that won't. If you're providing context like code, first have the model read all of that, then (in a new message) give it your instructions. That way the context is written to the cache and you can reuse it even while you're editing your core prompt.
If you make this one message, it's a cache miss / write every time you edit.
You can edit 10 times for the price of one this way. (Due to cache pricing)
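In API terms this maps onto prompt caching with an explicit breakpoint: the stable context goes before the `cache_control` marker and the instructions you keep editing go after it, so edits don't invalidate the cached prefix. A minimal sketch with the Anthropic Python SDK (the file path and model name are placeholders):

```python
import anthropic

client = anthropic.Anthropic()

# Big, stable context (e.g. the code you want the model to read).
context = open("src/main.py").read()

def ask(instructions: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # placeholder model name
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    # Stable part first, with a cache breakpoint after it.
                    {
                        "type": "text",
                        "text": f"Here is the code to work with:\n\n{context}",
                        "cache_control": {"type": "ephemeral"},
                    },
                    # The part you keep editing goes after the breakpoint,
                    # so changing it doesn't invalidate the cached prefix.
                    {"type": "text", "text": instructions},
                ],
            }
        ],
    )
    return response.content[0].text

# Each call reuses the cached context; only the instructions change.
print(ask("Summarize what this module does."))
print(ask("Now list the functions with no test coverage."))
```

Note the cached prefix has to clear the model's minimum cacheable length (around 1024 tokens for Sonnet-class models, if I remember right), and the cache entry expires after a few minutes of inactivity, so this pays off most during an active editing session.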
Is Claude caching by whole message only? Pretty sure OpenAI caches up to the first differing character.
Interesting. Claude places explicit cache breakpoints. Afaik there's no way to do it mid-message.
I believe (but am not positive) there are 4 breakpoints:
1. End of tool definitions
2. End of system prompt
3. End of messages thread
4. (Least sure) 50% of the way through messages thread?
This is how I've seen it done in open source projects, and it seems optimal given the constraints of the Anthropic API (max 4 breakpoints).
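Concretely, the placement looks something like this. A sketch with the Anthropic Python SDK; the tool, system prompt, and messages are made up, and breakpoint 4's mid-thread position is the part I'm least sure about:

```python
import anthropic

client = anthropic.Anthropic()
CACHE = {"cache_control": {"type": "ephemeral"}}  # marks a cache breakpoint

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder model name
    max_tokens=1024,
    tools=[
        {
            "name": "read_file",  # made-up example tool
            "description": "Read a file from the workspace.",
            "input_schema": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
            **CACHE,  # breakpoint 1: end of tool definitions
        },
    ],
    system=[
        # breakpoint 2: end of system prompt
        {"type": "text", "text": "You are a coding assistant.", **CACHE},
    ],
    messages=[
        {
            "role": "user",
            "content": [
                # breakpoint 4 (?): partway through the messages thread
                {"type": "text", "text": "Here is the codebase: ...", **CACHE},
            ],
        },
        {"role": "assistant", "content": "Read it. What should I do?"},
        {
            "role": "user",
            "content": [
                # breakpoint 3: end of messages thread
                {"type": "text", "text": "Refactor the parser module.", **CACHE},
            ],
        },
    ],
)
```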
> It is much more token efficient
Is it? Aren't input tokens like 1000x cheaper than output tokens? That's why they can do this memory stuff in the first place.
What I mean is that you want the total number of tokens needed to convey the information to the LLM to be as small as possible. If you’re having a discussion, you’ll have (perhaps incorrect) responses from the LLM in there, have to correct it, etc. All of this is wasteful, and may even confuse the LLM. It’s much better to ensure all the information is densely packed in the original message.
They're around 5x cheaper than output (for Claude at least), and cache reads are another 10x cheaper than regular input, so roughly 50x.
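To put rough numbers on that, including the earlier "edit 10 times for the price of one" point. The prices below are assumptions (Claude 3.5 Sonnet-style), so check the current price sheet:

```python
# Assumed prices per million tokens (Claude 3.5 Sonnet-style; verify before relying on them).
INPUT, OUTPUT = 3.00, 15.00
CACHE_WRITE, CACHE_READ = 3.75, 0.30

context_tokens = 50_000  # big stable prefix (code, docs)
edit_tokens = 500        # the instructions you keep rewriting
edits = 10

# Without caching: the whole prefix is billed as fresh input on every edit.
uncached = edits * (context_tokens + edit_tokens) * INPUT / 1e6

# With caching: pay the cache write once, then cheap cache reads on re-sends.
cached = (context_tokens * CACHE_WRITE
          + (edits - 1) * context_tokens * CACHE_READ
          + edits * edit_tokens * INPUT) / 1e6

print(f"input is {OUTPUT / INPUT:.0f}x cheaper than output, "
      f"cached reads are {OUTPUT / CACHE_READ:.0f}x cheaper")
print(f"10 edits uncached: ${uncached:.2f}, cached: ${cached:.2f}")
# Roughly $1.50 vs $0.34 here; each edit after the first costs about
# $0.017 with the cache vs about $0.15 without, i.e. ~10x cheaper at the margin.
```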