Surely the system prompt is cached across accounts?

You can cache the K and V matrices, but with such a huge cached prefix you'll still pay a lot of compute on attention, since every new token attends over all the cached keys, even if the user just adds a five-word question.
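A toy NumPy sketch of that point (single head, made-up sizes, no real model): even with K and V for the prefix cached, each newly generated token still does a dot product against every cached key.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64           # head dimension (toy size)
n_cached = 1000  # tokens already in the KV cache (e.g. a long system prompt)

# K and V for the cached prefix are stored once and reused.
K_cache = rng.standard_normal((n_cached, d))
V_cache = rng.standard_normal((n_cached, d))

# A single new token still attends over ALL cached keys:
q = rng.standard_normal(d)
scores = K_cache @ q / np.sqrt(d)        # n_cached dot products
weights = np.exp(scores - scores.max())  # softmax over the whole cache
weights /= weights.sum()
out = weights @ V_cache                  # weighted sum over n_cached values

# Per new token the attention cost is O(n_cached * d), so a long
# prompt costs compute on every generated token, cache or not.
print(out.shape)  # (64,)
```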

The state of the system can be cached after the system prompt is processed, and all new chats can start from that state. O(n^2) is not great, but apparently it's fine at these context lengths, and I'm sure this is a factor in their minimum prompt cost. Advances like grouped-query attention (which shrinks the KV cache) or sparse attention will hopefully tame that quadratic term eventually.
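Back-of-the-envelope arithmetic on why caching that state helps despite the quadratic term (the token counts here are made up for illustration): with a system prompt of s tokens cached, prefilling a u-token question only needs u rows of new attention scores instead of the full (s+u) x (s+u) matrix.

```python
s, u = 10_000, 5  # hypothetical system-prompt tokens, user-question tokens

full_prefill = (s + u) ** 2    # recompute everything: (s+u) x (s+u) scores
cached_prefill = u * (s + u)   # reuse cached prefix: only u query rows

# Attention scores computed are cut by roughly (s+u)/u:
print(full_prefill // cached_prefill)  # 2001
```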

That's not how it works. The system prompt doesn't "get calculated first" or anything. You concatenate it with the user prompt and run generation of the first new token on the combined sequence, which basically boils down to one huge matmul that runs in parallel. So you can literally cache part of the input matrices for that first step, and then you'll very quickly run into the n^2 complexity anyway.
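A minimal sketch of what "caching part of the input matrices" could look like (one toy layer, one head; the projection weights and sizes are invented, not any real model's): the prefix's K/V are computed once, and each new chat only projects its own user tokens before the combined attention runs.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))

def kv(x):
    """Project token embeddings to K and V (single toy layer/head)."""
    return x @ Wk, x @ Wv

system = rng.standard_normal((100, d))  # system-prompt embeddings
K_sys, V_sys = kv(system)               # computed once, shared by all chats

def prefill(user):
    # New chat: only the user tokens need fresh K/V projections.
    K_u, V_u = kv(user)
    K = np.vstack([K_sys, K_u])
    V = np.vstack([V_sys, V_u])
    # The user queries still attend over the full (cached + new) key set,
    # so the quadratic-in-length attention cost remains.
    return K, V

K, V = prefill(rng.standard_normal((5, d)))
print(K.shape)  # (105, 32)
```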

I would assume so too, so the costs to Anthropic would not be that substantial.