I wonder if the model could also just make its own sink tokens if the prompt doesn't have any. E.g. if the model first emits some "fluff" like "The answer to this question is:" before starting the actual answer, it could use those tokens as attention sinks. The same goes for "thinking tokens" that don't directly contribute to the answer, invisible formatting tokens, and so on.
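To make the intuition concrete: softmax attention has to put its probability mass somewhere, so a low-information position with a learned high score can soak up the excess when nothing in the window is actually relevant. A toy sketch (the 3.0 sink logit is made up purely for illustration, not from any paper):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical attention logits for one query over 8 keys, none of
# which is particularly relevant (all scores near zero).
scores = torch.randn(8) * 0.1

# Without a sink, the near-uniform softmax smears mass everywhere.
print(F.softmax(scores, dim=-1))       # ~0.125 per position

# Prepend a "sink" position with an (assumed) learned high score; it
# absorbs most of the mass, and the real keys get very little.
sink_logit = torch.tensor([3.0])       # illustrative value
with_sink = torch.cat([sink_logit, scores])
print(F.softmax(with_sink, dim=-1))    # sink gets ~0.7 of the mass
```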

Good thought; that indeed works: https://arxiv.org/abs/2310.02226
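For the curious, the linked paper trains with learnable "pause" tokens appended after the prompt, giving the model extra positions to compute over (and, per the parent's point, to dump attention on) before it has to commit to an answer. Roughly like this, with made-up token ids:

```python
# Schematic of the pause-token idea; PAUSE_ID and the prompt ids are
# hypothetical, chosen only to show the shape of the input.
PAUSE_ID = 50257                       # hypothetical learned <pause> id
prompt_ids = [101, 7592, 2088, 102]    # hypothetical tokenized prompt

def with_pauses(ids, n_pauses=10):
    # Model outputs at the pause positions are ignored during
    # decoding; generation starts after the last <pause>.
    return ids + [PAUSE_ID] * n_pauses

print(with_pauses(prompt_ids))         # prompt followed by 10 pauses
```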

True, along with "You're absolutely right! What an insightful observation. You're going places, bro," yadda yadda yadda.

It would be amusing if all that gratuitous sycophancy actually helped inference accuracy. Even then, it would be worth treating as a bug to be fixed, of course.