Some people start their prompts with "Hello" or "Please" or something similar, out of some habitual sense of politeness, I think. It would be hilarious if those prompts really work better because the model can use those words as attention sinks.
One point that Karpathy has made in some of his videos is that using additional tokens in the prompt can facilitate computation: the model does a roughly fixed amount of work per token, so extra tokens give it more places to stash intermediate results. If you ask a transformer to do some basic math, it will be more likely to get the right answer (or at least a better approximation) with a more verbose prompt. To me, this backs up the use of more conversational language ("Please," etc.) when prompting.
However, that seems to be contradicted by what was shown recently with the successful International Math Olympiad effort. Their prompts, such as https://github.com/aw31/openai-imo-2025-proofs/blob/main/pro..., were very terse. It's hard to tell where the prompt stops and the CoT response starts, in fact.
So there is probably some interplay between the need for attention sinks and the use of step-by-step reasoning. It might not be too surprising if the latter works because it's an indirect way to optimize the former.
I wonder if the model could also just make its own sink tokens if the prompt doesn't have any. E.g. if the model first emits some "fluff" like "The answer to this question is:" before starting with the actual answer, it could use those tokens as attention sinks. Same with "thinking tokens" that don't directly contribute to the answer or invisible formatting tokens, etc.
Good thought, that indeed works: https://arxiv.org/abs/2310.02226
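For anyone who wants to see the mechanism, here's a toy numpy sketch of my own (an illustration, not code from the paper): softmax attention has to distribute exactly 1.0 of mass across the available keys, so a low-content token whose key soaks up that mass gives irrelevant attention somewhere harmless to go.

    import numpy as np

    def attention_weights(query, keys):
        # Scaled dot-product attention weights for a single query vector.
        scores = keys @ query / np.sqrt(len(query))
        exp = np.exp(scores - scores.max())  # numerically stable softmax
        return exp / exp.sum()

    rng = np.random.default_rng(0)
    d = 16
    query = rng.normal(size=d)
    content_keys = rng.normal(size=(4, d))  # keys for four "real" prompt tokens

    # No sink: the full 1.0 of attention mass is split across content tokens,
    # whether or not any of them is actually relevant to this query.
    print(attention_weights(query, content_keys))

    # With a sink: pretend the model has learned a sink key that scores highly
    # against typical queries (faked here by aligning it with the query).
    # Most of the mass now lands on the sink instead of the content tokens.
    sink_key = 3.0 * query / np.linalg.norm(query)
    print(attention_weights(query, np.vstack([sink_key, content_keys])))

It's a cartoon of the real thing (in a trained model the sink behavior emerges on its own), but it shows why giving the model a few throwaway tokens to attend to can help.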
True, along with "You're absolutely right! What an insightful observation. You're going places, bro," yadda yadda yadda.
It would be amusing if all that gratuitous sycophancy actually helped with inference accuracy. It would also be worth treating that as a bug to be fixed, of course.
> It's hard to tell where the prompt stops and the CoT response starts, in fact.
That's because you're looking at the final output, which includes neither the prompt nor the intermediate chain of thought.
Good point -- I can see that, but it all ends up in the same context, anyway. Point being, the model seems to prefer to conserve tokens.
That said, now I'm wondering if all those dashes it spews out are more than just window dressing.