It would be interesting to know whether AI is less likely to follow rules when the instructions contain foul or demeaning language. Too bad we couldn't replay the scenario replacing NEVER F*ING GUESS! with:
**Never guess**
- All behavioral claims must be derived from source, docs, tests, or direct command output.
- If you cannot point to exact evidence, mark it as unknown.
- If a signature, constant, env var, API, or behavior is not clearly established, say so.
Underrated comment here. https://www.anthropic.com/research/emotion-concepts-function This study convinced me to be "nice" to AI agents. As I understood it, there's something in the weights such that activating a "desperate" vector makes the model more likely to cheat or cut corners. So yes, I would err towards your suggested prompt over NEVER FUCKING GUESS.